As a sideways comment to those who are interested in such details -- wall of text follows:
C actually has two string types: ordinary strings, and wide strings. They are both simply unspecified-length arrays, terminated by a zero value ('\0' and L'\0', respectively).
(Because it can be unclear whether by zero one means code point zero or the zero digit character, I like to call this value nul. In comparison, the zero pointer value I call NULL, with the length of the final consonant separating them in everyday speech.)
For ordinary strings, each character in the array is of type char; note, however, that in C a character constant like 'X' has type int, not char.
For wide strings, each character in the array is of type wchar_t. The related type wint_t is simply a type that can hold any wchar_t value, plus the WEOF value (indicating end-of-stream for wide character streams).
C99 added support for specifying Unicode code points in both ordinary and wide character constants and string literals, using the universal character names \uHHHH or \UHHHHHHHH, where HHHH and HHHHHHHH are the code point in hexadecimal.
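For instance, here is a minimal sketch (assuming the execution character sets can represent these code points, as UTF-8 and UTF-16/UTF-32 ones can):

```c
#include <stdio.h>
#include <string.h>
#include <wchar.h>

int main(void)
{
    /* U+00D6 is 'Ö' and U+20AC is the euro sign '€'. */
    const char    *s  = "\u00D6baut 2.50\u20AC";   /* ordinary string */
    const wchar_t *ws = L"\u00D6baut 2.50\u20AC";  /* wide string */

    printf("\"%s\" is %zu chars long; the wide version is %zu wchar_t long.\n",
           s, strlen(s), wcslen(ws));
    return 0;
}
```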
The exact character set used for ordinary and wide character constants and string literals is a bit of a complex issue. In practice nowadays, the ordinary character set is ASCII compatible, either UTF-8 or one of the 8-bit ASCII-compatible character sets. The character set used for wide characters is even messier, partly because the Microsoft C library used in Windows uses UTF-16, where code points outside the Basic Multilingual Plane require two wide characters (a surrogate pair); I'm not exactly clear which Windows versions and libraries actually handle that, and which are limited to the first 65536 code points of the Unicode set.
For POSIXy systems -- that means Linux, *BSDs, Mac OS, Android, and some other esoteric systems -- the C library provides the iconv conversion facilities. These can convert, at run time, between various character sets (using ordinary strings), and to/from wide character strings, using a very simple but efficient conversion interface.
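Here is a minimal sketch of the interface, converting ISO-8859-1 input to UTF-8 (the encoding names here are assumptions; check what your C library's iconv actually accepts, e.g. via `iconv --list`):

```c
#include <stdio.h>
#include <string.h>
#include <iconv.h>

int main(void)
{
    char    in[]  = "\xD6" "baut";          /* "Öbaut" in ISO-8859-1 */
    char    out[64];
    char   *inp   = in,  *outp = out;
    size_t  inlen = strlen(in), outlen = sizeof out - 1;

    iconv_t cd = iconv_open("UTF-8", "ISO-8859-1");
    if (cd == (iconv_t)-1) {
        perror("iconv_open");
        return 1;
    }
    /* iconv() advances the pointers and decrements the counts as it goes. */
    if (iconv(cd, &inp, &inlen, &outp, &outlen) == (size_t)-1) {
        perror("iconv");
        iconv_close(cd);
        return 1;
    }
    *outp = '\0';
    printf("Converted: %s\n", out);
    iconv_close(cd);
    return 0;
}
```

For real-world input you would call iconv() in a loop, flushing the output buffer whenever it reports E2BIG.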
Standard C also contains wide character equivalents of the typical I/O and string functions -- wprintf()/printf(), fwprintf()/fprintf(), wscanf()/scanf(), wcslen()/strlen() -- and so on. (The only thing really missing is wide character equivalents of the POSIX getline() and getdelim(); you have to roll your own for those.)
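For example, a rolled-your-own wide-character getline() could look like the following sketch; wgetline() is just a name I made up, and it assumes you have called setlocale(LC_ALL, "") first, so that fgetwc() can decode multibyte input:

```c
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>
#include <sys/types.h>  /* ssize_t (POSIX) */

/* Hypothetical wide-character analog of POSIX getline():
   reads one line (including the L'\n', if any) from 'in',
   growing *lineptr as needed. Returns the number of wide
   characters read, or -1 on EOF or error. */
ssize_t wgetline(wchar_t **lineptr, size_t *n, FILE *in)
{
    size_t len = 0;
    wint_t wc;

    if (!lineptr || !n || !in)
        return -1;

    while ((wc = fgetwc(in)) != WEOF) {
        /* Grow the buffer, keeping room for the terminating L'\0'. */
        if (len + 2 > *n) {
            size_t   newsize = (len + 2) * 2;
            wchar_t *newptr  = realloc(*lineptr, newsize * sizeof *newptr);
            if (!newptr)
                return -1;
            *lineptr = newptr;
            *n = newsize;
        }
        (*lineptr)[len++] = (wchar_t)wc;
        if (wc == L'\n')
            break;
    }

    if (len == 0)
        return -1;  /* Nothing was read: end of stream, or an error. */

    (*lineptr)[len] = L'\0';
    return (ssize_t)len;
}
```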
But to be most practical, we should just use UTF-8 everywhere. (This is most important when dealing with internet-of-things gadgets and such.)
As an example, if you write a Linux/Mac/BSD/POSIXy program that states that it only works in UTF-8 locales, and the sources use UTF-8 encoding, you can use ordinary string literals that contain non-ASCII characters like "Öbaut 2.50€", and they will work fine. What will not work, however, is single-character non-ASCII literals like '€' or 'Ö', unfortunately. This is because non-ASCII characters in UTF-8 are encoded as 2 to 4 chars (bytes). However, if you write your code to consider substrings instead of single character constants, it is not a problem at all.
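As a minimal sketch of the substring approach (assuming both the source file and the runtime locale use UTF-8):

```c
#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *price = "Öbaut 2.50€";

    /* In UTF-8, "€" is the three-byte sequence 0xE2 0x82 0xAC.
       Plain strstr() works here, because UTF-8 is self-synchronizing:
       a multibyte sequence can never match in the middle of another
       character. */
    if (strstr(price, "€"))
        printf("Found a euro sign in \"%s\".\n", price);

    return 0;
}
```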
I do personally have a bit of a chip on my shoulder about Microsoft wrt. C11 and getline() and wide-character support. If MS hadn't made the mistake of assuming early on that 65536 characters would be enough for everyone (Unicode has 1,114,112 code points), we could have proper Unicode support standardized for C wide characters by now, with widget and file system access libraries having wide character interfaces. But enough of that: the world is what it is, and it is much better to be practical and robust than to whine about what could have been. Sorry about that.
In practice, you have two robust approaches to choose from, depending on the environment your C code will run in.
- Specify the character set the code uses. For some minimal gadgets that could be ASCII, but in general UTF-8 is the choice, as it supports all Unicode code points and therefore the vast majority of written human languages. (See the locale-check sketch after this list.)
- Use the user's locale character set for I/O, and the iconv facilities to convert to and from the internal character set, typically either wide characters or UTF-8.
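As a sketch of the first approach on POSIXy systems (nl_langinfo() is POSIX, not standard C), you can verify the user's locale before relying on UTF-8; for the second approach, the same CODESET string can be passed to iconv_open():

```c
#include <stdio.h>
#include <string.h>
#include <locale.h>
#include <langinfo.h>

int main(void)
{
    /* Adopt the user's locale, then check its character set. */
    setlocale(LC_ALL, "");
    if (strcmp(nl_langinfo(CODESET), "UTF-8") != 0) {
        fprintf(stderr, "Sorry, this program requires a UTF-8 locale.\n");
        return 1;
    }
    printf("UTF-8 locale detected; non-ASCII literals are safe to use.\n");
    return 0;
}
```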
If anyone is interested enough, I'd be happy to provide some example code for the various cases; just let me know of a specific situation you'd like to see.
(Full disclosure: I first encountered this problem in the late nineties, when implementing a localized web form for course feedback reports for students, using Windows, Mac (pre-OS X), and Linux machines. Internet Explorer in particular used the character set of the current user locale for non-ASCII characters, regardless of the form data. So, I developed hidden form fields with specific detector characters, to detect the actual character set the browser used for the input fields. I have worked on character set and localization issues a lot, in other words.)