How to count characters in a unicode string in C

asciicstringunicode

Lets say I have a string:

char theString[] = "你们好āa";

Given that my encoding is utf-8, this string is 12 bytes long (the three hanzi characters are three bytes each, the latin character with the macron is two bytes, and the 'a' is one byte:

strlen(theString) == 12

How can I count the number of characters? How can i do the equivalent of subscripting so that:

theString[3] == "好"

How can I slice, and cat such strings?

Best Answer

You only count the characters that have the top two bits are not set to 10 (i.e., everything less that 0x80 or greater than 0xbf).

That's because all the characters with the top two bits set to 10 are UTF-8 continuation bytes.

See here for a description of the encoding and how strlen can work on a UTF-8 string.

For slicing and dicing UTF-8 strings, you basically have to follow the same rules. Any byte starting with a 0 bit or a 11 sequence is the start of a UTF-8 code point, all others are continuation characters.

Your best bet, if you don't want to use a third-party library, is to simply provide functions along the lines of:

utf8left (char *destbuff, char *srcbuff, size_t sz);
utf8mid  (char *destbuff, char *srcbuff, size_t pos, size_t sz);
utf8rest (char *destbuff, char *srcbuff, size_t pos;

to get, respectively:

the left sz UTF-8 bytes of a string.
the sz UTF-8 bytes of a string, starting at pos.
the rest of the UTF-8 bytes of a string, starting at pos.

This will be a decent building block to be able to manipulate the strings sufficiently for your purposes.

This is the style that Microsoft tends to use in their examples.

It appears that the guidance in this area may have changed, as StyleCop now enforces the use of the C# specific aliases.

C++ – How to iterate over the words of a string

I use this to split string by a delimiter. The first puts the results in a pre-constructed vector, the second returns a new vector.

#include <string>
#include <sstream>
#include <vector>
#include <iterator>

template <typename Out>
void split(const std::string &s, char delim, Out result) {
    std::istringstream iss(s);
    std::string item;
    while (std::getline(iss, item, delim)) {
        *result++ = item;
    }
}

std::vector<std::string> split(const std::string &s, char delim) {
    std::vector<std::string> elems;
    split(s, delim, std::back_inserter(elems));
    return elems;
}

Note that this solution does not skip empty tokens, so the following will find 4 items, one of which is empty:

std::vector<std::string> x = split("one:two::three", ':');

Best Answer

Related Solutions

C# – the difference between String and string in C#

This is the style that Microsoft tends to use in their examples.

C++ – How to iterate over the words of a string

Related Topic