Let's say I have a string:
char theString[] = "你们好āa";
Given that my encoding is UTF-8, this string is 12 bytes long (the three hanzi characters are three bytes each, the Latin character with the macron is two bytes, and the 'a' is one byte):
strlen(theString) == 12
How can I count the number of characters? How can I do the equivalent of subscripting so that:
theString[3] == "好"
How can I slice and concatenate such strings?
Best Answer
You only count the bytes that do not have their top two bits set to 10 (i.e., everything less than 0x80 or greater than 0xbf).

That's because all the bytes with the top two bits set to 10 are UTF-8 continuation bytes.

See here for a description of the encoding and how strlen can work on a UTF-8 string.

For slicing and dicing UTF-8 strings, you basically have to follow the same rules. Any byte starting with a 0 bit or a 11
sequence is the start of a UTF-8 code point; all others are continuation bytes.

Your best bet, if you don't want to use a third-party library, is to simply provide functions, taking a length sz and a position pos, to get, respectively:

- the first sz UTF-8 bytes of a string.
- sz UTF-8 bytes of a string, starting at pos.
- the rest of the UTF-8 bytes of a string, starting at pos.

This will be a decent building block to be able to manipulate the strings sufficiently for your purposes.
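
The rules above can be sketched in C roughly as follows. This is a minimal sketch, not a definitive implementation: the names utf8_is_start, utf8_len, utf8_offset and utf8_mid are hypothetical, the input is assumed to be valid, NUL-terminated UTF-8, and positions and lengths are assumed to be counted in code points (zero-based) rather than raw bytes.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: a byte starts a code point unless its top two
   bits are 10 (i.e., unless it lies in the range 0x80..0xbf). */
static int utf8_is_start(unsigned char c)
{
    return (c & 0xC0) != 0x80;
}

/* Count code points by counting every byte that is not a continuation
   byte. Assumes valid, NUL-terminated UTF-8. */
size_t utf8_len(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++)
        if (utf8_is_start((unsigned char)*s))
            n++;
    return n;
}

/* Byte offset of the code point with zero-based index `index`. If the
   string has fewer code points, returns the offset of the NUL. */
size_t utf8_offset(const char *s, size_t index)
{
    size_t i = 0;
    while (s[i] != '\0') {
        if (utf8_is_start((unsigned char)s[i])) {
            if (index == 0)
                break;          /* found the start of the wanted code point */
            index--;
        }
        i++;
    }
    return i;
}

/* Copy up to `sz` code points of `src`, starting at code point `pos`,
   into `dest`, which holds `destsize` bytes including the NUL.
   Returns the number of bytes copied (excluding the NUL), or
   (size_t)-1 if the result does not fit. */
size_t utf8_mid(char *dest, size_t destsize,
                const char *src, size_t pos, size_t sz)
{
    size_t start = utf8_offset(src, pos);
    size_t nbytes = utf8_offset(src + start, sz);

    if (nbytes + 1 > destsize)
        return (size_t)-1;
    memcpy(dest, src + start, nbytes);
    dest[nbytes] = '\0';
    return nbytes;
}
```

With the string from the question, utf8_len(theString) gives 5 while strlen(theString) gives 12, and utf8_mid(buf, sizeof buf, theString, 2, 1) copies the three bytes of 好 (index 2, counting from zero). Passing 0 for pos gives the "left" variant, and passing (size_t)-1 for sz gives the "rest" variant. Concatenation needs nothing special: strcat joins whole byte sequences, so it never splits a code point.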