To expand on the answers others have given:
We've got lots of languages with lots of characters that computers should ideally display. Unicode assigns each character a unique number, or code point.
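To make the idea of "one number per character" concrete, here is a minimal Python sketch. The sample characters are my own picks, not anything from the question; `ord`, `chr` and `unicodedata.name` are standard Python.

```python
import unicodedata

# Four arbitrary sample characters, written as escapes so the source stays ASCII.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

# ord() maps a character to its Unicode code point (an integer).
code_points = [ord(ch) for ch in samples]

# chr() maps a code point back to the character.
round_trip = [chr(cp) for cp in code_points]

for ch, cp in zip(samples, code_points):
    # Print the conventional U+XXXX notation next to the official character name.
    print(f"U+{cp:04X}  {unicodedata.name(ch)}")
```

Note that nothing here says how the number is stored in memory or on disk; that is the job of the encodings discussed below.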
Computers deal with such numbers as bytes... skipping a bit of history here and ignoring memory addressing issues, 8-bit computers would treat an 8-bit byte as the largest numerical unit easily represented on the hardware, 16-bit computers would expand that to two bytes, and so forth.
Old character encodings such as ASCII are from the (pre-) 8-bit era, and try to cram the dominant language in computing at the time, i.e. English, into numbers ranging from 0 to 127 (7 bits). With 26 letters in the alphabet, both in capital and non-capital form, plus digits and punctuation signs, that worked pretty well. ASCII got extended by an 8th bit for other, non-English languages, but the additional 128 numbers/code points made available by this expansion would be mapped to different characters depending on the language being displayed. The ISO-8859 standards are the most common forms of this mapping, in particular ISO-8859-1 (also known as ISO-Latin-1 or latin1) and ISO-8859-15 (also known as Latin-9), which swaps out a handful of ISO-8859-1 characters, most notably to make room for the euro sign. And yes, there are several more parts of the ISO 8859 standard besides these two.
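The "same number, different character" problem is easy to see in a small Python sketch. The byte value 0xA4 is my own choice of example; the codec names are the ones Python ships with.

```python
# One and the same byte from the upper half of the 8-bit range.
byte = bytes([0xA4])

# Interpreted as ISO-8859-1 it is the generic currency sign (U+00A4)...
as_latin1 = byte.decode("iso-8859-1")

# ...interpreted as ISO-8859-15 it is the euro sign (U+20AC).
as_latin9 = byte.decode("iso-8859-15")

# The lower half (plain ASCII) is identical in both mappings.
ascii_agrees = b"Hello".decode("iso-8859-1") == b"Hello".decode("iso-8859-15")

print(f"ISO-8859-1:  U+{ord(as_latin1):04X}")
print(f"ISO-8859-15: U+{ord(as_latin9):04X}")
```

Without knowing which mapping the writer used, the reader has no way to tell which of the two characters was meant.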
But that's not enough when you want to represent characters from more than one language, so cramming all available characters into a single byte just won't work.
There are essentially two different types of encodings. The first expands the value range by adding more bits while keeping every character the same size. Examples of these encodings would be UCS-2 (2 bytes = 16 bits) and UCS-4 (4 bytes = 32 bits). They suffer from inherently the same problem as the ASCII and ISO-8859 standards, as their value range is still limited, even if the limit is vastly higher. UCS-2 in particular stops at 65,536 values, which Unicode has since outgrown.
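A rough sketch of the fixed-width idea in Python, assuming big-endian byte order. The helper names `encode_fixed16` and `encode_fixed32` are made up for illustration; Python has no built-in UCS-2 codec, so the 16-bit variant is packed by hand.

```python
import struct

def encode_fixed16(text: str) -> bytes:
    """Store every code point in exactly one 16-bit unit (UCS-2 style)."""
    out = b""
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            # The value range is exhausted: this character cannot be stored.
            raise ValueError(f"U+{cp:04X} does not fit in 16 bits")
        out += struct.pack(">H", cp)
    return out

def encode_fixed32(text: str) -> bytes:
    """Store every code point in exactly one 32-bit unit (UCS-4 style)."""
    return b"".join(struct.pack(">I", ord(ch)) for ch in text)

print(encode_fixed16("A\u20ac").hex())      # two characters, four bytes
print(encode_fixed32("\U0001F600").hex())   # one character, four bytes
```

Calling `encode_fixed16("\U0001F600")` raises `ValueError`, which is the value-range limit in action.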
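The other type of encoding uses a variable number of units per character, and the most commonly known encodings for this are the UTF encodings. All UTF encodings start the same way: you choose a unit size, which for UTF-8 is 8 bits, for UTF-16 is 16 bits, and for UTF-32 is 32 bits. The standard then reserves certain bit patterns as markers. In UTF-8, a byte with its top bit clear represents one (ASCII) character fully; otherwise the leading bits of the first byte say how many bytes the character occupies, and every following byte of that character starts with the bits 10. In UTF-16, two reserved ranges of 16-bit values (the surrogates) are combined in pairs to reach code points above U+FFFF. UTF-32 never needs more than one unit per code point. Thus the most common (English) characters only occupy one byte in UTF-8 (two in UTF-16, four in UTF-32), while other characters occupy up to four bytes in UTF-8 (the original design allowed up to six, but the current definition stops at four) and either two or four bytes in UTF-16.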
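To see the unit counts for yourself, here is a small Python sketch. The four sample characters are arbitrary picks of mine; the little-endian codec variants are used only so that no byte order mark gets added to the output. As currently defined, UTF-8 needs between one and four bytes per code point, UTF-16 two or four, and UTF-32 always four.

```python
samples = {
    "latin A": "A",
    "e acute": "\u00e9",
    "euro sign": "\u20ac",
    "emoji": "\U0001F600",
}
encodings = ("utf-8", "utf-16-le", "utf-32-le")

# Bytes needed per character, in the order (UTF-8, UTF-16, UTF-32).
sizes = {
    name: tuple(len(ch.encode(enc)) for enc in encodings)
    for name, ch in samples.items()
}

for name, counts in sizes.items():
    print(f"{name:10} utf-8={counts[0]} utf-16={counts[1]} utf-32={counts[2]}")

# The marker bits are visible in the raw bytes: a UTF-8 lead byte of a
# three-byte sequence starts with 1110, each continuation byte with 10.
euro_bits = [f"{b:08b}" for b in "\u20ac".encode("utf-8")]
print(euro_bits)
```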
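Multi-byte encodings (I should say multi-unit after the above explanation) have the advantage that they are relatively space-efficient, but the downside that operations which work in terms of characters, such as finding the n-th character, taking substrings by character position, or counting characters, have to walk the units and decode them to Unicode code points first. There are some shortcuts, though: UTF-8, for example, is designed so that a plain byte-wise search for an encoded substring never produces a false match in the middle of a character.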
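Both the cost and one of the shortcuts can be sketched in Python on raw UTF-8 bytes. The helper name `count_code_points` and the sample text are mine; the sketch assumes the input is valid UTF-8.

```python
def count_code_points(data: bytes) -> int:
    """Count characters in valid UTF-8 by skipping continuation bytes (10xxxxxx)."""
    return sum(1 for b in data if (b & 0xC0) != 0x80)

text = "na\u00efve caf\u00e9"
data = text.encode("utf-8")

# Counting characters means looking at every byte; len(data) is not enough.
byte_length = len(data)
char_length = count_code_points(data)

# Shortcut: searching can be done byte-wise, without decoding anything...
needle = "caf\u00e9".encode("utf-8")
byte_offset = data.find(needle)

# ...but turning the byte offset into a character position needs another scan.
char_offset = count_code_points(data[:byte_offset])

print(byte_length, char_length, byte_offset, char_offset)
```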
Both the UCS standards and the UTF standards encode the code points as defined in Unicode. In theory, those encodings could be used to encode any number (within the range the encoding supports) - but of course these encodings were made to encode Unicode code points. And that's your relationship between them.
Windows handles so-called "Unicode" strings as UTF-16 strings, while most UNIXes default to UTF-8 these days. Communications protocols such as HTTP tend to work best with UTF-8, as the unit size in UTF-8 is the same as in ASCII and plain ASCII text is already valid UTF-8, and most such protocols were designed in the ASCII era. On the other hand, UTF-16 is often argued to give a good average space/processing trade-off across living languages, since most characters in everyday use fit into a single 16-bit unit; for ASCII-heavy text, though, UTF-8 is the smaller of the two.
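Why ASCII-era protocols get along with UTF-8 can be shown in a few lines of Python. The request line is just an illustrative string I made up.

```python
line = "GET /index.html HTTP/1.1"

ascii_bytes = line.encode("ascii")
utf8_bytes = line.encode("utf-8")
utf16_bytes = line.encode("utf-16-le")

# Pure ASCII text is byte-for-byte identical in UTF-8, so software that
# only understands ASCII keeps working on it.
same_as_ascii = ascii_bytes == utf8_bytes

# In UTF-16 every ASCII character drags a zero byte along, which doubles
# the size and trips up code that treats a zero byte as end-of-string.
has_zero_bytes = b"\x00" in utf16_bytes

print(len(ascii_bytes), len(utf8_bytes), len(utf16_bytes))
```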
The Unicode standard defines fewer code points than can be represented in 32 bits: the range ends at U+10FFFF, which needs only 21 bits. Thus for all practical purposes, UTF-32 and UCS-4 became the same encoding, as every code point fits into a single 32-bit unit and you never have to deal with multi-unit characters in UTF-32.
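The upper limit of the code point range can be checked from Python, which enforces it in `chr`. The helper `is_valid_code_point` is a hypothetical name for this sketch.

```python
import sys

MAX_CODE_POINT = 0x10FFFF

# Number of bits needed for the largest code point; well below 32.
bits_needed = MAX_CODE_POINT.bit_length()

def is_valid_code_point(n: int) -> bool:
    """Return True if Python accepts n as a code point."""
    try:
        chr(n)
        return True
    except ValueError:
        return False

# Even the very last code point still takes exactly one 32-bit unit.
last_unit_size = len(chr(MAX_CODE_POINT).encode("utf-32-le"))

print(bits_needed, sys.maxunicode == MAX_CODE_POINT)
print(is_valid_code_point(0x10FFFF), is_valid_code_point(0x110000))
```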
Hope that fills in some details.
Best Answer
Delphi versions prior to Delphi 2009 do have Unicode support built in. The WideString type has been available since Delphi 4, I think, maybe earlier. WideString isn't as nice as the new UnicodeString type, but it still holds 16-bit Unicode characters, and you can type-cast it to PWideChar to send strings to Unicode API functions. The Windows unit declares most of the "wide" versions of the API functions, and there's nothing to stop you from declaring other functions yourself if you find some missing.
What prior versions don't have is Unicode support in the VCL. For that, you can use the Tnt Unicode controls. They used to be free. Looks like there are a few places where the latest free version is still available: (1), (2).
The JCL has a couple of units for working with Unicode. The JclWideStrings unit has mostly light-weight utility functions. The JclUnicode unit is more complete, but it also includes a sizable resource for determining character properties of all Unicode characters.
With the JCL you have a few choices for classes to hold lists of WideString values. I think Delphi 7 even comes with a class for that.
Don't think that just because you don't have Delphi 2009 you can't write a Unicode program.
If you have a WideString value, and you want to encode it as UTF-8, then call the Utf8Encode function. It will return an AnsiString value, or possibly Utf8String, if your Delphi version declares that type. It's not the same as Delphi 2009's Utf8String type, though. Delphi 2009's will automatically convert to UnicodeString or AnsiString(x) and vice versa in assignment statements. Prior versions just have a single AnsiString type, so you need to keep track for yourself which variables hold UTF-8 data and which hold Ansi data. (Hungarian notation on your variable and parameter names can help you keep track.) And of course, there's also a Utf8Decode function for converting UTF-8 data back to WideString.
For handling other character encodings, you want to check out Open XML, a free XML library for Delphi. As part of its XML handling, it has support for converting between 70 different encodings.
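Why you have to keep track of which byte strings hold UTF-8 and which hold Ansi data is not specific to Delphi, so here is a language-neutral illustration in Python rather than Delphi. It assumes the Ansi code page is Windows-1252; the sample word is my own, and nothing here uses the Delphi functions named above.

```python
word = "caf\u00e9"

# The same text as two different byte strings. In pre-2009 Delphi both
# would sit in variables of the same AnsiString type.
utf8_bytes = word.encode("utf-8")     # e acute becomes two bytes: C3 A9
ansi_bytes = word.encode("cp1252")    # e acute becomes one byte:  E9

# Treating UTF-8 data as Ansi silently produces garbage characters.
misread = utf8_bytes.decode("cp1252")

# Treating Ansi data as UTF-8 fails outright, because a lone E9 byte
# announces a multi-byte sequence that never arrives.
try:
    ansi_bytes.decode("utf-8")
    ansi_is_valid_utf8 = True
except UnicodeDecodeError:
    ansi_is_valid_utf8 = False

print(utf8_bytes, ansi_bytes, ascii(misread), ansi_is_valid_utf8)
```

Nothing in the bytes themselves says which interpretation is right, which is why a naming convention for the variables helps.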