To expand on the answers others have given:
We've got lots of languages with lots of characters that computers should ideally display. Unicode assigns each character a unique number, or code point.
Computers deal with such numbers as bytes... skipping a bit of history here and ignoring memory addressing issues, 8-bit computers would treat an 8-bit byte as the largest numerical unit easily represented on the hardware, 16-bit computers would expand that to two bytes, and so forth.
Old character encodings such as ASCII are from the (pre-) 8-bit era, and try to cram the dominant language in computing at the time, i.e. English, into numbers ranging from 0 to 127 (7 bits). With 26 letters in the alphabet in both uppercase and lowercase forms, plus numerals and punctuation, that worked pretty well. ASCII got extended by an 8th bit for other, non-English languages, but the additional 128 numbers/code points made available by this expansion would be mapped to different characters depending on the language being displayed. The ISO 8859 standards are the most common forms of this mapping; the most widely used are ISO-8859-1 (also known as ISO Latin-1 or latin1) and ISO-8859-15 - and yes, the ISO 8859 standard comes in more than a dozen parts, one per mapping.
But a single byte is not enough when you want to represent characters from more than one such mapping in the same text, so cramming all available characters into one byte just won't work.
There are essentially two different types of encodings: one expands the value range by adding more bits. Examples of these encodings would be UCS2 (2 bytes = 16 bits) and UCS4 (4 bytes = 32 bits). They suffer from inherently the same problem as the ASCII and ISO-8859 standards, as their value range is still limited, even if the limit is vastly higher.
The other type of encoding uses a variable number of bytes per character, and the most commonly known encodings of this kind are the UTF encodings. All UTF encodings work in roughly the same manner: you choose a unit size, which is 8 bits for UTF-8, 16 bits for UTF-16, and 32 bits for UTF-32. The standard then defines some of these bits as flags: if they're set, the next unit in a sequence of units is to be considered part of the same character. If they're not set, this unit represents one character fully. Thus the most common (English) characters occupy only one byte in UTF-8 (two in UTF-16, four in UTF-32), while characters from other languages can occupy up to four bytes in UTF-8 (the original design allowed sequences of up to six bytes, but Unicode now limits every character to at most four).
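A rough illustration of those unit counts, using MySQL's string functions (any UTF-8-aware tool will report the same byte counts; the sample string is arbitrary):

```sql
-- Assuming the client connection is set to UTF-8 (utf8mb4 in MySQL terms).
SET NAMES utf8mb4;

-- LENGTH() counts bytes, CHAR_LENGTH() counts characters.
-- In UTF-8: 'a' = 1 byte, 'é' = 2 bytes, '€' = 3 bytes, '😀' = 4 bytes.
SELECT LENGTH('aé€😀')      AS byte_count,   -- 10
       CHAR_LENGTH('aé€😀') AS char_count;   -- 4
```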
Multi-byte encodings (I should say multi-unit, after the above explanation) have the advantage that they are relatively space-efficient, but the downside that operations such as finding substrings, comparisons, etc. all have to decode the characters to Unicode code points before such operations can be performed (there are some shortcuts, though).
Both the UCS standards and the UTF standards encode the code points as defined in Unicode. In theory, those encodings could be used to encode any number (within the range the encoding supports) - but of course these encodings were made to encode Unicode code points. And that's your relationship between them.
Windows handles so-called "Unicode" strings as UTF-16 strings, while most UNIXes default to UTF-8 these days. Communications protocols such as HTTP tend to work best with UTF-8, as the unit size in UTF-8 is the same as in ASCII, and most such protocols were designed in the ASCII era. On the other hand, UTF-16 gives the best average space/processing performance when representing all living languages.
The Unicode standard defines far fewer code points than can be represented in 32 bits. Thus for all practical purposes, UTF-32 and UCS4 are the same encoding: every character fits in a single 32-bit unit, so you never have to deal with multi-unit sequences in UTF-32.
Hope that fills in some details.
For those people still arriving at this question in 2020 or later, there are newer options that may be better than both of these. For example, utf8mb4_0900_ai_ci.
All these collations are for the UTF-8 character encoding. The differences are in how text is sorted and compared.
_unicode_ci and _general_ci are two different sets of rules for sorting and comparing text according to the way we expect. Newer versions of MySQL introduce new sets of rules, too, such as _0900_ai_ci for equivalent rules based on Unicode 9.0 - and with no equivalent _general_ci variant. People reading this now should probably use one of these newer collations instead of either _unicode_ci or _general_ci. The description of those older collations below is provided for interest only.
MySQL is currently transitioning away from an older, flawed UTF-8 implementation. For now, you need to use utf8mb4 instead of utf8 for the character encoding part, to ensure you are getting the fixed version. The flawed version remains for backward compatibility, though it is being deprecated.
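For example, a table definition that asks for the fixed encoding and a modern collation explicitly might look like the sketch below (the table and column names are placeholders, and utf8mb4_0900_ai_ci requires MySQL 8.0 or newer):

```sql
-- Request the full 4-byte UTF-8 encoding explicitly, rather than relying on
-- server defaults or the legacy 3-byte "utf8" (utf8mb3).
CREATE TABLE people (
    name VARCHAR(100)
) CHARACTER SET utf8mb4
  COLLATE utf8mb4_0900_ai_ci;
```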
Key differences
utf8mb4_unicode_ci is based on the official Unicode rules for universal sorting and comparison, which sorts accurately in a wide range of languages.

utf8mb4_general_ci is a simplified set of sorting rules which aims to do as well as it can while taking many short-cuts designed to improve speed. It does not follow the Unicode rules and will result in undesirable sorting or comparison in some situations, such as when using particular languages or characters.

On modern servers, this performance boost will be all but negligible. It was devised in a time when servers had a tiny fraction of the CPU performance of today's computers.
Benefits of utf8mb4_unicode_ci over utf8mb4_general_ci
utf8mb4_unicode_ci, which uses the Unicode rules for sorting and comparison, employs a fairly complex algorithm for correct sorting in a wide range of languages and when using a wide range of special characters. These rules need to take into account language-specific conventions; not everybody sorts their characters in what we would call 'alphabetical order'.

As far as Latin (i.e. "European") languages go, there is not much difference between the Unicode sorting and the simplified utf8mb4_general_ci sorting in MySQL, but there are still a few differences:
For example, the Unicode collation sorts "ß" like "ss", and "Œ" like "OE", as people using those characters would normally want, whereas utf8mb4_general_ci sorts them as single characters (presumably like "s" and "e" respectively); a query demonstrating this is sketched a little further below.
Some Unicode characters are defined as ignorable, which means they shouldn't count toward the sort order, and the comparison should move on to the next character instead. utf8mb4_unicode_ci handles these properly.
In non-Latin languages, such as Asian languages or languages with different alphabets, there may be a lot more differences between Unicode sorting and the simplified utf8mb4_general_ci sorting. The suitability of utf8mb4_general_ci will depend heavily on the language used. For some languages, it'll be quite inadequate.
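To make the "ß" example above concrete, here is a quick check you can run (assuming the connection character set is utf8mb4):

```sql
SET NAMES utf8mb4;

-- Under the Unicode rules 'ß' expands to 'ss'; the simplified general_ci
-- rules treat it as a single character, so the strings compare as different.
SELECT 'ß' = 'ss' COLLATE utf8mb4_unicode_ci AS unicode_rules,   -- 1 (equal)
       'ß' = 'ss' COLLATE utf8mb4_general_ci AS general_rules;   -- 0 (not equal)
```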
What should you use?
There is almost certainly no reason to use utf8mb4_general_ci anymore, as we are long past the point where CPU speed was low enough for the performance difference to matter. Your database will almost certainly be limited by other bottlenecks than this.
In the past, some people recommended using utf8mb4_general_ci except when accurate sorting was going to be important enough to justify the performance cost. Today, that performance cost has all but disappeared, and developers are treating internationalization more seriously.
There's an argument to be made that if speed is more important to you than accuracy, you may as well not do any sorting at all. It's trivial to make an algorithm faster if you do not need it to be accurate. So, utf8mb4_general_ci is a compromise that's probably not needed for speed reasons and probably also not suitable for accuracy reasons.
One other thing I'll add is that even if you know your application only supports the English language, it may still need to deal with people's names, which can often contain characters from other languages and which are just as important to sort correctly. Using the Unicode rules for everything helps add peace of mind that the very smart Unicode people have worked very hard to make sorting work properly.
What the parts mean
Firstly, ci is for case-insensitive sorting and comparison. This means it's suitable for textual data where case is not important. The other types of collation are cs (case-sensitive) for textual data where case is important, and bin, for where the encoding needs to match bit for bit, which is suitable for fields containing what is really encoded binary data (including, for example, Base64). Case-sensitive sorting leads to some weird results, and case-sensitive comparison can result in duplicate values differing only in letter case, so case-sensitive collations are falling out of favor for textual data - if case is significant to you, then otherwise-ignorable punctuation and so on is probably also significant, and a binary collation might be more appropriate.
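A small sketch of the difference between a case-insensitive and a binary collation (again assuming a utf8mb4 connection):

```sql
SET NAMES utf8mb4;

-- A _ci collation matches regardless of letter case; _bin compares the
-- underlying encoding exactly, so the case difference makes the strings unequal.
SELECT 'ABC' = 'abc' COLLATE utf8mb4_unicode_ci AS ci_compare,   -- 1
       'ABC' = 'abc' COLLATE utf8mb4_bin        AS bin_compare;  -- 0
```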
Next, unicode or general refers to the specific sorting and comparison rules - in particular, the way text is normalized or compared. There are many different sets of rules for the utf8mb4 character encoding, with unicode and general being two that attempt to work well in all possible languages rather than in one specific one. The differences between these two sets of rules are the subject of this answer. Note that unicode uses rules from Unicode 4.0. Recent versions of MySQL add the rulesets unicode_520, using rules from Unicode 5.2, and 0900 (dropping the "unicode_" part), using rules from Unicode 9.0.
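To see which of these rulesets your particular server offers, you can list the available collations (the output varies by MySQL version):

```sql
-- Every collation defined for the utf8mb4 character encoding on this server.
SHOW COLLATION WHERE Charset = 'utf8mb4';
```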
And lastly, utf8mb4 is of course the character encoding used internally. In this answer I'm talking only about Unicode-based encodings.
At least in Delphi 2009 (I can't test in version 2010 as I don't have it) the StrFormatByteSize() function is an alias to the Ansi version (StrFormatByteSizeA()), not to the wide-char version (StrFormatByteSizeW()) as it is for most of the other Windows API functions. Therefore you should use the wide-char version directly - also in earlier Delphi versions - to be able to work with file (system) sizes larger than 4 GB. There's no need for an intermediate buffer, and you can make use of the fact that StrFormatByteSizeW() returns a pointer to the converted result as a PWideChar:
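The code that followed in the original answer isn't shown above, so here is a minimal sketch of the approach it describes. The helper name FileSizeToStr is made up, and the API import is declared by hand rather than taken from the ShLwApi unit:

```pascal
uses
  Windows;

// Declared by hand so the wide-char version is used explicitly, regardless of
// how the bundled ShLwApi unit maps StrFormatByteSize.
function StrFormatByteSizeW(qdw: Int64; pszBuf: PWideChar;
  cchBuf: UINT): PWideChar; stdcall; external 'shlwapi.dll';

// Hypothetical helper: formats a 64-bit size as a human-readable string.
function FileSizeToStr(ASize: Int64): string;
var
  Buf: array[0..63] of WideChar;
begin
  // StrFormatByteSizeW() fills Buf and returns a pointer to it, so the
  // result can be assigned straight to a Delphi string - no intermediate copy.
  Result := StrFormatByteSizeW(ASize, Buf, Length(Buf));
end;
```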