MySQL – Best Collation for MySQL Tables

character encoding, MySQL

I'm curious: what is considered the standard collation for MySQL tables today?

I was told that Latin-1 was the best choice when I was beginning with MySQL, but came across this post from 2009.

It states that the US-ASCII and Latin-1 character sets are on the way out, and that UTF-8 is on the way in and should be made the default by MySQL version 6.

Does this gel with what you know, or with what you think, or with what you do?

For this question, let's assume there is no specific need for non-English characters.

Best Answer

UTF-8 is becoming more and more common virtually everywhere (W3Techs in June 2015 puts UTF-8 at 84.3% on the web). It has no storage penalty over US-ASCII for the US-ASCII range (U+0000 through U+007F), because every character in that range is encoded as the same single byte in both. Depending on the implementation, a BOM may carry a three-byte penalty, but a BOM is only needed if the encoding is not otherwise known, which it would be in this case. UTF-8 can also represent the full range of Unicode, so it will future-proof your application character-set-wise at no extra cost if you don't use that capability.

In summary, I see no reason not to use UTF-8 encoding these days, particularly if your choice is between UTF-8 and US-ASCII. And even in a US-only world, I would be very wary of saying that there will be "no need" to encode any letters outside of the English alphabet.
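The storage claim is easy to check outside of MySQL. Here is a minimal Python sketch, not MySQL-specific, that compares the encoded bytes of an ASCII-only string under both encodings and measures the BOM overhead. The sample strings and variable names are only illustrative, and `utf-8-sig` is Python's name for "UTF-8 with a BOM prepended".

```python
import codecs

# Illustrative sample text containing only US-ASCII characters.
ascii_text = "Best Collation for MySQL Tables"

ascii_bytes = ascii_text.encode("ascii")
utf8_bytes = ascii_text.encode("utf-8")

# For U+0000 through U+007F the two encodings produce identical bytes,
# so there is no size difference at all.
same_bytes = ascii_bytes == utf8_bytes
print("identical bytes:", same_bytes)

# Python's "utf-8-sig" codec prepends the UTF-8 BOM (EF BB BF) when encoding.
with_bom = ascii_text.encode("utf-8-sig")
bom_overhead = len(with_bom) - len(utf8_bytes)
print("BOM overhead in bytes:", bom_overhead)
print("starts with BOM:", with_bom.startswith(codecs.BOM_UTF8))

# Characters outside the ASCII range cost more than one byte each in UTF-8,
# but only when they are actually used.
byte_lengths = {ch: len(ch.encode("utf-8")) for ch in ["e", "é", "€"]}
print(byte_lengths)
```

The extra bytes only appear for the non-ASCII characters, which is the "no extra cost if you don't use that capability" point in practice.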

I'm pretty sure I saw an RFC from a few years back stating that UTF-8 is the new Internet standard character set (replacing US-ASCII), but I couldn't find it again. However, BCP 18 (RFC 2277) section 3.1, "What charset to use", comes close, stating in part:

All protocols MUST identify, for all character data, which charset is in use.

Protocols MUST be able to use the UTF-8 charset, which consists of the ISO 10646 coded character set combined with the UTF-8 character encoding scheme, as defined in [10646] Annex R (published in Amendment 2), for all text.

Protocols MAY specify, in addition, how to use other charsets or other character encoding schemes for ISO 10646, such as UTF-16, but lack of an ability to use UTF-8 is a violation of this policy; such a violation would need a variance procedure ([BCP9] section 9) with clear and solid justification in the protocol specification document before being entered into or advanced upon the standards track.