Database Configuration – Should Latin-1 Be Used Over UTF-8?

asciidatabaseMySQLruby-on-railsutf-8

We are using MySQL at the company I work for, and we build both client-facing and internal applications using Ruby on Rails.

When I started working here, I ran into a problem what I had never encountered before; the database on the production server is set to Latin-1, meaning that the MySQL gem throws an exception whenever there is user input where the user copies & pastes UTF-8 characters.

My boss calls these "bad characters" since most of them are non-printable characters, and says that we need to strip them out. I've found a few ways to do this, but eventually we've ended up in a circumstance where a UTF-8 character was needed. Plus it's a bit of a hassle, especially since it seems like the only solution I ever read about for this issue is to just set the database to UTF-8 (makes sense to me).

The only argument that I've heard for sticking with Latin-1 is that allowing non-printable UTF-8 characters can mess up text/full-text searches in MySQL. Is this really true?

Are there other reasons one should use Latin-1 over UTF-8? It's my understanding that it is superior and becoming more ubiquitous.

Best Answer

Unicode is certainly difficult, and the UTF-8 encoding has a couple of inconvenient properties. However, UTF-8 has become the de-facto standard encoding on the web, surpassing ASCII, Latin-1, UCS-2 and UTF-16. Just use UTF-8 everywhere.

The most important reason why you should support Unicode is that you shouldn't make unnecessary assumptions about user input. I have no idea what your domain is, but things like Hebrew usernames, a blog post about China, a comment with Emoji, or simply well styled text – like “this” – should be possible… Oh, those were typographically correct quotation marks (“” rather than ""), en-wide dashes, and an ellipsis, which are characters that are common in English text, but not supported by ASCII or Latin-1. So not supporting other scripts isn't just a big f*ck you to other cultures, but sticking to Latin-1 doesn't even allow you to write proper English.

The notion that Unicode only allows “bad characters” is wrong. Yes, text is really complicated, and Unicode won't hide that from you. Your boss may be thinking about composed characters, where one base codepoint such as a is modified by subsequent codepoints that e.g. represent diacritics to form one visual character such as á. This doesn't really get into your way when trying to do searches if you do some kind of normalization. For example, you could store all text in the NFC form which collapses such compositions into their precomposed form if one is available. When doing searching, you could also strip all composing characters from the text, but this may substantially change their meaning in some languages.

Unicode also adds a lot of unprintable characters – but even ASCII has loads of them. Will you handle a NUL in the middle of a string? How about 0x1C, a “File Separator”? I've never seen half of those. Latin-1 adds a soft hyphen that indicates word break opportunities, but is otherwise invisible. Does that also break your full-text search? In other words, even ASCII and Latin-1 allow you to completely break your input if you assume it's all just printable text!