Unicode is certainly difficult, and the UTF-8 encoding has a couple of inconvenient properties. However, UTF-8 has become the de-facto standard encoding on the web, surpassing ASCII, Latin-1, UCS-2 and UTF-16. Just use UTF-8 everywhere.
The most important reason why you should support Unicode is that you shouldn't make unnecessary assumptions about user input. I have no idea what your domain is, but things like Hebrew usernames, a blog post about China, a comment with Emoji, or simply well styled text – like “this” – should be possible… Oh, those were typographically correct quotation marks (“”
rather than ""
), en-wide dashes, and an ellipsis, which are characters that are common in English text, but not supported by ASCII or Latin-1. So not supporting other scripts isn't just a big f*ck you to other cultures, but sticking to Latin-1 doesn't even allow you to write proper English.
The notion that Unicode only allows “bad characters” is wrong. Yes, text is really complicated, and Unicode won't hide that from you. Your boss may be thinking about composed characters, where one base codepoint such as a
is modified by subsequent codepoints that e.g. represent diacritics to form one visual character such as á
. This doesn't really get into your way when trying to do searches if you do some kind of normalization. For example, you could store all text in the NFC form which collapses such compositions into their precomposed form if one is available. When doing searching, you could also strip all composing characters from the text, but this may substantially change their meaning in some languages.
Unicode also adds a lot of unprintable characters – but even ASCII has loads of them. Will you handle a NUL in the middle of a string? How about 0x1C, a “File Separator”? I've never seen half of those. Latin-1 adds a soft hyphen that indicates word break opportunities, but is otherwise invisible. Does that also break your full-text search? In other words, even ASCII and Latin-1 allow you to completely break your input if you assume it's all just printable text!
Best Answer
You need to ensure that the connection between the web application and the database doesn't mangle the encoding (I believe you need to explicitly set this on the connection string for MySQL, for instance).
Basically you need to ensure that every step in the chain is using the same encoding.