Are you fluent in Unicode yet?

ascii, internationalization, language-agnostic, unicode

Almost 5 years ago Joel Spolsky wrote this article, "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)".

Like many, I read it carefully, realizing it was high time I got to grips with this "replacement for ASCII". Unfortunately, 5 years later, I feel I have slipped back into a few bad habits in this area. Have you?

I don't write many specifically international applications, but I have helped build many Internet-facing ASP.NET websites, so I guess that's not an excuse.

So for my benefit (and, I believe, that of many others), can I get some input from people on the following:

  • How to "get over" ASCII once and for all.
  • Fundamental guidance when working with Unicode.
  • Recommended (recent) books and websites on Unicode (for developers).
  • Current state of Unicode (5 years after Joel's article).
  • Future directions.

I must admit I have a .NET background, so I would also be happy to get information on Unicode in the .NET Framework. Of course, this shouldn't stop anyone with a different background from commenting.

Update: See also this related question, asked previously on Stack Overflow.

Best Answer

Since reading Joel's article and some other i18n articles, I have always kept a close eye on my character encoding, and it actually works if you do it consistently. If you work at a company where UTF-8 is the standard and everybody knows and follows this, it will work.

Here are some interesting articles (besides Joel's article) on the subject:

A quote from the first article, with tips for using Unicode:

  • Embrace Unicode, don't fight it; it's probably the right thing to do, and if it weren't you'd probably have to anyhow.
  • Inside your software, store text as UTF-8 or UTF-16; that is to say, pick one of the two and stick with it.
  • Interchange data with the outside world using XML whenever possible; this makes a whole bunch of potential problems go away.
  • Try to make your application browser-based rather than write your own client; the browsers are getting really quite good at dealing with the texts of the world.
  • If you're using someone else's library code (and of course you are), assume its Unicode handling is broken until proved to be correct.
  • If you're doing search, try to hand the linguistic and character-handling problems off to someone who understands them.
  • Go off to Amazon or somewhere and buy the latest revision of the printed Unicode standard; it contains pretty well everything you need to know.
  • Spend some time poking around the Unicode web site and learning how the code charts work.
  • If you're going to have to do any serious work with Asian languages, go buy the O'Reilly book on the subject by Ken Lunde.
  • If you have a Macintosh, run out and grab Lord Pixel's Unicode Font Inspection tool. Totally cool.
  • If you're really going to have to get down and dirty with the data, go attend one of the twice-a-year Unicode conferences. All the experts go and if you don't know what you need to know, you'll be able to find someone there who knows.
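
To make the "pick one encoding and stick with it" tip concrete, here is a minimal Python sketch (not taken from the quoted article). It assumes UTF-8 is the agreed encoding; the helper names `read_text` and `write_text` are hypothetical and exist only for illustration. Bytes are decoded explicitly at the boundary, text stays as `str` internally, and output is encoded as UTF-8 again.

```python
# Sketch only: decode bytes at the boundary, work with text internally,
# and encode back to one agreed encoding (UTF-8 here) on the way out.
# read_text and write_text are hypothetical helper names.

def read_text(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode incoming bytes explicitly; raise on invalid input."""
    return raw.decode(encoding, errors="strict")


def write_text(text: str) -> bytes:
    """Encode outgoing text as UTF-8, the single encoding chosen for this sketch."""
    return text.encode("utf-8")


# Pretend these bytes arrived from a file, socket, or HTTP request.
incoming = "Zoë, naïve, 日本語".encode("utf-8")
text = read_text(incoming)

# The number of code points differs from the number of bytes.
print(len(text), len(incoming))

# Decoding with the wrong encoding does not fail here; it silently
# produces different (garbled) text, which is why consistency matters.
wrong = incoming.decode("latin-1")
print(wrong == text)  # False
```

The round trip `write_text(read_text(incoming))` gives back the original bytes only because both ends agree on UTF-8; the Latin-1 decode shows what happens when they do not.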