Are you fluent in Unicode yet?

ascii internationalization language-agnostic unicode

Almost 5 years ago Joel Spolsky wrote this article, "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)".

Like many, I read it carefully, realizing it was high time I got to grips with this "replacement for ASCII". Unfortunately, 5 years later I feel I have slipped back into a few bad habits in this area. Have you?

I don't write many specifically international applications, but I have helped build many internet-facing ASP.NET websites, so I guess that's not an excuse.

So for my benefit (and, I believe, that of many others), can I get some input from people on the following:

How to "get over" ASCII once and for all.
Fundamental guidance when working with Unicode.
Recommended (recent) books and websites on Unicode (for developers).
Current state of Unicode (5 years after Joel's article).
Future directions.

I must admit I have a .NET background and so would also be happy for information on Unicode in the .NET framework. Of course this shouldn't stop anyone with a differing background from commenting though.

Update: See this related question also asked on StackOverflow previously.

Best Answer

Since reading Joel's article and some other i18n articles, I have always kept a close eye on my character encoding, and it actually works if you do it consistently. If you work in a company where using UTF-8 is the standard and everybody knows and follows this, it will work.

Here are some interesting articles (besides Joel's article) on the subject:

A quote from the first article, with tips for using Unicode:

Embrace Unicode, don't fight it; it's probably the right thing to do, and if it weren't you'd probably have to anyhow.
Inside your software, store text as UTF-8 or UTF-16; that is to say, pick one of the two and stick with it.
Interchange data with the outside world using XML whenever possible; this makes a whole bunch of potential problems go away.
Try to make your application browser-based rather than write your own client; the browsers are getting really quite good at dealing with the texts of the world.
If you're using someone else's library code (and of course you are), assume its Unicode handling is broken until proved to be correct.
If you're doing search, try to hand the linguistic and character-handling problems off to someone who understands them.
Go off to Amazon or somewhere and buy the latest revision of the printed Unicode standard; it contains pretty well everything you need to know.
Spend some time poking around the Unicode web site and learning how the code charts work.
If you're going to have to do any serious work with Asian languages, go buy the O'Reilly book on the subject by Ken Lunde.
If you have a Macintosh, run out and grab Lord Pixel's Unicode Font Inspection tool. Totally cool.
If you're really going to have to get down and dirty with the data, go attend one of the twice-a-year Unicode conferences. All the experts go and if you don't know what you need to know, you'll be able to find someone there who knows.
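
To make the "pick one encoding and stick with it" tip concrete, here is a minimal Python sketch. The helper names read_text and write_text are hypothetical, made up for this illustration. The idea it assumes is that bytes are decoded once on the way in, handled as Unicode strings inside the program, and encoded once on the way out.

```python
# Hypothetical helpers for illustration only: decode at the input boundary,
# encode at the output boundary, and work with str everywhere in between.

def read_text(raw, encoding="utf-8"):
    """Turn bytes received from outside into a Unicode string."""
    return raw.decode(encoding)


def write_text(text, encoding="utf-8"):
    """Turn a Unicode string into bytes for storage or transmission."""
    return text.encode(encoding)


sample = "naïve 日本語"

utf8_bytes = write_text(sample, "utf-8")
utf16_bytes = write_text(sample, "utf-16-le")

# The same 9 code points need a different number of bytes per encoding,
# which is why both sides of an exchange must agree on the encoding.
print(len(sample), len(utf8_bytes), len(utf16_bytes))  # 9 16 18

# Round-tripping through the agreed encoding gives the original text back.
assert read_text(utf8_bytes, "utf-8") == sample
assert read_text(utf16_bytes, "utf-16-le") == sample
```

Which of the two encodings you standardize on matters less than never mixing them silently.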

Related Solutions

What and where are the stack and heap

The stack is the memory set aside as scratch space for a thread of execution. When a function is called, a block is reserved on the top of the stack for local variables and some bookkeeping data. When that function returns, the block becomes unused and can be used the next time a function is called. The stack is always reserved in a LIFO (last in first out) order; the most recently reserved block is always the next block to be freed. This makes it really simple to keep track of the stack; freeing a block from the stack is nothing more than adjusting one pointer.
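
As a rough illustration of why freeing from the stack is just a pointer adjustment, here is a toy Python model. The class StackArena and its method names are invented for this sketch. Real stacks are managed by the CPU and OS and usually grow downward, whereas this model grows upward for simplicity.

```python
class StackArena:
    """Toy model of a thread's stack: a fixed buffer plus a 'top' offset."""

    def __init__(self, size):
        self.memory = bytearray(size)
        self.top = 0        # offset of the first unused byte
        self.frames = []    # saved 'top' values, one per live frame

    def push_frame(self, nbytes):
        """Reserve nbytes for a function call and return the frame's base offset."""
        if self.top + nbytes > len(self.memory):
            raise MemoryError("stack overflow")
        base = self.top
        self.frames.append(base)
        self.top += nbytes  # allocation is a single addition
        return base

    def pop_frame(self):
        """Release the most recently reserved frame (LIFO order)."""
        self.top = self.frames.pop()  # deallocation just restores the old top


arena = StackArena(64)
outer = arena.push_frame(16)  # e.g. locals of an outer function
inner = arena.push_frame(8)   # locals of a function it calls
print(outer, inner, arena.top)  # 0 16 24
arena.pop_frame()               # the inner function returns
print(arena.top)                # 16
```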

The heap is memory set aside for dynamic allocation. Unlike the stack, there's no enforced pattern to the allocation and deallocation of blocks from the heap; you can allocate a block at any time and free it at any time. This makes it much more complex to keep track of which parts of the heap are allocated or freed at any given time; there are many custom heap allocators available to tune heap performance for different usage patterns.
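
For contrast, here is a deliberately simple first-fit allocator in Python that shows the extra bookkeeping a heap needs. ToyHeap is a hypothetical name and first-fit with coalescing is just one assumed strategy; production allocators are far more sophisticated.

```python
class ToyHeap:
    """First-fit allocator over a fixed range; free blocks are (offset, size) pairs."""

    def __init__(self, size):
        self.free_blocks = [(0, size)]  # kept sorted by offset
        self.allocated = {}             # offset -> size of each live block

    def alloc(self, nbytes):
        """Return the offset of the first free block that is large enough."""
        for i, (offset, size) in enumerate(self.free_blocks):
            if size >= nbytes:
                if size == nbytes:
                    del self.free_blocks[i]
                else:
                    # Shrink the free block from the front.
                    self.free_blocks[i] = (offset + nbytes, size - nbytes)
                self.allocated[offset] = nbytes
                return offset
        raise MemoryError("out of heap space")

    def free(self, offset):
        """Release a block in any order and merge it with adjacent free blocks."""
        size = self.allocated.pop(offset)
        self.free_blocks.append((offset, size))
        self.free_blocks.sort()
        merged = [self.free_blocks[0]]
        for off, sz in self.free_blocks[1:]:
            last_off, last_sz = merged[-1]
            if last_off + last_sz == off:
                merged[-1] = (last_off, last_sz + sz)
            else:
                merged.append((off, sz))
        self.free_blocks = merged


heap = ToyHeap(100)
a = heap.alloc(10)
b = heap.alloc(20)
c = heap.alloc(30)
heap.free(b)             # freeing out of order leaves a hole in the middle
print(heap.free_blocks)  # [(10, 20), (60, 40)]
```

Unlike the stack, every allocation here has to search, and every free has to repair the free list.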

Each thread gets a stack, while there's typically only one heap for the application (although it isn't uncommon to have multiple heaps for different types of allocation).

To answer your questions directly:

To what extent are they controlled by the OS or language runtime?

The OS allocates the stack for each system-level thread when the thread is created. Typically the OS is called by the language runtime to allocate the heap for the application.

What is their scope?

The stack is attached to a thread, so when the thread exits the stack is reclaimed. The heap is typically allocated at application startup by the runtime, and is reclaimed when the application (technically process) exits.

What determines the size of each of them?

The size of the stack is set when a thread is created. The size of the heap is set on application startup, but can grow as space is needed (the allocator requests more memory from the operating system).

What makes one faster?

The stack is faster because the access pattern makes it trivial to allocate and deallocate memory from it (a pointer/integer is simply incremented or decremented), while the heap has much more complex bookkeeping involved in an allocation or deallocation. Also, each byte in the stack tends to be reused very frequently, which means it tends to stay in the processor's cache, making it very fast. Another performance hit for the heap is that the heap, being mostly a global resource, typically has to be thread-safe, i.e. each allocation and deallocation typically needs to be synchronized with all other heap accesses in the program.

A clear demonstration: [image not preserved; image source: vikashazrati.wordpress.com]

Best practices in PHP and MySQL with international strings

At first glance, I think one important thing is missing from http://www.nicknettleton.com/zine/php/php-utf-8-cheatsheet (perhaps I overlooked it). Depending on your MySQL installation and/or configuration, you have to set the connection encoding so that MySQL knows what encoding you're expecting on the client side (meaning the client side of the MySQL connection, which should be your PHP script). You can do this by manually issuing a

SET NAMES utf8

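query prior to any other query you send to the MySQL server. (Note that MySQL's utf8 stores at most three bytes per character; on MySQL 5.5.3 or later, utf8mb4 covers the full Unicode range.)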

If you're using PDO on the PHP side, you can set up the connection to automatically issue this query on every (re)connect by using

// The init command must be passed to the constructor; calling setAttribute() afterwards is too late.
$options = array(PDO::MYSQL_ATTR_INIT_COMMAND => "SET NAMES utf8");
$db = new PDO($dsn, $user, $pass, $options);

when initializing your db connection.
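
To see why the connection encoding matters, here is a small Python analogy. It is not MySQL's actual code path, only an illustration of what happens when one side sends UTF-8 bytes and the other side assumes a single-byte encoding such as Latin-1.

```python
text = "café"

# The client sends UTF-8 bytes over the connection...
wire_bytes = text.encode("utf-8")

# ...but the other side assumes Latin-1, so each byte becomes one character.
misread = wire_bytes.decode("latin-1")
print(misread)  # cafÃ©

# As long as no bytes were lost, reversing the wrong step recovers the text.
repaired = misread.encode("latin-1").decode("utf-8")
print(repaired)  # café
```

Declaring the connection encoding up front avoids this kind of silent misinterpretation, so the repair step is never needed.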
