PHP – difference between en_US.utf8 and en_US.UTF-8

Tags: charset, encoding, localization, PHP, utf-8

Server info (DNS and IPs removed):

cat /proc/version && uname -a && java -version

Linux version 2.6.16.33-xenU (*************) (gcc version 4.1.1 20070105 (Red Hat 4.1.1-52)) #2 SMP Wed Aug 15 17:27:36 SAST 2007
Linux ************* *************-xenU #2 SMP Wed Aug 15 17:27:36 SAST 2007 x86_64 x86_64 x86_64 GNU/Linux
java version "1.6.0_14"
Java(TM) SE Runtime Environment (build 1.6.0_14-b08)
Java HotSpot(TM) 64-Bit Server VM (build 14.0-b16, mixed mode)

I have some PHP code that reads from an Excel file and does string comparisons. It is failing on the server due to what seems to be a locale issue. On my local machine (OS X 10.8.5 Mountain Lion), however, it works!

On my local machine the locale is en_US.UTF-8. On the server the locale was POSIX, but I changed it to en_US.utf8 since there was no en_US.UTF-8 when I looked at locale -a (interestingly, the locales listed on the server are all lower case but on my Mac they are all upper case, which is where this question stems from).

Is there a difference between the two that could affect string comparisons?

Also, as per this SF post I ran locale -v -a. On the server, en_US.utf8 uses the UTF-8 codeset (I'm assuming this is the same as what I normally call charset?). However, on my local machine I seem unable to run the locale -v -a command, though locale and locale -a work fine.

Best Answer

TL;DR:

The codepage / character set .utf8 in en_US.utf8 is not officially recognised as far as I can tell. There is no IANA utf8 character set name. utf8 is likely generated by glibc - see final heading.

The IANA character set name is UTF-8.

The hyphen is significant
The name is case-insensitive

Therefore, these are all valid:

en_US.utf-8
en_US.UTF-8
en_US.uTf-8

There is also a case-sensitive alias for the name UTF-8, namely: csUTF8.

Therefore, this would also be valid:

en_US.csUTF8

But I have never seen this in the wild.

The details, with chapter and verse

UTF-8 is a valid IANA character set name, whereas utf8 is not. It's not even a valid alias.

POSIX.1-2017, section 8.2 Internationalization Variables says:

If the locale value has the form:
language[_territory][.codeset]
it refers to an implementation-provided locale, where settings of language, territory, and codeset are implementation-defined.

Here the part in question is the [.codeset] part, which POSIX doesn't define, but IANA does.
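To make the three parts concrete, here is a minimal Python sketch that splits a locale name along the form quoted above. The names split_locale and LocaleName are hypothetical helpers invented for this illustration, not part of POSIX or any library, and the sketch deliberately handles only language[_territory][.codeset] (a name carrying an @modifier suffix is rejected).

```python
import re
from typing import NamedTuple, Optional


class LocaleName(NamedTuple):
    """Hypothetical container for the parts of a locale name."""
    language: str
    territory: Optional[str]
    codeset: Optional[str]


# language, then an optional _territory, then an optional .codeset
_LOCALE_RE = re.compile(r"^([^_.@]+)(?:_([^.@]+))?(?:\.([^@]+))?$")


def split_locale(name: str) -> Optional[LocaleName]:
    """Split a locale name; return None if it does not fit the form."""
    match = _LOCALE_RE.match(name)
    if match is None:
        return None
    return LocaleName(*match.groups())


for example in ("en_US.UTF-8", "en_US.utf8", "en_US", "POSIX"):
    print(example, "->", split_locale(example))
```

Only the codeset field differs between en_US.UTF-8 and en_US.utf8, which is why the rest of this answer concentrates on that part.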

For the character set defined by RFC 3629: UTF-8, a transformation format of ISO 10646, the IANA Character Sets lists the name as:

UTF-8

and the note at the top says:

These are the official names for character sets that may be used in the Internet and may be referred to in Internet documentation.

An alias csUTF8 is provided, about which RFC2978 IANA Charset Registration Procedures, section 2.3 says:

All other names are considered to be aliases for the primary name and use of the primary name is preferred over use of any of the aliases.

IANA Character Sets also says:

The "cs" stands for character set and is provided for applications that need a lower case first letter but want to use mixed case thereafter that cannot contain any special characters, such as underbar ("_") and dash ("-").

In the cs alias, the case is significant (while the name is defined as case insensitive, above).

Given the alias csUTF8, en_US.csUTF8 would also be valid, but I have never seen this format in the wild.

While case matters in aliases, regarding names, IANA Character Sets says:

The character set names may be up to 40 characters taken from the printable characters of US-ASCII. However, no distinction is made between use of upper and lower case letters.

So while en_US.utf-8 is valid (a lowercase version of the listed UTF-8), en_US.utf8 doesn't refer to an IANA character set as it drops the -.
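The rules so far can be written down as a small check. This is only a sketch of my reading of the IANA text, not an official validator: is_iana_utf8 is a hypothetical name, and treating the csUTF8 alias as case-sensitive is this answer's interpretation.

```python
def is_iana_utf8(codeset: str) -> bool:
    """Return True if codeset matches UTF-8 under this answer's reading."""
    # Primary name: case is ignored, but the hyphen must be present.
    if codeset.lower() == "utf-8":
        return True
    # Alias: assumed here to be case-sensitive, as argued above.
    return codeset == "csUTF8"


for name in ("UTF-8", "utf-8", "uTf-8", "csUTF8", "utf8"):
    print(name, is_iana_utf8(name))
```

Under these assumptions, every spelling that keeps the hyphen passes, as does the exact alias, while utf8 does not.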

If it's not IANA, where does utf8 likely come from?

glibc's _nl_normalize_codeset() does the following:

Passes through only letters and digits (goodbye, hyphen)

Converts letters to lowercase

/* Excerpt: letters are copied in lowercase and digits unchanged;
   any other character, such as '-', is skipped. */
for (cnt = 0; cnt < name_len; ++cnt)
  if (__isalpha_l ((unsigned char) codeset[cnt], locale))
    *wp++ = __tolower_l ((unsigned char) codeset[cnt], locale);
  else if (__isdigit_l ((unsigned char) codeset[cnt], locale))
    *wp++ = codeset[cnt];
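For anyone who wants to try this without building glibc, here is a rough Python sketch of just the two steps listed above. It is not a port of the real function: normalize_codeset is my own name for it, and anything else the glibc routine does outside the excerpt is left out.

```python
import string


def normalize_codeset(codeset: str) -> str:
    """Mimic the excerpt: keep ASCII letters (lowercased) and digits only."""
    kept = []
    for ch in codeset:
        if ch in string.ascii_letters:
            kept.append(ch.lower())   # letters are lowercased
        elif ch in string.digits:
            kept.append(ch)           # digits pass through unchanged
        # everything else, including '-', is dropped
    return "".join(kept)


for name in ("UTF-8", "utf-8", "uTf-8", "utf8"):
    print(name, "->", normalize_codeset(name))
```

In this sketch all four spellings collapse to the same string, which is consistent with a glibc system listing en_US.utf8 in locale -a.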

Related Solutions

Are there any disadvantages of using UTF8 in an oracle database

You have two choices to make:

Choose your database character set (used by VARCHAR2, CHAR, CLOB datatypes).
Choose your national character set (used by NVARCHAR2, NCHAR, NCLOB datatypes).

As seen here:

Oracle recommends using Unicode for all new system deployments.

National character sets can only be Unicode: UTF-8 or UTF-16. So choosing the same character set for both would be redundant...

My advice (you say your application is in English only):

Ask for your database character set to be UTF-8.
Ask for your national character set to be UTF-16.

And here is my general advice for your schema definition, table by table and column by column (I use VARCHAR2/NVARCHAR2 as the example here):

If your column could contain any character in the world (as in user input), make it NVARCHAR2.
If you have control over what is going to be stored (English, then), make it VARCHAR2.

Ubuntu – Server locale C vs en_US.UTF-8

You might want to edit /etc/default/locale to set the locale, as your export command will only affect the current environment. It will not affect already running programs.

The issue you mentioned regarding grep was already fixed a few years ago:

fixed in grep 2.7, released Sep 20, 2010

In multibyte locales, regular expressions including backreferences
no longer exhibit quadratic complexity (i.e., they are orders
of magnitude faster). [bug present since multi-byte character set
support was introduced in 2.5.2]

In UTF-8 locales, regular expressions including "." can be orders
of magnitude faster.  For example, "grep ." is now twice as fast
as "grep -v ^$", instead of being immensely slower.  It remains
slow in other multibyte locales. [bug present since multi-byte
character set support was introduced in 2.5.2]

http://savannah.gnu.org/forum/forum.php?forum_id=6521