Server info (DNS and IPs removed):
cat /proc/version && uname -a && java -version
Linux version 2.6.16.33-xenU (*************) (gcc version 4.1.1 20070105 (Red Hat 4.1.1-52)) #2 SMP Wed Aug 15 17:27:36 SAST 2007
Linux ************* *************-xenU #2 SMP Wed Aug 15 17:27:36 SAST 2007 x86_64 x86_64 x86_64 GNU/Linux
java version "1.6.0_14"
Java(TM) SE Runtime Environment (build 1.6.0_14-b08)
Java HotSpot(TM) 64-Bit Server VM (build 14.0-b16, mixed mode)
I have some PHP code that is reading from an Excel file and doing string comparisons. It is failing on the server due to what seems to be a locale issue. On my local machine (OSX 10.8.5 Mountain Lion) however, it works!
On my local machine the locale is en_US.UTF-8. On the server the locale was POSIX but I changed it to en_US.utf8 since there was no en_US.UTF-8 when I looked at locale -a (interestingly, the locales listed on the server are all lower case but on my Mac they are all upper case, which is where this question stems from).
Is there a difference between the two that could affect string comparisons?
Also, as per this SF post I ran locale -v -a. On the server, en_US.utf8 uses the UTF-8 codeset (I'm assuming this is the same as what I normally call charset?). However, on my local machine I seem unable to run the locale -v -a command, though locale and locale -a work fine.
Best Answer
TL;DR:
The codepage / character set .utf8 in en_US.utf8 is not officially recognised as far as I can tell. There is no IANA utf8 character set name. utf8 is likely generated by glibc (see the final heading).
The IANA character set name is UTF-8.
Therefore, these are all valid:
en_US.utf-8
en_US.UTF-8
en_US.uTf-8
There is also a case-sensitive alias for the name UTF-8, namely csUTF8.
Therefore, this would also be valid:
en_US.csUTF8
But I have never seen this in the wild.
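The rule above can be sketched as a small check. This is only an illustration under stated assumptions: the function names split_codeset and is_iana_utf8_codeset are hypothetical, the check knows only the name UTF-8 and the alias csUTF8 mentioned in this answer, and it says nothing about which locale names a given system actually installs.

```python
def split_codeset(locale_name):
    """Return the part after the first dot of a locale name, or None if absent."""
    if "." not in locale_name:
        return None
    return locale_name.split(".", 1)[1]


def is_iana_utf8_codeset(locale_name):
    """True if the codeset part is the IANA name UTF-8 or its alias csUTF8."""
    codeset = split_codeset(locale_name)
    if codeset is None:
        return False
    # The name UTF-8 is compared case-insensitively;
    # the alias csUTF8 is compared exactly, because its case is significant.
    return codeset.upper() == "UTF-8" or codeset == "csUTF8"


for name in ("en_US.utf-8", "en_US.UTF-8", "en_US.uTf-8", "en_US.csUTF8", "en_US.utf8"):
    print(name, is_iana_utf8_codeset(name))
```

Only the last name, en_US.utf8, fails the check, because it drops the hyphen.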
The details, with chapter and verse
UTF-8 is a valid IANA character set name, whereas utf8 is not. It's not even a valid alias.
POSIX.1-2017, section 8.2 Internationalization Variables describes locale values of the form:
language[_territory][.codeset]
Here the part in question is the [.codeset] part, which POSIX doesn't define, but IANA does.
For the character set defined by RFC 3629: UTF-8, a transformation format of ISO 10646, the IANA Character Sets registry lists the name as: UTF-8
and the note at the top says:
An alias csUTF8 is provided, about which RFC2978 IANA Charset Registration Procedures, section 2.3 says:
IANA Character Sets also says:
In the cs alias, the case is significant (while the name is defined as case insensitive, above).
Given the alias csUTF8, en_US.csUTF8 would also be valid, but I have never seen this format in the wild.
While case matters in aliases, regarding names, IANA Character Sets says that no distinction is made between use of upper and lower case letters.
So while en_US.utf-8 is valid (a lowercase version of the listed UTF-8), en_US.utf8 doesn't refer to an IANA character set as it drops the -.
If it's not IANA, where does utf8 likely come from?
glibc's _nl_normalize_codeset() does the following:
Only passes letters or digits (goodbye hyphen)
Converts letters to lowercase
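The two steps above can be modelled in a few lines of Python. This is a simplified sketch of the described behaviour, not the glibc C code: the function name normalize_codeset is hypothetical, and the real _nl_normalize_codeset() may do more than these two steps.

```python
def normalize_codeset(codeset):
    """Simplified model: keep ASCII letters and digits, lowercase the letters."""
    kept = []
    for ch in codeset:
        # Anything that is not an ASCII letter or digit (such as "-") is dropped.
        if ch.isascii() and ch.isalnum():
            kept.append(ch.lower())
    return "".join(kept)


for name in ("UTF-8", "utf-8", "uTf-8"):
    print(name, "->", normalize_codeset(name))
```

Under this model every spelling of UTF-8 collapses to utf8, which would explain why the server's locale -a lists en_US.utf8.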