Javascript – How to convert special UTF-8 chars to their iso-8859-1 equivalent using javascript

character encodingjavascriptjquery

I'm making a javascript app which retrieves .json files with jquery and injects data into the webpage it is embedded in.

The .json files are encoded with UTF-8 and contains accented chars like é, ö and å.

The problem is that I don't control the charset on the pages that are going to use the app.

Some will be using UTF-8, but others will be using the iso-8859-1 charset. This will of course garble the special chars from the .json files.

How do I convert special UTF-8 chars to their iso-8859-1 equivalent using javascript?

Best Answer

Actually, everything is typically stored as Unicode of some kind internally, but lets not go into that. I'm assuming you're getting the iconic "Ã¥Ã¤Ã¶" type strings because you're using an ISO-8859 as your character encoding. There's a trick you can do to convert those characters. The escape and unescape functions used for encoding and decoding query strings are defined for ISO characters, whereas the newer encodeURIComponent and decodeURIComponent which do the same thing, are defined for UTF8 characters.

escape encodes extended ISO-8859-1 characters (UTF code points U+0080-U+00ff) as %xx (two-digit hex) whereas it encodes UTF codepoints U+0100 and above as %uxxxx (%u followed by four-digit hex.) For example, escape("å") == "%E5" and escape("あ") == "%u3042".

encodeURIComponent percent-encodes extended characters as a UTF8 byte sequence. For example, encodeURIComponent("å") == "%C3%A5" and encodeURIComponent("あ") == "%E3%81%82".

So you can do:

fixedstring = decodeURIComponent(escape(utfstring));

For example, an incorrectly encoded character "å" becomes "Ã¥". The command does escape("Ã¥") == "%C3%A5" which is the two incorrect ISO characters encoded as single bytes. Then decodeURIComponent("%C3%A5") == "å", where the two percent-encoded bytes are being interpreted as a UTF8 sequence.

If you'd need to do the reverse for some reason, that works too:

utfstring = unescape(encodeURIComponent(originalstring));

Is there a way to differentiate between bad UTF8 strings and ISO strings? Turns out there is. The decodeURIComponent function used above will throw an error if given a malformed encoded sequence. We can use this to detect with a great probability whether our string is UTF8 or ISO.

var fixedstring;

try{
    // If the string is UTF-8, this will work and not throw an error.
    fixedstring=decodeURIComponent(escape(badstring));
}catch(e){
    // If it isn't, an error will be thrown, and we can assume that we have an ISO string.
    fixedstring=badstring;
}

Do:

var isTrueSet = (myValue === 'true');

using the identity operator (===), which doesn't make any implicit type conversions when the compared variables have different types.

Don't:

You should probably be cautious about using these two methods for your specific needs:

var myBool = Boolean("false");  // == true

var myBool = !!"false";  // == true

Any string which isn't the empty string will evaluate to true by using them. Although they're the cleanest methods I can think of concerning to boolean conversion, I think they're not what you're looking for.

Best Answer

Related Solutions

Javascript – How to convert decimal to hexadecimal in JavaScript

Javascript – How to convert a string to boolean in JavaScript

Do:

Don't:

Related Topic