Ms-access – How to convert MS Access database encoding to UTF-8

asp-classic, character encoding, ms-access

I am currently working on a legacy Classic ASP + MS-Access application. I recently converted all the .asp files to UTF-8 from ISO-8859 (Windows) encoding.

The problem I have now is that the text stored inside the database (French with accented characters) displays improperly when rendered inside the web pages, because the encodings are inconsistent. How do I convert my MS Access database encoding from ISO-8859 to UTF-8?

Best Answer

How do I convert my MS Access database encoding from ISO-8859 to UTF-8?

You don't. Access can handle Unicode text but it DOES NOT store it as UTF-8. There are ways to directly insert UTF-8 encoded text into Access Text fields but that leads to strange behaviour as illustrated in my other answer here.

For an ASP application, simply use .asp pages encoded as UTF-8, tell IIS to produce UTF-8 output (via the <%@ CODEPAGE = 65001 %> directive), and let IIS and the Access OLEDB driver handle the conversion between "Access Unicode" and UTF-8.

For a detailed example of Access, Classic ASP, and UTF-8 see my answer here:

Capture and insert Unicode text (Cyrillic) into MS access database

Important Note

Be aware that you should NOT be using an Access database as a back-end data store for a web application; Microsoft strongly recommends against doing so (ref: here).

Related Solutions

C# – How to get a consistent byte representation of strings in C# without manually specifying an encoding

Contrary to the answers here, you DON'T need to worry about encoding if the bytes don't need to be interpreted!

Like you mentioned, your goal is, simply, to "get what bytes the string has been stored in".
(And, of course, to be able to re-construct the string from the bytes.)

For those goals, I honestly do not understand why people keep telling you that you need the encodings. You certainly do NOT need to worry about encodings for this.

Just do this instead:

static byte[] GetBytes(string str)
{
    byte[] bytes = new byte[str.Length * sizeof(char)];
    System.Buffer.BlockCopy(str.ToCharArray(), 0, bytes, 0, bytes.Length);
    return bytes;
}

// Do NOT use on arbitrary bytes; only use on GetBytes's output on the SAME system
static string GetString(byte[] bytes)
{
    char[] chars = new char[bytes.Length / sizeof(char)];
    System.Buffer.BlockCopy(bytes, 0, chars, 0, bytes.Length);
    return new string(chars);
}

As long as neither your program nor any other program tries to interpret the bytes, which you did not mention intending to do, there is nothing wrong with this approach! Worrying about encodings just makes your life more complicated for no real reason.

Additional benefit to this approach: It doesn't matter if the string contains invalid characters, because you can still get the data and reconstruct the original string anyway!

It will be encoded and decoded just the same, because you are just looking at the bytes.

If you used a specific encoding, though, it would've given you trouble with encoding/decoding invalid characters.
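
The round-trip claim can be illustrated with a rough Java analogue, offered only as a sketch: Java strings, like C# strings, are sequences of UTF-16 code units. The class and method names below (RawChars, toRawBytes, fromRawBytes) are hypothetical, and unlike the C# code above, which uses the machine's native byte order, this version writes big-endian bytes because that is ByteBuffer's default.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class RawChars {
    // Copy the UTF-16 code units verbatim, two bytes per char (big-endian).
    static byte[] toRawBytes(String s) {
        ByteBuffer buf = ByteBuffer.allocate(s.length() * Character.BYTES);
        buf.asCharBuffer().put(s);
        return buf.array();
    }

    // Only meaningful for arrays produced by toRawBytes.
    static String fromRawBytes(byte[] bytes) {
        return ByteBuffer.wrap(bytes).asCharBuffer().toString();
    }

    public static void main(String[] args) {
        String broken = "a\uD800b"; // contains an unpaired high surrogate

        // Raw copy: the invalid code unit survives untouched.
        String viaRaw = fromRawBytes(toRawBytes(broken));
        System.out.println(viaRaw.equals(broken));  // true

        // Going through UTF-8: the unpaired surrogate is replaced, so the
        // original string cannot be reconstructed.
        byte[] utf8 = broken.getBytes(StandardCharsets.UTF_8);
        String viaUtf8 = new String(utf8, StandardCharsets.UTF_8);
        System.out.println(viaUtf8.equals(broken)); // false
    }
}
```

The raw bytes are only useful to code that knows they are UTF-16 code units in that byte order, which matches the "same system" caveat in the comment on GetString above.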

Java – How to convert between ISO-8859-1 and UTF-8 in Java

In general, you can't do this. UTF-8 is capable of encoding any Unicode code point. ISO-8859-1 can handle only a tiny fraction of them. So, transcoding from ISO-8859-1 to UTF-8 is no problem. Going backwards from UTF-8 to ISO-8859-1 will cause replacement characters (a "?" when using String.getBytes as below) to appear in your text when unsupported characters are found.

To transcode text:

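// ISO-8859-1 to UTF-8 (lossless)
byte[] latin1 = ...
byte[] utf8 = new String(latin1, "ISO-8859-1").getBytes("UTF-8");

// UTF-8 to ISO-8859-1 (lossy); a separate snippet, so the names are reused
byte[] utf8 = ...
byte[] latin1 = new String(utf8, "UTF-8").getBytes("ISO-8859-1");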
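
As a self-contained sketch of the same two conversions, the version below uses the StandardCharsets constants, which avoid the checked UnsupportedEncodingException thrown by the string-named overloads. The class and method names (TranscodeDemo, latin1ToUtf8, utf8ToLatin1) are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class TranscodeDemo {
    // ISO-8859-1 -> UTF-8: every Latin-1 byte maps to a Unicode code point.
    static byte[] latin1ToUtf8(byte[] latin1) {
        return new String(latin1, StandardCharsets.ISO_8859_1)
                .getBytes(StandardCharsets.UTF_8);
    }

    // UTF-8 -> ISO-8859-1: characters above U+00FF cannot be represented
    // and are silently replaced by String.getBytes.
    static byte[] utf8ToLatin1(byte[] utf8) {
        return new String(utf8, StandardCharsets.UTF_8)
                .getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] eAcuteLatin1 = {(byte) 0xE9};  // "é" in ISO-8859-1
        byte[] eAcuteUtf8 = latin1ToUtf8(eAcuteLatin1);
        System.out.println(Arrays.toString(eAcuteUtf8)); // [-61, -87], i.e. 0xC3 0xA9

        byte[] euroUtf8 = "\u20AC".getBytes(StandardCharsets.UTF_8); // euro sign
        byte[] euroLatin1 = utf8ToLatin1(euroUtf8);
        System.out.println(Arrays.toString(euroLatin1)); // [63], i.e. '?'
    }
}
```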

You can exercise more control by using the lower-level Charset APIs. For example, you can raise an exception when an un-encodable character is found, or use a different character for replacement text.
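
A minimal sketch of that lower-level route, assuming java.nio.charset.CharsetEncoder is the API meant here: CodingErrorAction.REPORT makes the encoder throw on un-encodable input, while CodingErrorAction.REPLACE combined with replaceWith substitutes a byte of your choosing. The class and method names (StrictLatin1, encodeStrict, encodeWithReplacement) are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictLatin1 {
    // Throws CharacterCodingException if any character has no
    // ISO-8859-1 representation, instead of silently substituting.
    static byte[] encodeStrict(String text) throws CharacterCodingException {
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return toArray(encoder.encode(CharBuffer.wrap(text)));
    }

    // Substitutes the given byte for every un-encodable character.
    static byte[] encodeWithReplacement(String text, byte replacement)
            throws CharacterCodingException {
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE)
                .replaceWith(new byte[] {replacement});
        return toArray(encoder.encode(CharBuffer.wrap(text)));
    }

    // Copy the remaining bytes of the buffer into a right-sized array.
    private static byte[] toArray(ByteBuffer buffer) {
        byte[] bytes = new byte[buffer.remaining()];
        buffer.get(bytes);
        return bytes;
    }

    public static void main(String[] args) throws CharacterCodingException {
        byte[] replaced = encodeWithReplacement("a\u20ACb", (byte) '_');
        System.out.println(new String(replaced, StandardCharsets.ISO_8859_1)); // a_b

        try {
            encodeStrict("\u20AC"); // the euro sign is not in ISO-8859-1
        } catch (CharacterCodingException e) {
            System.out.println("not encodable: " + e);
        }
    }
}
```

The strict variant suits cases where silent data loss is unacceptable, such as migrating stored text; the replacing variant suits display-only output.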
