How to Detect File Encoding

character-encoding, file-systems, notepad, utf-8

On my filesystem (Windows 7) I have some text files (These are SQL script files, if that matters).

When opened with Notepad++, in the "Encoding" menu some of them are reported to have an encoding of "UCS-2 Little Endian" and some of "UTF-8 without BOM".

What is the difference here? They all seem to be perfectly valid scripts. How could I tell what encoding the files have without Notepad++?

Best Answer

Files generally indicate their encoding with a file header. There are many examples here. However, even after reading the header you can never be sure what encoding a file is really using.

For example, a file with the first three bytes 0xEF,0xBB,0xBF is probably a UTF-8 encoded file. However, it might be an ISO-8859-1 file which happens to start with the characters ï»¿. Or it might be a different file type entirely.
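If you want to do that header check yourself, here is a minimal sketch in Python (not anything Notepad++ does internally) that reads the first few bytes of a file and compares them against the well-known BOM signatures; the file name "script.sql" is just a placeholder:

```python
# Sketch: identify a file's encoding by its byte-order mark (BOM), if any.
# Longer signatures are listed first so UTF-32 LE isn't mistaken for UTF-16 LE.
BOMS = [
    (b"\xff\xfe\x00\x00", "UTF-32 Little Endian"),
    (b"\x00\x00\xfe\xff", "UTF-32 Big Endian"),
    (b"\xef\xbb\xbf",     "UTF-8 with BOM"),
    (b"\xff\xfe",         "UTF-16 Little Endian (Notepad++'s 'UCS-2 Little Endian')"),
    (b"\xfe\xff",         "UTF-16 Big Endian"),
]

def sniff_bom(path):
    # Read only the first 4 bytes; that's enough for every BOM above.
    with open(path, "rb") as f:
        head = f.read(4)
    for bom, name in BOMS:
        if head.startswith(bom):
            return name
    return "no BOM (could be UTF-8 without BOM, ANSI/ISO-8859-1, or something else)"

print(sniff_bom("script.sql"))
```

As the example above shows, the absence of a BOM tells you very little on its own, which is exactly the ambiguity described here.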

Notepad++ does its best to guess what encoding a file is using, and most of the time it gets it right. Sometimes it does get it wrong though - that's why that 'Encoding' menu is there, so you can override its best guess.

For the two encodings you mention:

  • The "UCS-2 Little Endian" files are UTF-16 files (based on what I understand from the info here) so probably start with 0xFF,0xFE as the first 2 bytes. From what I can tell, Notepad++ describes them as "UCS-2" since it doesn't support certain facets of UTF-16.
  • The "UTF-8 without BOM" files don't have any header bytes. That's what the "without BOM" bit means.