I have a directory with ~10,000 image files from an external source.
Many of the filenames contain spaces and punctuation marks that are not database-friendly or web-friendly. I also want to append a SKU number to the end of every filename (for accounting purposes). Many, if not most, of the filenames also contain extended Latin characters, which I want to keep for SEO purposes (specifically, so that the filenames accurately represent the file contents in Google Images).
I have written a bash script that renames (copies) all the files to my desired result. The script is saved in UTF-8. When I run it, it skips approximately 500 of the files (unable to stat file…).
I have run convmv -f UTF-8 -t UTF-8 on the directory and discovered that these 500 filenames are not encoded in UTF-8 (convmv is able to detect and ignore filenames that are already in UTF-8).
Is there an easy way to find out which character encoding they currently use?
The only way I've been able to figure out myself is to set my terminal encoding to UTF-8 and then iterate through all the likely candidate encodings with convmv until it displays a converted name that 'looks right'. I have no way to be certain that these 500 files all use the same encoding, so I might need to repeat this process 500 times. I would like a more automated method than 'looks right'!
Best Answer
There's no 100% accurate way, but it is possible to make a good guess.
There is a Python library, chardet, which is available here: https://pypi.python.org/pypi/chardet
e.g.
See what the current LANG variable is set to:
Create a filename that'll need to be encoded with UTF-8:
Change our encoding and see what happens when we try to list it:
OK, so now we have a filename encoded in UTF-8, and our current locale is C (the default POSIX locale, which assumes plain ASCII).
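The commands for these steps are not shown here, so as a rough stand-in, here is a minimal Python sketch of the same situation. The sample name mñ and the helper names create_sample and ascii_view are assumptions made purely for illustration. The point is only that the name is stored on disk as UTF-8 bytes, and that an ASCII-only view cannot render those bytes as characters.

```python
import os
import tempfile


def create_sample(directory, name="mñ"):
    """Create an empty file whose name is stored as UTF-8 bytes; return those bytes.

    The default name is a made-up example, not taken from the original post.
    """
    raw = name.encode("utf-8")
    path = os.path.join(os.fsencode(directory), raw)
    with open(path, "wb"):
        pass
    return raw


def ascii_view(raw):
    """Render a raw filename as an ASCII-only view would: non-ASCII bytes escaped."""
    return raw.decode("ascii", errors="backslashreplace")


# Work in a throwaway directory so nothing real is touched.
with tempfile.TemporaryDirectory() as tmp:
    raw = create_sample(tmp)
    # Listing with a bytes path returns the names exactly as stored on disk.
    print(os.listdir(os.fsencode(tmp)))
    print(ascii_view(raw))
```

Passing a bytes path to os.listdir sidesteps the locale entirely, which is useful later when the names are not valid UTF-8.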
So start up Python, import chardet and get it to read the filename. I'm using some shell globbing (i.e. expansion through the * wildcard character) to get my file. Change "ls m*" to whatever will match one of your example files.
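A session along those lines might look roughly like the sketch below. It assumes chardet is installed (pip install chardet) and relies on chardet.detect, which takes bytes and returns a dict containing "encoding" and "confidence" keys. The function name guess_filename_encoding and its optional detect parameter are hypothetical, added only so that a different detector can be swapped in. Python's glob module stands in for the shell's "ls m*".

```python
import glob
import os


def guess_filename_encoding(pattern, detect=None):
    """Return (raw_name, guess) pairs for every file matching the glob pattern.

    `detect` is any callable that maps bytes to a dict; it defaults to
    chardet.detect, which is imported only when actually needed.
    """
    if detect is None:
        import chardet  # third-party: pip install chardet
        detect = chardet.detect

    results = []
    # A bytes pattern makes glob return raw byte paths, so names that are
    # not valid in the current locale are passed through untouched.
    for path in sorted(glob.glob(os.fsencode(pattern))):
        raw_name = os.path.basename(path)
        results.append((raw_name, detect(raw_name)))
    return results


# Example use (change "m*" to whatever matches one of your files):
#     for raw_name, guess in guess_filename_encoding("m*"):
#         print(raw_name, guess)
```

Bear in mind that a filename is a very short sample, so the detector has little to work with and the guess may be weak.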
As you can see, it's only a guess. How good the guess is is indicated by the "confidence" value in the result.
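To avoid checking 500 names by hand, the same guess could be applied to the whole directory and the results grouped by encoding. This is only a sketch under some assumptions: group_names, the "needs-review" label and the 0.8 confidence cut-off are all made up for illustration, and the detector defaults to chardet.detect but can be replaced. Names that already decode as UTF-8 are skipped.

```python
from collections import defaultdict


def group_names(raw_names, detect=None, min_confidence=0.8):
    """Group non-UTF-8 byte filenames by their guessed encoding.

    Guesses below `min_confidence` (an arbitrary threshold) or with no
    encoding at all are collected under "needs-review".
    """
    if detect is None:
        import chardet  # third-party: pip install chardet
        detect = chardet.detect

    groups = defaultdict(list)
    for raw in raw_names:
        try:
            raw.decode("utf-8")
            continue  # already valid UTF-8, nothing to guess
        except UnicodeDecodeError:
            pass
        guess = detect(raw)
        encoding = guess.get("encoding")
        confidence = guess.get("confidence") or 0.0
        if encoding is None or confidence < min_confidence:
            groups["needs-review"].append(raw)
        else:
            groups[encoding].append(raw)
    return dict(groups)


# Example use on a real directory (bytes path, so names come back raw):
#     import os
#     print(group_names(os.listdir(os.fsencode("/path/to/images"))))
```

Each group can then be spot-checked once and converted with convmv using the guessed encoding, while the "needs-review" names still need a human eye.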