Linux – How to Determine Filename Language Encoding

encoding, ext3, linux

I have a directory with ~10,000 image files from an external source.

Many of the filenames contain spaces and punctuation marks that are not DB-friendly or Web-friendly. I also want to append a SKU number to the end of every filename (for accounting purposes). Many, if not most, of the filenames also contain extended Latin characters, which I want to keep for SEO purposes (specifically, so the filenames accurately represent the file contents in Google Images).

I have written a bash script that renames (copies) all the files to my desired result. The script is saved in UTF-8. When I run it, it skips approximately 500 of the files (unable to stat file…).

I ran convmv -f UTF-8 -t UTF-8 on the directory and discovered that these 500 filenames are not encoded in UTF-8 (convmv is able to detect and ignore filenames already in UTF-8).

Is there an easy way I can find out which language encoding they are currently using?

The only way I've been able to figure out myself is to set my terminal encoding to UTF-8, then iterate through all the likely candidate encodings with convmv until it displays a converted name that 'looks right'. I have no way to be certain that these 500 files all use the same encoding, so I would need to repeat this process 500 times. I would like a more automated method than 'looks right'!

Best Answer

There's no 100% accurate way, but there is a way to make a good guess.
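
To see why only a guess is possible, it helps to decode the same raw bytes under several encodings: single-byte encodings accept almost any byte sequence, so more than one candidate usually "succeeds". A minimal sketch using only the Python standard library; the candidate list and the sample name are assumptions for illustration, not taken from the asker's files:

```python
# Candidate encodings to try; this list is an assumption, adjust as needed.
CANDIDATES = ["utf-8", "latin-1", "cp1252", "cp850"]

def decode_candidates(raw, encodings=CANDIDATES):
    """Decode raw filename bytes under each encoding; None means invalid."""
    results = {}
    for enc in encodings:
        try:
            results[enc] = raw.decode(enc)
        except UnicodeDecodeError:
            results[enc] = None
    return results

# Hypothetical sample: the name "mÉ.txt" as it would be stored in Latin-1.
sample = b"m\xc9.txt"
for enc, text in decode_candidates(sample).items():
    print(f"{enc:8} {text!r}")
```

Here UTF-8 rejects the bytes, but the other candidates all decode without error, and not all to the same text. That is why a human or a statistical detector still has to pick.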

The Python library chardet can do this; it is available here: https://pypi.python.org/pypi/chardet

For example:

See what the current LANG variable is set to:

$ echo $LANG
en_IE.UTF-8

Create a file whose name contains a non-ASCII character, which will be stored UTF-8 encoded under this locale:

$ touch mÉ.txt

Change the locale and see what happens when we try to list it:

$ ls m*
mÉ.txt
$ export LANG=C
$ ls m*
m??.txt

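OK, so now we have a filename encoded in UTF-8 and our current locale is C (the standard POSIX locale). That is why ls shows two question marks: É takes two bytes in UTF-8, and ls prints ? for each byte it cannot display.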
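
The two question marks correspond to the two bytes that UTF-8 uses for É. A small Python sketch (standard library only) that shows the bytes behind the name:

```python
name = "m\u00c9.txt"          # the same name as above: mÉ.txt
raw = name.encode("utf-8")    # bytes as stored in the directory entry
print(raw)                    # É becomes the two bytes 0xC3 0x89
print(len(name), "characters,", len(raw), "bytes")
```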

Now start Python, import chardet, and have it read the filename. I use shell globbing (i.e. expansion through the * wildcard character) to get my file. Change "ls m*" to whatever will match one of your example files.

>>> import chardet
>>> import subprocess
>>> # chardet.detect() expects bytes, so capture the raw output of ls
>>> chardet.detect(subprocess.check_output("ls m*", shell=True))
{'confidence': 0.505, 'encoding': 'utf-8'}

As you can see, it's only a guess. The "confidence" value shows how good a guess it is.
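
For the ~500 problem files in the question, the same idea can be applied per filename rather than to the output of ls. The following is a rough sketch, not a tested tool: it assumes Python 3 and the chardet package, the helper names is_valid_utf8, non_utf8_names and guess_encodings are made up for illustration, and short names give chardet little to work with, so low confidence values are to be expected.

```python
import os

def is_valid_utf8(raw):
    """Return True if the raw filename bytes decode cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def non_utf8_names(directory="."):
    """List raw (bytes) filenames in `directory` that are not valid UTF-8."""
    # Passing a bytes path makes os.listdir return undecoded bytes names.
    names = os.listdir(os.fsencode(directory))
    return sorted(n for n in names if not is_valid_utf8(n))

def guess_encodings(names, detect=None):
    """Map each raw name to (encoding, confidence) as guessed by `detect`."""
    if detect is None:
        import chardet  # third-party package: pip install chardet
        detect = chardet.detect
    guesses = {}
    for raw in names:
        result = detect(raw)
        guesses[raw] = (result["encoding"], result["confidence"])
    return guesses

# Example use (uncomment to run against the current directory):
# for raw, (enc, conf) in guess_encodings(non_utf8_names(".")).items():
#     print(raw, enc, conf)
```

Names that end up grouped under the same encoding with high confidence could then be converted together with convmv; the rest still need a manual look.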

Related Solutions

Linux – Filename Length Limits Explained

See the Wikipedia page comparing file systems, especially the "Maximum filename length" column.

Here are some filename length limits in popular file systems:

BTRFS   255 bytes
exFAT   255 UTF-16 characters
ext2    255 bytes
ext3    255 bytes
ext3cow 255 bytes
ext4    255 bytes
FAT32   8.3 (255 UCS-2 code units with VFAT LFNs)
NTFS    255 characters
XFS     255 bytes
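
Because most of the Linux file systems in this list count bytes rather than characters, a name with non-ASCII characters can hit the limit before it reaches 255 characters. A minimal Python sketch of that check, assuming names are stored as UTF-8 (the function name is hypothetical):

```python
def fits_byte_limit(name, limit=255, encoding="utf-8"):
    """Return True if `name` encodes to at most `limit` bytes."""
    return len(name.encode(encoding)) <= limit

print(fits_byte_limit("a" * 255))       # 255 one-byte characters
print(fits_byte_limit("\u00c9" * 128))  # 128 two-byte characters = 256 bytes
```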

Linux – How to run a server on port 80 as a normal user on Linux

Short answer: you can't. Ports below 1024 can be opened only by root. As pointed out in a comment, you can in fact do it using CAP_NET_BIND_SERVICE, but that approach, applied to the java binary, makes every Java program run with this capability, which is undesirable, if not a security risk.

The long answer: you can redirect connections on port 80 to some other port that you can open as a normal user.

Run as root:

# iptables -t nat -A PREROUTING -p tcp --dport 80 -j REDIRECT --to-ports 8080

Since loopback traffic (e.g. to localhost) does not pass through the PREROUTING rules, add this rule as well if you need the redirect to work for localhost (thanks @Francesco):

# iptables -t nat -I OUTPUT -p tcp -d 127.0.0.1 --dport 80 -j REDIRECT --to-ports 8080
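
With the redirect in place, the server process itself only has to listen on the high port and needs no special privileges. A minimal sketch using Python's standard http.server module; the handler and the make_server helper are illustrative and not part of the original answer:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class HelloHandler(BaseHTTPRequestHandler):
    """Answer every GET request with a short plain-text body."""
    def do_GET(self):
        body = b"hello from an unprivileged port\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

def make_server(port=8080, host=""):
    """Create (but do not start) a server bound to the given high port."""
    return HTTPServer((host, port), HelloHandler)

# To run as a normal user; port 80 traffic reaches it via the redirect:
# make_server(8080).serve_forever()
```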

NOTE: The above solution is not well suited for multi-user systems, as any user can open port 8080 (or any other high port you decide to use), thus intercepting the traffic. (Credits to CesarB).

EDIT: in response to a question in the comments, to delete the above rule, first list the rules with their line numbers:

# iptables -t nat --line-numbers -n -L

This will output something like:

Chain PREROUTING (policy ACCEPT)
num  target     prot opt source               destination         
1    REDIRECT   tcp  --  0.0.0.0/0            0.0.0.0/0           tcp dpt:8080 redir ports 8088
2    REDIRECT   tcp  --  0.0.0.0/0            0.0.0.0/0           tcp dpt:80 redir ports 8080

The rule you are interested in is number 2, so to delete it:

# iptables -t nat -D PREROUTING 2
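
Picking the rule number by eye works for a short list. For scripting, the number can be read from the listing instead. A rough Python sketch that parses text in the format shown above; the function name is hypothetical and the parsing assumes the column layout of this sample output:

```python
def find_redirect_rule(listing, dport):
    """Return the line number of the REDIRECT rule matching `dport`, or None."""
    wanted = f"dpt:{dport}"
    for line in listing.splitlines():
        fields = line.split()
        # Rule lines start with their number; headers and blanks do not.
        if fields and fields[0].isdigit() and "REDIRECT" in fields:
            if wanted in fields:  # exact token, so dpt:80 != dpt:8080
                return int(fields[0])
    return None

sample = """\
Chain PREROUTING (policy ACCEPT)
num  target     prot opt source               destination
1    REDIRECT   tcp  --  0.0.0.0/0            0.0.0.0/0           tcp dpt:8080 redir ports 8088
2    REDIRECT   tcp  --  0.0.0.0/0            0.0.0.0/0           tcp dpt:80 redir ports 8080
"""
print(find_redirect_rule(sample, 80))
```

The function returns the first match it finds, so if the listing covers several chains, pass it only the PREROUTING part.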
