Linux – Using split without breaking the encoding

linux

I need to split a file. I usually use split, but this time I need the resulting pieces to have the same encoding as the original. The original is:

eianni@ianni-desktop:~/Desktop$ file FCAna.txt 
FCAna.txt: ISO-8859 text, with CRLF line terminators

while the new ones are:

eianni@ianni-desktop:~/Desktop$ file xaa
xaa: ISO-8859 text, with CRLF line terminators
eianni@ianni-desktop:~/Desktop$ file xab
xab: Non-ISO extended-ASCII text, with CRLF line terminators

The second one is not OK. How can I solve this?
The command I ran is

split --lines=1588793 FCAna.txt

Best Answer

I think this could be down to the way file works. Reading from the manpage:

ASCII, ISO-8859-x, non-ISO 8-bit extended-ASCII character sets (such as those used on Macintosh and IBM PC systems), UTF-8-encoded Unicode, UTF-16-encoded Unicode, and EBCDIC character sets can be distinguished by the different ranges and sequences of bytes that constitute printable text in each set.

My interpretation of this statement is that file guesses the encoding from the byte values and sequences that actually occur in the data it examines. For the UTF encodings, a BOM or the pattern of multi-byte sequences can be used. Your original text file may contain characters that can only be encoded in an 8-bit extended ASCII character set (a pound symbol (£), maybe?), so file reported it as ISO-8859. split does not re-encode anything; it copies the bytes unchanged, so each piece is classified only by the bytes it happens to contain, and a piece can therefore be reported differently from the original or from the other pieces. You should be able to test this hypothesis by searching the text for "extended" characters and splitting at different points.
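One way to check this is to count, for each piece, how many bytes fall into each high-byte range. The sketch below is only an illustration: count_range is a hypothetical helper, xaa and xab are the default split output names from the question, and the two ranges rest on my understanding of file's heuristics (high bytes only in 0xA0-0xFF tend to be reported as ISO-8859, while bytes in 0x80-0x9F tend to produce "Non-ISO extended-ASCII").

```shell
#!/bin/sh
# count_range is a hypothetical helper: it prints how many bytes of the file
# given as $2 fall inside the tr(1) byte set given as $1 (octal escapes).
count_range() {
    # LC_ALL=C makes tr work on raw bytes instead of locale characters;
    # the trailing tr strips the padding some wc implementations add.
    LC_ALL=C tr -cd "$1" < "$2" | wc -c | tr -d ' '
}

# Report both high-byte ranges for each piece that exists.
for f in xaa xab; do
    [ -f "$f" ] || continue
    printf '%s: 0x80-0x9F=%s 0xA0-0xFF=%s\n' "$f" \
        "$(count_range '\200-\237' "$f")" \
        "$(count_range '\240-\377' "$f")"
done
```

If xab shows a non-zero count for 0x80-0x9F while xaa does not, the difference in what file reports would come from the data itself rather than from split.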

As a test I did the following:

[root@blah ~]# echo "this is a test of text encoding" > test_encoding.txt
[root@blah ~]# file test_encoding.txt
test_encoding.txt: ASCII text
[root@blah ~]# echo "£" >> test_encoding.txt
[root@blah ~]# file test_encoding.txt
test_encoding.txt: ISO-8859 text
[root@blah ~]#
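The result of the echo test above depends on the terminal's own encoding, because that decides which bytes "£" is written as. A variant that writes exact bytes with printf avoids this. It is a sketch under assumptions: the scratch directory and file names are made up for illustration, and the byte values are chosen on the assumption that 0xA3 is "£" in ISO-8859-1 and 0x92 is a curly apostrophe in Windows-1252.

```shell
#!/bin/sh
# Hypothetical scratch directory and file names, for illustration only.
dir=$(mktemp -d)

# printf octal escapes write exact bytes, independent of the terminal encoding.
printf 'plain text\r\n' > "$dir/ascii.txt"    # only bytes below 0x80
printf 'pound \243\r\n' > "$dir/latin1.txt"   # one byte 0xA3
printf 'quote \222\r\n' > "$dir/cp1252.txt"   # one byte 0x92

# Compare what file reports for each one, if file is installed.
if command -v file >/dev/null 2>&1; then
    file "$dir"/*.txt
fi
```

Comparing the three reports should show whether a single byte in the 0x80-0x9F range is enough to change the classification on your system.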

Is there a reason you needed the file encodings reported by file to match?

Related Solutions

Linux – filenames encoding problem when migrating a PHP Application from Windows Server 2003 to Linux

Windows usually uses Unicode to encode non-ASCII characters, so if you're using a Unicode locale on your Debian server you're set. The locale doesn't have to be French just because the characters you're trying to use are specific to French. I just tested this: I have my LANG set to en_US.UTF-8, and I can create a file with the name you mentioned ("accusé réception.pdf") and it shows up that way as well.

Chances are the accents are there and just can't be displayed. To test this theory, replace that "ls" command with "LANG=en_US.UTF8 ls". If the name shows up correctly, the problem is just your terminal. In that case, set your LANG variable in your shell's startup file (e.g. .bashrc) or system-wide in /etc/default/locale.

Linux – How to read a single file in a maildir

Ok, answering my own question here, based on some googling and the helpful comments by mailq.

In short: I installed and used mutt. I had to fiddle a bit with my setup: Inside the directory my_dir where fakemail was creating the mail files, I created the dirs new, cur and tmp and pointed fakemail to my_dir/new. Then I started mutt with

mutt -f my_dir

Now I can review new mails, look at old mails, the umlauts are properly displayed - perfect!
