Linux – Using split without breaking the encoding

linux

I need to split a file. I usually use split, but this time I need the resulting pieces to have the same encoding as the original. This is the original:

eianni@ianni-desktop:~/Desktop$ file FCAna.txt 
FCAna.txt: ISO-8859 text, with CRLF line terminators

while new ones are:

eianni@ianni-desktop:~/Desktop$ file xaa
xaa: ISO-8859 text, with CRLF line terminators
eianni@ianni-desktop:~/Desktop$ file xab
xab: Non-ISO extended-ASCII text, with CRLF line terminators

The second one is not OK. How can I solve this?
The command I ran is:

split --lines=1588793 FCAna.txt

Best Answer

I think this could be down to the way file works. Reading from the manpage:

ASCII, ISO-8859-x, non-ISO 8-bit extended-ASCII character sets (such as those used on Macintosh and IBM PC systems), UTF-8-encoded Unicode, UTF-16-encoded Unicode, and EBCDIC character sets can be distinguished by the different ranges and sequences of bytes that constitute printable text in each set.

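My interpretation of this statement is that file guesses the encoding from the bytes it actually finds in the text. For UTF encodings, for example, characteristic multi-byte sequences or the existence of a BOM could be used. Your original text file presumably contains characters that cannot be encoded in plain ASCII (a pound symbol (£) maybe?), which is why file reports it as ISO-8859 rather than ASCII. split copies the bytes unchanged, so each piece is classified on its own content: if the bytes that tip file towards one verdict or another are not spread evenly through the file, the pieces can be reported differently from the original and from each other. As far as I can tell, "Non-ISO extended-ASCII" means file found bytes that ISO-8859 does not use for printable characters (roughly 0x80-0x9F, where Windows-1252 puts the euro sign and curly quotes), and file only examines the beginning of a large file, so such bytes further in would not have affected the verdict on the original. You should be able to test this hypothesis by searching the text for "extended" characters and splitting at different points.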
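One way to search for those "extended" characters is to count high-bit bytes in each piece. The sketch below is only an illustration: count_bytes_in_range is a made-up helper, the two sample files stand in for xaa and xab, and the split into the 0x80-0x9F and 0xA0-0xFF ranges is my assumption about what file's heuristic looks at, not something taken from its documentation.

```shell
#!/bin/sh
# Hypothetical helper: count the bytes of a file that fall inside a byte range.
# $1 = file, $2 = tr-style octal range, e.g. '\200-\237' for 0x80-0x9F.
count_bytes_in_range() {
    # tr -cd keeps only the bytes in the range; wc -c counts what is left.
    LC_ALL=C tr -cd "$2" < "$1" | wc -c | tr -d ' '
}

# Stand-in sample files; point the loop at xaa, xab, ... for the real check.
demo_dir=$(mktemp -d)
printf 'caf\351\r\n' > "$demo_dir/part_latin1"       # 0xE9: e-acute in ISO-8859-1
printf 'price \200 5\r\n' > "$demo_dir/part_cp1252"  # 0x80: euro sign in Windows-1252

for part in "$demo_dir"/part_*; do
    c1=$(count_bytes_in_range "$part" '\200-\237')    # bytes in 0x80-0x9F
    high=$(count_bytes_in_range "$part" '\240-\377')  # bytes in 0xA0-0xFF
    echo "$(basename "$part"): 0x80-0x9F=$c1 0xA0-0xFF=$high"
done
```

If only xab shows a non-zero count in the 0x80-0x9F column, that would be consistent with file labelling it differently from xaa.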

As a test I did the following:

[root@blah ~]# echo "this is a test of text encoding" > test_encoding.txt
[root@blah ~]# file test_encoding.txt
test_encoding.txt: ASCII text
[root@blah ~]# echo "£" >> test_encoding.txt
[root@blah ~]# file test_encoding.txt
test_encoding.txt: ISO-8859 text
[root@blah ~]#
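If the worry is that split itself changed the data, a round trip can rule that out: concatenating the pieces in order should give back the original byte for byte. This is a minimal sketch on a throwaway file; the file names and the part_ prefix are made up for the example, and -l is the short form of the --lines option used in the question.

```shell
#!/bin/sh
# Build a small stand-in file with CRLF line endings and two high-bit bytes.
work_dir=$(mktemp -d)
printf 'line one\r\ncaf\351\r\nprice \200 5\r\nlast line\r\n' > "$work_dir/original.txt"

# Split into pieces of two lines each: part_aa, part_ab.
split -l 2 "$work_dir/original.txt" "$work_dir/part_"

# Concatenate the pieces in name order and compare against the original.
if cat "$work_dir"/part_* | cmp -s - "$work_dir/original.txt"; then
    roundtrip="identical"
else
    roundtrip="different"
fi
echo "round trip: $roundtrip"
```

The same cat and cmp check can be run on the real pieces, for example `cat xa? | cmp - FCAna.txt`, which prints nothing when the bytes match.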

Is there a reason you needed the file encodings reported by file to match?