I need to split a file. I usually use split,
but this time the resulting pieces need to have the same encoding as the original. The original is:
eianni@ianni-desktop:~/Desktop$ file FCAna.txt
FCAna.txt: ISO-8859 text, with CRLF line terminators
while the new ones are:
eianni@ianni-desktop:~/Desktop$ file xaa
xaa: ISO-8859 text, with CRLF line terminators
eianni@ianni-desktop:~/Desktop$ file xab
xab: Non-ISO extended-ASCII text, with CRLF line terminators
The second one is not OK. How can I solve this?
The command executed is:
split --lines=1588793 FCAna.txt
Best Answer
I think this could be down to the way the file command works. Reading from the manpage:
My interpretation of this statement is that file's ability to determine the encoding depends on whether the text contains characters that make the encoding obvious. For a UTF encoding, for example, the byte patterns or the presence of a BOM could be used.
Your original text file may have used characters that can only be encoded in an extended ASCII character set (a pound symbol (£) maybe?), and so file determined it was an ISO 8859 file. But now that the file is split, that symbol only appears in the first file and not the second.
You should be able to test this hypothesis by searching the text for "extended" characters and splitting at different points. As a test I did the following:
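A minimal sketch of such a test, assuming a POSIX shell plus the file, split, cmp, tr and od utilities. The sample contents and the names sample.txt and part_ are made up for illustration and are not the commands from the original test. It builds a two-line file whose lines contain different high bytes, splits it, lists the non-ASCII bytes in each piece, and checks that the pieces joined together are byte-for-byte identical to the original:

```shell
#!/bin/sh
# Work in a throwaway directory so nothing on the Desktop is touched.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir" || exit 1

# Hypothetical sample data (not the real FCAna.txt):
#   line 1 contains byte 0xA3, the pound sign in ISO-8859-1
#   line 2 contains byte 0x80, which is not a printable ISO-8859 character
printf 'price: \243 5\r\n'   >  sample.txt
printf 'odd byte: \200\r\n'  >> sample.txt

# -l is the short form of --lines; one line per piece here.
split -l 1 sample.txt part_

# file guesses the encoding of each file separately from the bytes it sees,
# so the labels may differ between the original and the pieces.
file sample.txt part_aa part_ab || true

# Show which non-ASCII bytes each file actually contains (hex).
for f in sample.txt part_aa part_ab; do
    printf '%s high bytes:' "$f"
    LC_ALL=C tr -d '\000-\177' < "$f" | od -An -tx1
    echo
done

# split copies bytes unchanged, so joining the pieces should reproduce the original.
if cat part_aa part_ab | cmp -s - sample.txt; then
    roundtrip=identical
else
    roundtrip=different
fi
echo "round trip: $roundtrip"
```

If the round trip reports identical, split has not changed any bytes, and any difference in what file prints comes from its guess about each piece rather than from a change in the data. The same tr and od pipeline can be pointed at xaa and xab to see which high bytes each one contains.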
Is there a reason you need the file encodings reported by file to match?