Bash – How to split a large text file into smaller files with an equal number of lines

bashfileunix

I've got a large (by number of lines) plain text file that I'd like to split into smaller files, also by number of lines. So if my file has around 2M lines, I'd like to split it up into 10 files that contain 200k lines, or 100 files that contain 20k lines (plus one file with the remainder; being evenly divisible doesn't matter).

I could do this fairly easily in Python, but I'm wondering if there's any kind of ninja way to do this using Bash and Unix utilities (as opposed to manually looping and counting / partitioning lines).

Best Answer

Have a look at the split command:

$ split --help
Usage: split [OPTION] [INPUT [PREFIX]]
Output fixed-size pieces of INPUT to PREFIXaa, PREFIXab, ...; default
size is 1000 lines, and default PREFIX is `x'.  With no INPUT, or when INPUT
is -, read standard input.

Mandatory arguments to long options are mandatory for short options too.
  -a, --suffix-length=N   use suffixes of length N (default 2)
  -b, --bytes=SIZE        put SIZE bytes per output file
  -C, --line-bytes=SIZE   put at most SIZE bytes of lines per output file
  -d, --numeric-suffixes  use numeric suffixes instead of alphabetic
  -l, --lines=NUMBER      put NUMBER lines per output file
      --verbose           print a diagnostic to standard error just
                            before each output file is opened
      --help     display this help and exit
      --version  output version information and exit

You could do something like this:

split -l 200000 filename

which will create files each with 200000 lines named xaa xab xac ...

Another option, split by size of output file (still splits on line breaks):

 split -C 20m --numeric-suffixes input_filename output_prefix

creates files like output_prefix01 output_prefix02 output_prefix03 ... each of maximum size 20 megabytes.

Related Solutions

Python – How to copy a file in Python

shutil has many methods you can use. One of which is:

from shutil import copyfile
copyfile(src, dst)

# 2nd option
copy(src, dst)  # dst can be a folder; use copy2() to preserve timestamp

Copy the contents of the file named src to a file named dst. Both src and dst need to be the entire filename of the files, including path.
The destination location must be writable; otherwise, an IOError exception will be raised.
If dst already exists, it will be replaced.
Special files such as character or block devices and pipes cannot be copied with this function.
With copy, src and dst are path names given as strs.

Another shutil method to look at is shutil.copy2(). It's similar but preserves more metadata (e.g. time stamps).

If you use os.path operations, use copy rather than copyfile. copyfile will only accept strings.

Unix – text files end with a newline

Because that’s how the POSIX standard defines a line:

3.206 Line

A sequence of zero or more non- <newline> characters plus a terminating <newline> character.

Therefore, lines not ending in a newline character aren't considered actual lines. That's why some programs have problems processing the last line of a file if it isn't newline terminated.

There's at least one hard advantage to this guideline when working on a terminal emulator: All Unix tools expect this convention and work with it. For instance, when concatenating files with cat, a file terminated by newline will have a different effect than one without:

$ more a.txt
foo
$ more b.txt
bar$ more c.txt
baz
$ cat {a,b,c}.txt
foo
barbaz

And, as the previous example also demonstrates, when displaying the file on the command line (e.g. via more), a newline-terminated file results in a correct display. An improperly terminated file might be garbled (second line).

For consistency, it’s very helpful to follow this rule – doing otherwise will incur extra work when dealing with the default Unix tools.

Think about it differently: If lines aren’t terminated by newline, making commands such as cat useful is much harder: how do you make a command to concatenate files such that

it puts each file’s start on a new line, which is what you want 95% of the time; but
it allows merging the last and first line of two files, as in the example above between b.txt and c.txt?

Of course this is solvable but you need to make the usage of cat more complex (by adding positional command line arguments, e.g. cat a.txt --no-newline b.txt c.txt), and now the command rather than each individual file controls how it is pasted together with other files. This is almost certainly not convenient.

… Or you need to introduce a special sentinel character to mark a line that is supposed to be continued rather than terminated. Well, now you’re stuck with the same situation as on POSIX, except inverted (line continuation rather than line termination character).

_{Now, on non POSIX compliant systems (nowadays that’s mostly Windows), the point is moot: files don’t generally end with a newline, and the (informal) definition of a line might for instance be “text that is separated by newlines” (note the emphasis). This is entirely valid. However, for structured data (e.g. programming code) it makes parsing minimally more complicated: it generally means that parsers have to be rewritten. If a parser was originally written with the POSIX definition in mind, then it might be easier to modify the token stream rather than the parser — in other words, add an “artificial newline” token to the end of the input.}

Best Answer

Related Solutions

Python – How to copy a file in Python

Unix – text files end with a newline

Related Topic