Python Sorting – Should You Sort a List During or After Creation?

data structuresefficiencypythonsorting

I have to read through an extremely large amount of network data from various log files, and compile relevant information about that data to perform statistical analysis on it (the top communicators, top ip addresses that send on average the largest packets, average packet size, etc.) I decided to make a matrix where each index in the outer array corresponds to a sorted list of ips in incremental order.

To get an idea of my data set, I have a log file that is generated each day which has information regarding all the communication that occurred within a network that day. I have been generating log files since the middle of May and each log file has about 5 million lines each, so the more efficient this program is, the better.

My question is the following:

When I am compiling data from all the log files into this matrix, should I sort my outer array layer by IPs while I'm still putting together the data, or do I just append the new IPs to the end of the file as I find them and then sort the list afterwards? Is there a better option here that I haven't thought of? which way would be the most efficient? I am going to be using python 2.7, if that makes a difference. Also, due to restrictions on what I can do, I cannot install any new modules to the machine this code will run on, so unless I can create a database with python natively, that isn't an available option to me.

Best Answer

Insertion sort works best when you are inserting a new value into something that is already sorted. So what I would do is use Quicksort to sort the original dataset that you have, then when additional log entries come in, add them one by one into the already sorted set.

With Quicksort being O(n*logn) and Insertion Sort being O(n) when used with an already sorted set, the total time for everything will be O(a * log(a) + b), where a is the size of the original data set and b is the additional logs you place in afterwards.