HTTP dissector that reads from pcap

httppcap

I have some pcap data from a local interface which I'd like to analyze. Specifically, I'd like the content of HTTP sessions. I'm aware of many HTTP header statistics tools, but I would specifically like to reassemble the content of each complete HTTP connection.

Is there a suitable layer-7 packet dumping tool for Linux, analogous to what tcpdump et al do at layers 3 and 4, that is, something that can understand HTTP and reassemble its sessions?

Feel free to redirect me if this has been asked before, though I haven't been able to find any answer to this yet on SF. Thanks!

Best Answer

I suspect tcpflow would do the job well enough: it can take a pcap file and divvy it up into its component flows. For instance, I just did the following as a test:

sudo tcpdump -i eth0 -n -s 0 -w /tmp/capt -v port 80

Then reloaded your question, stopped tcpdump, and then ran:

tcpflow -r /tmp/capt

And got about 20 files, each containing a single HTTP request or response (as appropriate).
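As a hedged illustration of what those files look like: tcpflow names each output file after the flow's endpoints, with zero-padded octets and ports, one file per direction. The directory and addresses below are fabricated for the example; only the naming scheme is tcpflow's.

```shell
# Fabricate two flow files using tcpflow's src-ip.port-dst-ip.port
# naming scheme (addresses are made up), then pull out the request
# line from every flow heading toward port 80.
dir=$(mktemp -d)
printf 'GET /index.html HTTP/1.1\r\nHost: example.com\r\n\r\n' \
    > "$dir/010.000.000.005.51234-203.000.113.010.00080"
printf 'HTTP/1.1 200 OK\r\n\r\n' \
    > "$dir/203.000.113.010.00080-010.000.000.005.51234"
head -n 1 "$dir"/*-*.00080   # request lines live in flows toward port 80
rm -rf "$dir"
```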

On the other hand, I usually just go the soft option and open up my capture files in wireshark, whose "Analyze -> Follow TCP Stream" mode is freaking awesome (colour coded and everything).

Both of these tools, by the way, can do the packet capture themselves, too -- you don't have to feed them an existing packet capture via tcpdump.

If you have a specific need to parse the HTTP traffic after you've split it up, that's quite straightforward: the HTTP protocol is very simple. In the simple (non-keepalive, non-pipelined) case, you can use the following to extract the request or response header:

sed '/^\r$/q' <connectionfile>

And this to get the body of the request/response:

sed -n '/^\r$/,$p' <connectionfile>

(You can also pipe things through those sed commands if you like).
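To see the two commands in action, here's a quick demonstration on a fabricated response (note that the body command also emits the blank separator line, since the range starts at it):

```shell
# Build a minimal HTTP response and run both sed commands over it.
tmp=$(mktemp)
printf 'HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok' > "$tmp"
sed '/^\r$/q' "$tmp"        # header, up to and including the blank line
sed -n '/^\r$/,$p' "$tmp"   # blank separator line, then the body
rm -f "$tmp"
```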

On keepalive connections you need to get a little scripty, but even then it's about 20 lines of script to process the two files (A to B, and B to A): extract the headers, read the Content-Length, then read that many bytes of body. And if you're doing any sort of automated processing, you'll be writing code to do that stuff anyway, so a bit of HTTP dissection doesn't add considerably to the workload.
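That "20 lines of script" might be sketched like this, assuming every message either carries a Content-Length header or has no body (i.e. no chunked transfer encoding), which is a simplification:

```shell
#!/bin/sh
# Sketch: walk one direction of a keepalive connection file message by
# message, printing each header and skipping each body by Content-Length.
split_messages() {
    f="$1"
    total=$(wc -c < "$f")
    offset=0
    while [ "$offset" -lt "$total" ]; do
        # Header block: everything up to and including the blank \r line.
        hdr=$(tail -c +$((offset + 1)) "$f" | sed '/^\r$/q')
        # Command substitution strips the final newline; add it back.
        hdrlen=$(( $(printf '%s' "$hdr" | wc -c) + 1 ))
        clen=$(printf '%s' "$hdr" | tr -d '\r' |
               awk -F': *' 'tolower($1) == "content-length" { print $2 }')
        clen=${clen:-0}   # no Content-Length header means no body here
        printf -- '--- message (body: %s bytes) ---\n' "$clen"
        printf '%s\n' "$hdr"
        offset=$((offset + hdrlen + clen))
    done
}

# Demo on a fabricated file holding two pipelined requests:
tmp=$(mktemp)
printf 'POST /a HTTP/1.1\r\nContent-Length: 5\r\n\r\nhelloGET /b HTTP/1.1\r\nHost: x\r\n\r\n' > "$tmp"
split_messages "$tmp"
rm -f "$tmp"
```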