Python Logging – Processing Postfix Logs with Python


I need to process all log messages from Postfix (/var/log/mail/mail.log) and print summary statistics: how many emails were sent and received, and from/to which email addresses.

The situation is made more complicated by the fact that Postfix produces multi-line log entries (Apache, for example, has single-line entries, which would have made the task much easier).

A sample Postfix log might look something like this:

2013-12-03 14:40:45  postfix:  6F1AA10B: client=unknown[64.12.143.81]
2013-12-03 14:40:45  postfix:  6F1AA10B: message-id=<529DDF56.6050403@aol.com>
2013-12-03 14:40:45  postfix:  6F1AA10B: from=<martin.vegter@aol.com>, size=1571, nrcpt=1 (queue active)
2013-12-03 14:40:45  postfix:  6F1AA10B: to=<martin@example.com>, relay=local, delay=0.13, delays=0.13
2013-12-03 14:40:45  postfix:  6F1AA10B: removed

2013-12-03 14:52:07  postfix:  9DD9610B: client=unknown[209.85.219.65]
2013-12-03 14:52:07  postfix:  9DD9610B: message-id=<CANE3EAQUsGwj6ZBAU-awymzsG=76XZnHih@mail.gmail.com>
2013-12-03 14:52:07  postfix:  9DD9610B: from=<martin.vegter@gmail.com>, size=2388, nrcpt=1 (queue active)
2013-12-03 14:52:07  postfix:  9DD9610B: to=<martin@example.com>, orig_to=<martin@example.com>, relay=local
2013-12-03 14:52:07  postfix:  9DD9610B: removed

Every email message that was processed by Postfix has a unique message ID (in my example 6F1AA10B).

What would be the best approach to processing the logs in Python? What data structure would you recommend for storing the entries?

Best Answer

How you store the items depends on how you are going to process them further; in-memory aggregation is very different from storing individual entries as rows in a SQL database, for example.
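
If all that is needed is the summary described in the question, in-memory aggregation can be as simple as two collections.Counter objects keyed by email address. The sketch below is only an illustration; `parsed_entries` stands in for whatever per-message records your parser ends up producing:

from collections import Counter

# Two counters: one keyed by sender, one keyed by recipient.
# `parsed_entries` is a placeholder for the per-message records; here it
# is just a small inline example matching the sample log above.
parsed_entries = [
    {'from': 'martin.vegter@aol.com', 'to': 'martin@example.com'},
    {'from': 'martin.vegter@gmail.com', 'to': 'martin@example.com'},
]

sent_counts = Counter()      # sender address    -> number of emails sent
received_counts = Counter()  # recipient address -> number of emails received

for entry in parsed_entries:
    if entry.get('from'):
        sent_counts[entry['from']] += 1
    if entry.get('to'):
        received_counts[entry['to']] += 1

print(sent_counts.most_common())
print(received_counts.most_common())

Counter.most_common() then gives the per-address totals sorted in descending order.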

Parsing can be done by grouping records on the message ID in each line. The events for a given message ID can span multiple timestamps, but you can parse each line into a dictionary and then use itertools.groupby() to group the parsed lines by that ID. Note that groupby() only groups consecutive items with the same key, so this works as long as the lines for one message are adjacent in the log, as in the sample above.

I'll not go into the line parsing itself here (a rough parsing sketch follows the example below), but if we assume that each line is parsed into a dictionary with a 'msgid' key, you can do:

from itertools import groupby
from operator import itemgetter

# `parsed_lines` is assumed to be an iterable of dictionaries, one per
# log line, each containing at least a 'msgid' key (parsing not shown here).
for msgid, messages in groupby(parsed_lines, key=itemgetter('msgid')):
    for message in messages:
        # Each `message` is a dictionary where the `msgid` is the same;
        # aggregate or store it here as needed.
        ...
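
For completeness, here is a rough sketch of the parsing step that was skipped above, written against the sample format from the question. Real Postfix logs differ in layout, so the regular expressions, the parse_lines() helper and the field names are assumptions, not a drop-in solution:

import re
from itertools import groupby
from operator import itemgetter

# Matches lines like:
#   2013-12-03 14:40:45  postfix:  6F1AA10B: from=<martin.vegter@aol.com>, ...
LINE_RE = re.compile(
    r'^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+'
    r'postfix:\s+(?P<msgid>[0-9A-F]+):\s+(?P<rest>.*)$'
)
ADDR_RE = re.compile(r'\b(from|to)=<([^>]*)>')

def parse_lines(lines):
    """Yield one dictionary per parseable log line."""
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # skip blank lines and anything that is not a postfix line
        entry = match.groupdict()
        # Pull from=<...> / to=<...> out of the remainder, if present.
        for field, address in ADDR_RE.findall(entry['rest']):
            entry[field] = address
        yield entry

with open('/var/log/mail/mail.log') as log:
    parsed_lines = parse_lines(log)
    for msgid, messages in groupby(parsed_lines, key=itemgetter('msgid')):
        # Collapse the lines of one message into a single record.
        record = {'msgid': msgid}
        for message in messages:
            record.update(message)
        print(record.get('from'), '->', record.get('to'))

From each merged record you can then update the counters sketched earlier instead of printing it.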