I need to process all log messages from Postfix (/var/log/mail/mail.log) and print a summary/statistics: how many emails were sent/received, and from/to which email addresses.
The situation is complicated by the fact that Postfix writes multi-line log entries (Apache, by contrast, writes single-line entries, which would make the task much easier).
A sample Postfix log might look something like this:
2013-12-03 14:40:45 postfix: 6F1AA10B: client=unknown[64.12.143.81]
2013-12-03 14:40:45 postfix: 6F1AA10B: message-id=<529DDF56.6050403@aol.com>
2013-12-03 14:40:45 postfix: 6F1AA10B: from=<martin.vegter@aol.com>, size=1571, nrcpt=1 (queue active)
2013-12-03 14:40:45 postfix: 6F1AA10B: to=<martin@example.com>, relay=local, delay=0.13, delays=0.13
2013-12-03 14:40:45 postfix: 6F1AA10B: removed
2013-12-03 14:52:07 postfix: 9DD9610B: client=unknown[209.85.219.65]
2013-12-03 14:52:07 postfix: 9DD9610B: message-id=<CANE3EAQUsGwj6ZBAU-awymzsG=76XZnHih@mail.gmail.com>
2013-12-03 14:52:07 postfix: 9DD9610B: from=<martin.vegter@gmail.com>, size=2388, nrcpt=1 (queue active)
2013-12-03 14:52:07 postfix: 9DD9610B: to=<martin@example.com>, orig_to=<martin@example.com>, relay=local
2013-12-03 14:52:07 postfix: 9DD9610B: removed
Every email message processed by Postfix has a unique message ID (6F1AA10B in my example).
What would be the best approach to process the logs in Python? What data structure would you recommend to use for storing the entries?
Best Answer
How you store your items depends on how you are processing them further; in-memory aggregating is very different from storing individual items in rows in a SQL database, for example.
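If in-memory aggregation is all you need, a couple of collections.Counter objects go a long way. This is only a sketch under the assumption that sender and recipient appear in from=<...> and to=<...> fields exactly as in the sample log; the regexes and variable names are mine, not part of Postfix:

```python
import re
from collections import Counter

# Hypothetical aggregation: tally senders and recipients directly from
# the from=<...> / to=<...> fields seen in the sample log.
sent_by = Counter()
received_by = Counter()

log_lines = [
    '2013-12-03 14:40:45 postfix: 6F1AA10B: from=<martin.vegter@aol.com>, size=1571, nrcpt=1 (queue active)',
    '2013-12-03 14:40:45 postfix: 6F1AA10B: to=<martin@example.com>, relay=local, delay=0.13, delays=0.13',
]

for line in log_lines:
    m = re.search(r'\bfrom=<([^>]*)>', line)
    if m:
        sent_by[m.group(1)] += 1
    # \b keeps this from matching inside orig_to=<...>
    m = re.search(r'\bto=<([^>]*)>', line)
    if m:
        received_by[m.group(1)] += 1

print(sent_by.most_common())      # counts per sender
print(received_by.most_common())  # counts per recipient
```

If you later need per-message detail (size, relay, delays) rather than just counts, that is the point where rows in a SQL table start to pay off over counters.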
Parsing could be done by grouping records on a specific element of each line. Presumably the entries for a given message ID can span multiple timestamps, but you can parse each line into a dictionary, then use itertools.groupby() to group the parsed lines. I'll not go into the line parsing itself, but if we assume each line is parsed into a dictionary with a msgid key, you can do:
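Something along these lines; the regex is an assumption based on the sample log format above, and parse_line / group_by_msgid are illustrative names, not a standard API:

```python
import itertools
import re

# Assumed line layout: "<date> <time> postfix: <MSGID>: <rest>"
LINE_RE = re.compile(
    r'^(?P<timestamp>\S+ \S+) postfix: (?P<msgid>[0-9A-F]+): (?P<rest>.*)$'
)

def parse_line(line):
    """Return a dict with timestamp, msgid and rest keys, or None."""
    match = LINE_RE.match(line.rstrip('\n'))
    return match.groupdict() if match else None

def group_by_msgid(lines):
    # Parse each line, drop anything unparseable, then group
    # consecutive records that share a message ID.
    parsed = filter(None, (parse_line(line) for line in lines))
    for msgid, records in itertools.groupby(parsed, key=lambda d: d['msgid']):
        yield msgid, list(records)

sample = [
    '2013-12-03 14:40:45 postfix: 6F1AA10B: client=unknown[64.12.143.81]',
    '2013-12-03 14:40:45 postfix: 6F1AA10B: removed',
    '2013-12-03 14:52:07 postfix: 9DD9610B: client=unknown[209.85.219.65]',
]

for msgid, records in group_by_msgid(sample):
    print(msgid, len(records))
```

One caveat: itertools.groupby() only groups consecutive items, so this works as-is only if each message's lines are contiguous in the log. If messages interleave (as they can on a busy server), collect records into a dict keyed by msgid, or sort by msgid first, instead.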