How to parse S3 logs

Tags: amazon-s3, log-files, logparser, regex

I've been trying to parse AWS S3 logs following the documentation, but I've been running into some problems. Specifically, I keep running into new, rare log lines that break my regex. Every time this happens, I modify my regex to account for these lines, but I'd really like to get this right once and for all.

The core problem is that the user-agent field can contain arbitrary characters, apparently including double quotes, and they aren't even escaped. I recently ran into the following record, for example:

8b24a6e5b101a6376ebfd307854b8379da11acdc3efc2a2cbbf305b4a2af8de7 -----redacted-bucket-name----- [13/Nov/2015:18:43:39 +0000] - 6AA66FF0D0ACB8BF REST.GET.OBJECT a.gif "GET /-----redacted-path----- HTTP/1.1" 200 - 26 26 18 17 "" "Mozilla/5.0 (Linux; Android 5.0.1; Alba 7" Tablet Build/LRX22C) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.76 Safari/537.36" -

My parser successfully parsed more than 4.3 million log lines before it choked on this gem.
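
To make the failure concrete, here is a minimal sketch. The `tail` strings are shortened, made-up stand-ins for the end of a log line (referrer, user agent, version ID), not real records, and `strict` is a cut-down pattern that only mimics the `[^"]+` user-agent group from my regex.

```python
import re

# Hypothetical, shortened tails of a log line: referrer, user agent, version ID.
clean_tail = '"-" "Mozilla/5.0 (Android; Alba 7 Tablet) Safari/537.36" -'
quoted_tail = '"-" "Mozilla/5.0 (Android; Alba 7" Tablet) Safari/537.36" -'

# Same idea as the user-agent group in my regex: no quote allowed inside the field.
strict = re.compile(
    r'"(?P<referrer>[^"]*)" '
    r'"(?P<user_agent>[^"]+)" '
    r'(?P<version_id>\S+)$'
)

clean_match = strict.match(clean_tail)    # matches: no quote inside the user agent
quoted_match = strict.match(quoted_tail)  # fails: [^"]+ stops at the quote after "7"

print(clean_match is not None, quoted_match is None)
```

The character class cannot step over the unescaped quote, so the whole line is rejected instead of just producing an odd-looking field.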

Earlier, the parser also choked on the following record, until I changed the <referrer> part of the regex to be non-greedy:

8b24a6e5b101a6376ebfd307854b8379da11acdc3efc2a2cbbf305b4a2af8de7 -----redacted-bucket-name----- [26/Sep/2015:12:59:27 +0000] - C5EBC1D929EBBD53 REST.GET.OBJECT a.gif "GET /-----redacted-path----- HTTP/1.1" 200 - 26 26 46 46 "",$B$1,"&TR=",$B$2,"&cid=0" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko" -

My parser regex (Python):

import re

log_regex = re.compile(
    r'(?P<owner>\S+) '
    r'(?P<bucket>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'(?P<ip_address>\S+) '
    r'(?P<requester>\S+) '
    r'(?P<request_id>\S+) '
    r'(?P<operation>\S+) '
    r'(?P<key>\S+) '
    r'("(?P<request_uri>[^"]+)"|-) '
    r'(?P<status>\S+) '
    r'(?P<error_code>\S+) '
    r'(?P<response_bytes>\S+) '
    r'(?P<object_size>\S+) '
    r'(?P<request_ms>\S+) '
    r'(?P<processing_ms>\S+) '
    r'("(?P<referrer>.+?)" |- )'
    r'("(?P<user_agent>[^"]+)"|-) '
)

Am I taking the wrong approach? Are there ways to improve my regex? This has been really frustrating.

Best Answer

For now I've settled on this, which makes both the referrer and the user-agent groups non-greedy:

import re

log_regex = re.compile(
    r'(?P<owner>\S+) '
    r'(?P<bucket>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'(?P<ip_address>\S+) '
    r'(?P<requester>\S+) '
    r'(?P<request_id>\S+) '
    r'(?P<operation>\S+) '
    r'(?P<key>\S+) '
    r'("(?P<request_uri>[^"]+)"|-) '
    r'(?P<status>\S+) '
    r'(?P<error_code>\S+) '
    r'(?P<response_bytes>\S+) '
    r'(?P<object_size>\S+) '
    r'(?P<request_ms>\S+) '
    r'(?P<processing_ms>\S+) '
    r'("(?P<referrer>.+?)"|-) '
    r'("(?P<user_agent>.+?)"|-) '
)
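
One possible refinement, offered as a sketch rather than a definitive parser: anchor the pattern at the end of the line and let the fixed fields on either side pin down the two free-form ones. This assumes the layout shown in the question, where the version ID is the last field; if your logs carry extra trailing fields, the tail of the pattern would need extending. The names `LOG_REGEX` and `parse_line` are made up for illustration. Note also that the records as posted seem to be missing one field between the timestamp and the request URI (presumably redacted), so the sample below inserts a placeholder IP address, 192.0.2.1, to match the field list in the regex.

```python
import re

# Sketch only: assumes the version ID is the last field on the line.
LOG_REGEX = re.compile(
    r'(?P<owner>\S+) '
    r'(?P<bucket>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'(?P<ip_address>\S+) '
    r'(?P<requester>\S+) '
    r'(?P<request_id>\S+) '
    r'(?P<operation>\S+) '
    r'(?P<key>\S+) '
    r'(?:"(?P<request_uri>.*?)"|-) '
    r'(?P<status>\d{3}|-) '
    r'(?P<error_code>\S+) '
    r'(?P<response_bytes>\d+|-) '
    r'(?P<object_size>\d+|-) '
    r'(?P<request_ms>\d+|-) '
    r'(?P<processing_ms>\d+|-) '
    r'(?:"(?P<referrer>.*?)"|-) '   # lazy, and may be empty
    r'(?:"(?P<user_agent>.*)"|-) '  # greedy: runs to the last quote before the final field
    r'(?P<version_id>\S+)$'
)


def parse_line(line):
    """Return a dict of named fields, or None if the line does not fit the layout."""
    match = LOG_REGEX.match(line.rstrip('\n'))
    return match.groupdict() if match else None


# First record from the question, with a placeholder IP (192.0.2.1) inserted
# after the timestamp; the IP is an assumption, not part of the original record.
sample = (
    '8b24a6e5b101a6376ebfd307854b8379da11acdc3efc2a2cbbf305b4a2af8de7 '
    '-----redacted-bucket-name----- [13/Nov/2015:18:43:39 +0000] 192.0.2.1 - '
    '6AA66FF0D0ACB8BF REST.GET.OBJECT a.gif '
    '"GET /-----redacted-path----- HTTP/1.1" 200 - 26 26 18 17 "" '
    '"Mozilla/5.0 (Linux; Android 5.0.1; Alba 7" Tablet Build/LRX22C) '
    'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.76 Safari/537.36" -'
)

fields = parse_line(sample)
print(fields['referrer'])
print(fields['user_agent'])
```

The numeric fields are tightened to digits or `-` so that a stray quote cannot drag the match across field boundaries. A referrer that itself contains the sequence `" "` is still ambiguous, since nothing in the line says where it ends, so it is worth logging lines that return None rather than dropping them silently.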