Ways to parse NCSA combined based log files

apache-2.2 logging ncsa parsing squid

I've done a bit of site: searching with Google on Server Fault, Super User and Stack Overflow. I also checked non-site-specific results and didn't really see a question like this, so here goes…

I did spot this question related to grep and awk, which has some great knowledge, but I don't feel the text qualification challenge was addressed. This question also broadens the scope to any platform and any program.

I've got squid or apache logs based on the NCSA combined format. By "based" I mean that the first n columns in the file follow the NCSA combined standard; there might be more columns after them with custom content.

Here is an example line from a squid combined log:

1.1.1.1 - - [11/Dec/2010:03:41:46 -0500] "GET http://yourdomain.com:8080/en/some-page.html HTTP/1.1" 200 2142 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; C) AppleWebKit/532.4 (KHTML, like Gecko)" TCP_MEM_HIT:NONE

I'd like to be able to parse n logs and output specific columns for sorting, counting, finding unique values, etc.

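The main challenge, which makes this a little tricky and is also why I feel the question hasn't yet been asked or answered, is the text qualification conundrum: columns are separated by spaces, but the timestamp, request and user agent columns contain spaces themselves and are wrapped in brackets or double quotes.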
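
To make the problem concrete, here is a small Python sketch (Python is only my choice for illustration) that splits the example line above on whitespace. The variable names are made up. The timestamp, request and user agent get cut into several pieces, so column positions stop being meaningful.

```python
# Naive whitespace splitting of the example squid line from the question.
line = (
    '1.1.1.1 - - [11/Dec/2010:03:41:46 -0500] '
    '"GET http://yourdomain.com:8080/en/some-page.html HTTP/1.1" 200 2142 "-" '
    '"Mozilla/5.0 (Windows; U; Windows NT 6.1; C) AppleWebKit/532.4 (KHTML, like Gecko)" '
    'TCP_MEM_HIT:NONE'
)

fields = line.split()

# NCSA combined has 9 logical columns, and this line adds one custom squid
# column, yet splitting on spaces produces far more pieces than that.
print(len(fields))
print(fields[3], fields[4])  # the bracketed timestamp is cut in two
print(fields[5:8])           # the quoted request is cut in three
```

Any approach that answers this question has to treat brackets and double quotes as qualifiers rather than splitting on every space.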

When I spotted asql from the grep/awk question, I was very excited but then realised that it didn't support combined out of the box, something I'll look at extending I guess.

Looking forward to answers, and learning new stuff!
Answers don't have to be limited to a platform or program/language. For context, the platforms I use the most are Linux and OSX.

Cheers

Best Answer

Using Perl, tested on v5.10.0 built for darwin-thread-multi-2level (OSX)

To print the UserAgent column:

perl -n -e 'print "$11\n" if /^([^ ]+) ([^ ]+) ([^ ]+) (\[[^\]]+\]) "(.*) (.*) (.*)" ([0-9\-]+) ([0-9\-]+) "(.*)" "(.*)"/' -- test.log

option -n wraps the program in a loop over each line of test.log
option -e supplies the one-line program on the command line

I borrowed and tweaked the regular expression, which I found via Google in the PHP Cookbook. I removed the $ anchor from the end of the pattern so that it also matches custom formats that add columns after the NCSA combined ones. The pattern can easily be extended with more groups.

Each capture group () in the regular expression ends up in one of the variables $1 to $n, numbered from left to right.

Quick and dirty and very easy to extend and script.
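
For anyone who prefers a script to a one-liner, here is a rough Python sketch of the same idea. It is my own translation of the Perl pattern, not something from the cookbook. The group names, the `parse_line` helper and the trailing `extra` group (one way of extending the pattern to pick up custom columns) are names I made up for illustration.

```python
import re

# Same column layout as the Perl pattern above, with named groups instead of $1..$n.
# "extra" is a hypothetical group for whatever custom columns follow the combined ones.
COMBINED = re.compile(
    r'^(?P<host>[^ ]+) (?P<ident>[^ ]+) (?P<user>[^ ]+) '
    r'(?P<time>\[[^\]]+\]) '
    r'"(?P<method>.*) (?P<url>.*) (?P<protocol>.*)" '
    r'(?P<status>[0-9\-]+) (?P<bytes>[0-9\-]+) '
    r'"(?P<referer>.*)" "(?P<agent>.*)"'
    r'(?: (?P<extra>.*))?'
)


def parse_line(line):
    """Return a dict of named columns, or None if the line does not match."""
    match = COMBINED.match(line)
    return match.groupdict() if match else None


sample = (
    '1.1.1.1 - - [11/Dec/2010:03:41:46 -0500] '
    '"GET http://yourdomain.com:8080/en/some-page.html HTTP/1.1" 200 2142 "-" '
    '"Mozilla/5.0 (Windows; U; Windows NT 6.1; C) AppleWebKit/532.4 (KHTML, like Gecko)" '
    'TCP_MEM_HIT:NONE'
)

parsed = parse_line(sample)
print(parsed["agent"])  # the user agent column, like $11 in the Perl version
print(parsed["extra"])  # the custom squid column after the combined columns
```

Named groups make it harder to lose count when the pattern grows, at the cost of a longer expression.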

Some examples of piping the output:

| sort | uniq            unique column values
| sort | uniq | wc -l    unique column count
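
If the post-processing is easier to do in a script than in a pipeline, the two pipes above can be approximated in Python roughly as follows. This is only a sketch: the `summarize` function and the sample values are invented for illustration, and in practice the values would be the column printed by the one-liner.

```python
from collections import Counter


def summarize(values):
    """Return (sorted unique values, number of unique values, per-value counts)."""
    counts = Counter(values)
    return sorted(counts), len(counts), counts


# Hypothetical column values, e.g. status codes extracted from a log
values = ["200", "404", "200", "304", "200"]

unique, n_unique, counts = summarize(values)
print(unique)                 # same values as `| sort | uniq`
print(n_unique)               # same number as `| sort | uniq | wc -l`
print(counts.most_common(1))  # the most frequent value with its count
```

The Counter also keeps the per-value counts, which the plain `uniq` pipe discards.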

Critique and improvements welcome

Related Solutions

Parse/Edit Apache conf files with Ruby

I ended up just writing my own Ruby script... Not very well done, but in case anyone needs it, here's the guts of it. It looks for the contents of the <VirtualHost></VirtualHost> tag so that it can create a second <VirtualHost> whose ServerName is a subdomain of our wildcard SSL cert...

begin
  logMsg "Updating apache config file for user #{user} (#{domain_httpd_conf})"

  domain_httpd_conf_io = File.open(domain_httpd_conf,File::RDONLY)

  ip = ''  # set from the <VirtualHost ip:port> line in the loop below
  main_vhost_config = []
  ssl_vhost_config = ["  ServerName #{auto_ssl_domain}",'  Include "conf/wildcard-ssl.conf"']

  indent = 1

  while line = domain_httpd_conf_io.gets

    line_indented = '  '*indent+line.strip

    if line =~ /^[[:space:]]*<VirtualHost ([0-9]+\.[0-9]+\.[0-9]+\.[0-9]+)(:[0-9]+)[^>]*>/
      ip = $1
    elsif line =~ /^[[:space:]]*<\/VirtualHost>/
      break  # stop reading at the end of the first <VirtualHost> block
    elsif line =~ /^[[:space:]]*(ServerAlias|ServerName).*/
      main_vhost_config.push line_indented
    else

      if line =~ /^[[:space:]]*<[^\/]/
        indent += 1
      elsif line =~ /^[[:space:]]*<[\/]/
        indent = [1, indent-1].max
        line_indented = '  '*indent + line.strip
      end

      main_vhost_config.push line_indented
      ssl_vhost_config.push line_indented
    end
  end

  main_vhost_config.push "  Include #{extraconf_dir}/*.conf"

  domain_httpd_conf_io.close
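  # Combine the flags with bitwise |; logical || evaluates to File::WRONLY alone, so the file is not truncated
  domain_httpd_conf_io = File.open(domain_httpd_conf,File::WRONLY|File::TRUNC)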

  domain_httpd_conf_io.puts "<VirtualHost #{ip}:80 #{ip}:8080>"
  domain_httpd_conf_io.puts main_vhost_config
  domain_httpd_conf_io.puts "</VirtualHost>"

  domain_httpd_conf_io.puts

  domain_httpd_conf_io.puts "<VirtualHost #{ip}:443 #{ip}:8888>"
  domain_httpd_conf_io.puts ssl_vhost_config
  domain_httpd_conf_io.puts "</VirtualHost>"

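rescue SystemCallError => err
  # Interpolate the message; "string" + exception raises a TypeError
  logErr "ERROR: Unexpected error: #{err.message}"
ensure
  # Close the handle on success and on failure; it is nil if the first open failed
  domain_httpd_conf_io.close if domain_httpd_conf_io && !domain_httpd_conf_io.closed?
end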

Still has some bugs to work out but it mostly does what I want.

Problem with squid log files

First, I was using an old version of SARG. The latest version, 2.2.7.1, has no problems apart from one file that needs to be changed, ip2name.c. In that file, replace

void ip2name(char *ip,int ip_len)
{
u_long addr;
struct hostent *hp;
char **p;

by

void ip2name(char *ip,int ip_len)
{
unsigned long addr;
struct hostent *hp;
char **p;

After doing this, the reports run very well on a Mac.
