Logstash S3 input plugin re-scanning all bucket objects

amazon-s3 logstash

I am using the Logstash S3 Input plugin to process S3 access logs.

The access logs are all stored in a single bucket, and there are thousands of them. I have set up the plugin to include only S3 objects with a certain prefix (based on date, e.g. 2016-06).

However, I can see that Logstash is re-polling every object in the bucket and not taking into account objects it has previously analysed.

{:timestamp=>"2016-06-21T08:50:51.311000+0000", :message=>"S3 input: Found key", :key=>"2016-06-01-15-21-10-178896183CF6CEBB", :level=>:debug, :file=>"logstash/inputs/s3.rb", :line=>"111", :method=>"list_new_files"}

That is, every minute (or whatever interval you have set), Logstash starts at the beginning of the bucket and makes an AWS API call for every object it finds. It appears to do this to look up each object's last modified time so that it can pick out the relevant files for analysis. This obviously slows everything down and doesn't give me real-time analysis of the access logs.

Other than constantly updating the prefix to match only recent files, is there some way to make Logstash skip reading older S3 objects?

There is a sincedb_path parameter for the plugin, but that only seems to control where the plugin records which file was last analysed.
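(For what it's worth, that sincedb file appears to hold nothing more than a single timestamp, presumably the last modified time of the most recently processed object; the exact format may vary between plugin versions:)

2016-06-21 08:50:51 UTC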

Best Answer

This seems to be the default behaviour for this plugin, so it has to be managed using the plugin's own features.

Basically, you have to set up the plugin to back up, then delete, the objects under a prefix to the same bucket. That way, Logstash will skip those objects when it polls the bucket on the next interval.

Sample config:

s3 {
  bucket => "s3-access-logs-eu-west-1"
  type => "s3-access"
  prefix => "2016-"
  region => "eu-west-1"
  sincedb_path => "/tmp/last-s3-file-s3-access-logs-eu-west-1"
  # copy processed objects back into the same bucket under a new prefix...
  backup_add_prefix => "logstash-"
  backup_to_bucket => "s3-access-logs-eu-west-1"
  interval => 120
  # ...then remove the originals so they aren't scanned again
  delete => true
}

This config will scan the bucket every 120 seconds for objects whose keys start with

2016-

It will process those objects, then back them up to the same bucket under the prefix

logstash-

then delete them.

This means they won't be found at the next polling interval.
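Taking the key from the debug line in the question as an example, after one pass the original object is gone and only the backup copy remains (illustrative, based on the config above):

before: 2016-06-01-15-21-10-178896183CF6CEBB
after:  logstash-2016-06-01-15-21-10-178896183CF6CEBB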

2 important notes:

  1. You can't use backup_add_prefix by itself (even though the docs suggest you can); it only works in conjunction with backup_to_bucket.

  2. Make sure the IAM account/role you are using to interface with S3 has write permissions for the buckets you are using (otherwise Logstash can't delete/rename objects); see the sketch below.
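If it helps, an IAM policy along these lines should cover it (a rough sketch only, not an exact minimal policy; substitute your own bucket ARN):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::s3-access-logs-eu-west-1"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::s3-access-logs-eu-west-1/*"
    }
  ]
}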