Logstash S3 input plugin re-scanning all bucket objects

amazon-s3 logstash

I am using the Logstash S3 Input plugin to process S3 access logs.

The access logs are all stored in a single bucket, and there are thousands of them. I have set up the plugin to include only S3 objects with a certain prefix (based on date, e.g. 2016-06).

However, I can see that Logstash is re-polling every object in the bucket and not taking into account objects it has previously analysed.

{:timestamp=>"2016-06-21T08:50:51.311000+0000", :message=>"S3 input: Found key", :key=>"2016-06-01-15-21-10-178896183CF6CEBB", :level=>:debug, :file=>"logstash/inputs/s3.rb", :line=>"111", :method=>"list_new_files"}

That is, every minute (or whatever interval you have set), Logstash starts at the beginning of the bucket and makes an AWS API call for every object it finds. It appears to do this to look up each object's last modified time so that it can pick out the relevant files for analysis. This obviously slows everything down and doesn't give me real-time analysis of the access logs.

Other than constantly updating the prefix to match only recent files, is there some way to make Logstash skip reading older S3 objects?

There is a sincedb_path parameter for the plugin, but that only seems to control where the plugin records which file was last analysed.
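(For what it's worth, that sincedb file appears to hold nothing more than a single timestamp, presumably the last modified time of the most recently processed object; the exact format may vary between plugin versions:)

2016-06-21 08:50:51 UTC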

Best Answer

This seems to be the default behaviour for this plugin, so it has to be managed using the plugin's own features.

Basically, you have to set up the plugin to back up, then delete, the objects under a prefix to the same bucket. That way, Logstash will skip those objects when it polls the bucket on the next interval.

Sample config:

s3 {
  bucket => "s3-access-logs-eu-west-1"
  type => "s3-access"
  prefix => "2016-"
  region => "eu-west-1"
  sincedb_path => "/tmp/last-s3-file-s3-access-logs-eu-west-1"
  # copy processed objects back into the same bucket under a new prefix...
  backup_add_prefix => "logstash-"
  backup_to_bucket => "s3-access-logs-eu-west-1"
  interval => 120
  # ...then remove the originals so they aren't scanned again
  delete => true
}

This config will scan the bucket every 120 seconds for objects whose keys start with

2016-

It will process those objects, then back them up to the same bucket under the prefix

logstash-

then delete them.

This means they won't be found at the next polling interval.
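Taking the key from the debug line in the question as an example, after one pass the original object is gone and only the backup copy remains (illustrative, based on the config above):

before: 2016-06-01-15-21-10-178896183CF6CEBB
after:  logstash-2016-06-01-15-21-10-178896183CF6CEBB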

2 important notes:

  1. You can't use backup_add_prefix by itself (even though the docs suggest you can); it only works in conjunction with backup_to_bucket.

  2. Make sure the IAM account/role you are using to interface with S3 has write permissions for the buckets you are using (otherwise Logstash can't delete/rename objects); see the sketch below.
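If it helps, an IAM policy along these lines should cover it (a rough sketch only, not an exact minimal policy; substitute your own bucket ARN):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::s3-access-logs-eu-west-1"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::s3-access-logs-eu-west-1/*"
    }
  ]
}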