Elasticsearch Delayed Indexing

elasticsearch, logstash

I currently have the following setup:

syslog-ng servers -> Logstash -> Elasticsearch

The syslog-ng servers are load balanced and write to a SAN location, where Logstash tails the files and sends them to ES. The syslog cluster currently receives around 1,300 events/sec of networking logs. The issue I'm running into is a gradually growing delay before the logs actually become searchable in ES. When I started the cluster (4 nodes), indexing was dead on; then it fell a few minutes behind, and now, after 4 days, it's ~35 minutes behind. I can confirm the logs are being written in real time on the syslog-ng servers, and I can also confirm that my 4 other indexes, which use the same approach but a different Logstash instance, are staying up to date. However, their volume is significantly lower (~500 events/sec).

It appears the Logstash instance reading the flat files is not able to keep up. I've already split these files out once and spawned 2 Logstash instances to help, but I'm still falling behind.
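For reference, the split means each instance tails a disjoint subset of the files with its own sincedb. A minimal sketch of the second instance's input (the path and sincedb name here are assumptions, not my actual values):

input {
    file {
      type => "network-syslog"
      exclude => ["*.gz"]
      start_position => "end"
      # Hypothetical subset: each instance must tail different files
      path => [ "/location2/*.log" ]
      # Each instance needs its own sincedb so file offsets don't collide
      sincedb_path => "/etc/logstash/.sincedb-network-2"
    }
}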

Any help would be greatly appreciated.

Typical inputs are Cisco ASA logs, mainly denies and VPN connections:

Jan  7 00:00:00 firewall1.domain.com Jan 06 2016 23:00:00 firewall1 : %ASA-1-106023: Deny udp src outside:192.168.1.1/22245 dst DMZ_1:10.5.1.1/33434 by access-group "acl_out" [0x0, 0x0]
Jan  7 00:00:00 firewall2.domain.com %ASA-1-106023: Deny udp src console_1:10.1.1.2/28134 dst CUSTOMER_094:2.2.2.2/514 by access-group "acl_2569" [0x0, 0x0]

Here is my Logstash config.

input {
    file {
      type => "network-syslog"
      exclude => ["*.gz"]
      start_position => "end"
      path => [ "/location1/*.log", "/location2/*.log" ]
      sincedb_path => "/etc/logstash/.sincedb-network"
    }
}

filter {
    grok {
      overwrite => [ "message", "host" ]
      patterns_dir => "/etc/logstash/logstash-2.1.1/vendor/bundle/jruby/1.9/gems/logstash-patterns-core-2.0.2/patterns"
      match => [
        "message", "%{SYSLOGTIMESTAMP:syslog_timestamp} %{SYSLOGHOST:host} %%{CISCOTAG:ciscotag}: %{GREEDYDATA:message}",
        "message", "%{SYSLOGTIMESTAMP:syslog_timestamp} %{SYSLOGHOST:host} %{GREEDYDATA:message}"
      ]
    }
    grok {
      match => [
        "message", "%{CISCOFW106001}",
        "message", "%{CISCOFW106006_106007_106010}",
        "message", "%{CISCOFW106014}",
        "message", "%{CISCOFW106015}",
        "message", "%{CISCOFW106021}",
        "message", "%{CISCOFW106023}",
        "message", "%{CISCOFW106100}",
        "message", "%{CISCOFW110002}",
        "message", "%{CISCOFW302010}",
        "message", "%{CISCOFW302013_302014_302015_302016}",
        "message", "%{CISCOFW302020_302021}",
        "message", "%{CISCOFW305011}",
        "message", "%{CISCOFW313001_313004_313008}",
        "message", "%{CISCOFW313005}",
        "message", "%{CISCOFW402117}",
        "message", "%{CISCOFW402119}",
        "message", "%{CISCOFW419001}",
        "message", "%{CISCOFW419002}",
        "message", "%{CISCOFW500004}",
        "message", "%{CISCOFW602303_602304}",
        "message", "%{CISCOFW710001_710002_710003_710005_710006}",
        "message", "%{CISCOFW713172}",
        "message", "%{CISCOFW733100}",
        "message", "%{GREEDYDATA}"
      ]
    }
    syslog_pri { }
    date {
      match => [ "syslog_timestamp", "MMM  d HH:mm:ss",
                 "MMM dd HH:mm:ss" ]
      target => "@timestamp"
    }
    mutate {
      remove_field => [ "syslog_facility", "syslog_facility_code", "syslog_severity", "syslog_severity_code"]
    }
}

output {
    elasticsearch {
      hosts => ["server1","server2","server3"]
      index => "network-%{+YYYY.MM.dd}"
      template => "/etc/logstash/logstash-2.1.1/vendor/bundle/jruby/1.9/gems/logstash-output-elasticsearch-2.2.0-java/lib/logstash/outputs/elasticsearch/elasticsearch-network.json"
      template_name => "network"
    }
}

Best Answer

You can tell LS to start more filter workers per instance with the -w N command-line option, where N is the number of workers.

That should increase your event throughput substantially.

I don't know your exact server layout, but it's probably safe to start with half as many workers as you have cores on your LS boxes; adjust that based on what other functions those machines are performing.
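As a minimal sketch, assuming the 2.1.1 install path shown in the config above and a hypothetical config-file location, starting an instance with 4 workers on an 8-core box would look like:

# Install and config paths are assumptions; adjust to your layout.
# -w 4 = 4 filter workers, half the cores on an 8-core box.
/etc/logstash/logstash-2.1.1/bin/logstash -f /etc/logstash/network.conf -w 4

Watch CPU on the Logstash boxes after the change; if grok is the bottleneck, more workers should raise the indexing rate until the cores saturate.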