We are in the process of deploying an ELK stack and need advice and general recommendations regarding the performance of the cluster and more specifically, logstash.
The current setup is 1 Kibana node, 2 Logstash nodes, and 4 Elasticsearch nodes. The Logstash nodes have 8 vCPUs and 32 GB RAM each and are fed syslog data through nginx acting as a load balancer. The Elasticsearch nodes have 8 vCPUs and 64 GB RAM each. The heap size has been set to half of RAM on all nodes.
We are currently processing about 4,000-5,000 events/second but plan to scale to considerably more. At the current volume, both Logstash nodes are running at about 90% CPU. We do process the logs with a few filters before sending them to Elasticsearch. Here they are:
3000-filter-syslog.conf:
filter {
  if "syslog" in [tags] and "pre-processed" not in [tags] {
    if "%ASA-" in [message] {
      mutate {
        add_tag => [ "pre-processed", "Firewall", "ASA" ]
      }
      grok {
        match => ["message", "%{CISCO_TAGGED_SYSLOG} %{GREEDYDATA:cisco_message}"]
      }
      syslog_pri { }
      if "_grokparsefailure" not in [tags] {
        mutate {
          rename => ["cisco_message", "message"]
          remove_field => ["timestamp"]
        }
      }
      grok {
        match => [
          "message", "%{CISCOFW106001}",
          "message", "%{CISCOFW106006_106007_106010}",
          "message", "%{CISCOFW106014}",
          "message", "%{CISCOFW106015}",
          "message", "%{CISCOFW106021}",
          "message", "%{CISCOFW106023}",
          "message", "%{CISCOFW106100}",
          "message", "%{CISCOFW110002}",
          "message", "%{CISCOFW302010}",
          "message", "%{CISCOFW302013_302014_302015_302016}",
          "message", "%{CISCOFW302020_302021}",
          "message", "%{CISCOFW305011}",
          "message", "%{CISCOFW313001_313004_313008}",
          "message", "%{CISCOFW313005}",
          "message", "%{CISCOFW402117}",
          "message", "%{CISCOFW402119}",
          "message", "%{CISCOFW419001}",
          "message", "%{CISCOFW419002}",
          "message", "%{CISCOFW500004}",
          "message", "%{CISCOFW602303_602304}",
          "message", "%{CISCOFW710001_710002_710003_710005_710006}",
          "message", "%{CISCOFW713172}",
          "message", "%{CISCOFW733100}"
        ]
      }
    }
  }
}
3010-filter-jdbc.conf:
filter {
  if "syslog" in [tags] {
    jdbc_static {
      loaders => [
        {
          id => "elkDevIndexAssoc"
          query => "select * from elkDevIndexAssoc"
          local_table => "elkDevIndexAssoc"
        }
      ]
      local_db_objects => [
        {
          name => "elkDevIndexAssoc"
          index_columns => ["cenDevIP"]
          columns => [
            ["cenDevSID", "varchar(255)"],
            ["cenDevFQDN", "varchar(255)"],
            ["cenDevIP", "varchar(255)"],
            ["cenDevServiceName", "varchar(255)"]
          ]
        }
      ]
      local_lookups => [
        {
          id => "localObjects"
          query => "select * from elkDevIndexAssoc WHERE cenDevIP = :host"
          parameters => {host => "[host]"}
          target => "cendotEnhanced"
        }
      ]
      # using add_field here to add & rename values to the event root
      add_field => { cendotFQDN => "%{[cendotEnhanced][0][cendevfqdn]}" }
      add_field => { cendotSID => "%{[cendotEnhanced][0][cendevsid]}" }
      add_field => { cendotServiceName => "%{[cendotEnhanced][0][cendevservicename]}" }
      remove_field => ["cendotEnhanced"]
      jdbc_user => "user"
      jdbc_password => "password"
      jdbc_driver_class => "com.mysql.jdbc.Driver"
      jdbc_driver_library => "/usr/share/java/mysql-connector-java-8.0.11.jar"
      jdbc_connection_string => "jdbc:mysql://84.19.155.71:3306/logstash?serverTimezone=Europe/Stockholm"
      #jdbc_default_timezone => "Europe/Stockholm"
    }
  }
}
Is there any way to debug what is consuming so much CPU? Does anyone have recommendations, since we need to be able to process a much higher volume of logs?
Here is the output from jstat:
jstat -gc 56576
S0C S1C S0U S1U EC EU OC OU MC MU CCSC CCSU YGC YGCT FGC FGCT GCT
68096.0 68096.0 0.0 68096.0 545344.0 66712.9 30775744.0 10740782.3 113316.0 93805.9 16452.0 13229.3 1341 146.848 6 0.449 147.297
Thanks
Best Answer
Here are some tips to help you along with your performance tuning mission.
Use multiple pipelines where possible
Logstash 6.0 introduced the ability to easily run multiple pipelines. You can use this to split out event-processing logic where it makes sense, e.g. if you can distinguish two or more types of inputs/outputs and the filtering steps in between.
Have a read here and here for some tips on using multiple pipelines.
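As a rough sketch (the pipeline IDs and config paths below are made up; adjust them to your layout), a pipelines.yml that splits the ASA firewall traffic from the rest of the syslog stream might look like:

```yaml
# /etc/logstash/pipelines.yml -- hypothetical split; ids, paths and worker
# counts are examples, not your actual values
- pipeline.id: syslog-asa
  path.config: "/etc/logstash/conf.d/asa/*.conf"
  pipeline.workers: 4
- pipeline.id: syslog-generic
  path.config: "/etc/logstash/conf.d/generic/*.conf"
  pipeline.workers: 4
```

Each pipeline then only carries the filters its events actually need, so the generic syslog stream never has to evaluate the ASA grok patterns at all.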
Conditional logic
Next, see if you can reduce the conditional logic in your filters. The more if/else logic you have, the more CPU-intensive things get for Logstash.
Get hold of some valuable stats to see what is causing high CPU usage
You should definitely use the Node Stats API for Logstash to see what is going on inside your current event processing pipeline.
You can also query other stats types (for example, try pipelines as well as process). Check out this page for more info on using the API to query your Logstash stats. It will more than likely tell you where the really intensive work is happening.
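For example, assuming Logstash's monitoring API is listening on its default port 9600 on the Logstash host, you could pull the per-plugin timings like this:

```shell
# Per-pipeline stats: event counts and per-plugin timings
curl -s 'http://localhost:9600/_node/stats/pipelines?pretty'

# Process-level stats: CPU, memory, open file descriptors
curl -s 'http://localhost:9600/_node/stats/process?pretty'
```

In the pipelines output, compare each filter plugin's duration_in_millis against its events in/out; if the grok or jdbc_static entry dominates the total, that is your hot spot.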
Good luck!