This is an old question, but here is the method I use for low- to medium-traffic sites (I don't know how well it holds up under heavy traffic):
In Apache, I define a LogFormat nickname called graylog2_access
which renders each access-log entry as a GELF message. I then pipe the log through nc, which sends the GELF messages to a Graylog2 input.
Here is the format it produces, in human-readable form:
{
"version": "1.1",
"host": "%V",
"short_message": "%r",
"timestamp": %{%s}t,
"level": 6,
"_user_agent": "%{User-Agent}i",
"_source_ip": "%a",
"_duration_usec": %D,
"_duration_sec": %T,
"_request_size_byte": %O,
"_http_status": %s,
"_http_request_path": "%U",
"_http_request": "%U%q",
"_http_method": "%m",
"_http_referer": "%{Referer}i"
}
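For the Apache config, here is a copy/paste version on a single line. Note that %O is the number of bytes sent (the response size, despite the field name) and requires mod_logio: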
LogFormat "{ \"version\": \"1.1\", \"host\": \"%V\", \"short_message\": \"%r\", \"timestamp\": %{%s}t, \"level\": 6, \"_user_agent\": \"%{User-Agent}i\", \"_source_ip\": \"%a\", \"_duration_usec\": %D, \"_duration_sec\": %T, \"_request_size_byte\": %O, \"_http_status\": %s, \"_http_request_path\": \"%U\", \"_http_request\": \"%U%q\", \"_http_method\": \"%m\", \"_http_referer\": \"%{Referer}i\" }" graylog2_access
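Before wiring this format into a host, it may help to check that a rendered line really is valid JSON and follows the basic GELF rules (version, host and short_message are required; additional fields carry a leading underscore). The following is only a rough sketch: the function name check_gelf_line and the sample line are made up for illustration, and it does not cover every rule in the GELF specification.

```python
import json

# Fields that GELF requires, and standard fields that need no "_" prefix.
REQUIRED = ("version", "host", "short_message")
OPTIONAL = ("full_message", "timestamp", "level")


def check_gelf_line(line):
    """Return a list of problems found in one rendered log line (hypothetical helper)."""
    try:
        msg = json.loads(line)
    except json.JSONDecodeError as exc:
        return ["not valid JSON: %s" % exc]
    if not isinstance(msg, dict):
        return ["top-level value is not a JSON object"]

    problems = ["missing required field: " + key for key in REQUIRED if key not in msg]
    for key in msg:
        if key in REQUIRED or key in OPTIONAL:
            continue
        if not key.startswith("_"):
            problems.append("additional field lacks '_' prefix: " + key)
        elif key == "_id":
            problems.append("'_id' is not allowed as an additional field")
    return problems


# Made-up sample of what Apache might write for one request.
sample = (
    '{ "version": "1.1", "host": "www.example.com", '
    '"short_message": "GET /index.html HTTP/1.1", "timestamp": 1700000000, '
    '"level": 6, "_http_status": 200, "_http_method": "GET" }'
)
print(check_gelf_line(sample))
```

An empty list means the line passed these checks. A line where a numeric directive rendered as a bare "-" would be reported as invalid JSON.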
Then in your host configuration:
CustomLog "|nc -u graylogserver 12201" graylog2_access
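To test the Graylog2 input without going through Apache, something like the Python sketch below can stand in for the nc pipe: it sends a single uncompressed GELF datagram over UDP. The function names and the example values are assumptions for illustration; graylogserver and port 12201 are taken from the CustomLog line above.

```python
import json
import socket
import time


def build_gelf(host, short_message, **extra):
    """Build a GELF 1.1 payload; extra fields get the '_' prefix GELF expects."""
    msg = {
        "version": "1.1",
        "host": host,
        "short_message": short_message,
        "timestamp": int(time.time()),
        "level": 6,  # informational, same level as in the LogFormat
    }
    for key, value in extra.items():
        msg["_" + key] = value
    return json.dumps(msg).encode("utf-8")


def send_gelf(payload, server="graylogserver", port=12201):
    """Send one uncompressed GELF datagram over UDP, as `nc -u` would."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (server, port))
```

For example, send_gelf(build_gelf("www.example.com", "GET / HTTP/1.1", http_status=200)) should make one test message show up on the input, assuming the host name resolves and a GELF UDP input is listening on that port.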
Alternatively, try this version, written in rsyslog's legacy configuration format:
# Forward apache logs to graylog2 server
$ModLoad imfile # needs to be done just once
$InputFileName /var/log/httpd/access.log
$InputFileTag ApacheAccessLog:
$InputFileStateFile access.log.statefile
$InputFileFacility local4
$InputFileSeverity info
$InputRunFileMonitor
$InputFileName /var/log/httpd/error.log
$InputFileTag ApacheErrorLog:
$InputFileStateFile error.log.statefile
$InputFileFacility local4
$InputFileSeverity error
$InputRunFileMonitor
local4.* @@log.ospreyreach.com:12514
& stop
You can add similar entries for your other log files.
After that, create extractors on your Graylog2 server for the 12514/TCP input.
These split each message into fields, which gives you fine-grained options for graphs etc.
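To get a feel for what a regex-style extractor would pull out of an access-log message, here is a small Python sketch. It assumes the access log is written in Apache's Common Log Format, which the config above does not actually state, and the field names are made up for illustration. The pattern may need adjusting before it is used in Graylog2 itself.

```python
import re

# Assumes Common Log Format: ip ident user [time] "method path protocol" status bytes
CLF = re.compile(
    r'(?P<source_ip>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\S+)'
)


def extract_fields(message):
    """Return the named fields from one access-log message, or None if it does not match."""
    match = CLF.search(message)
    return match.groupdict() if match else None


# Made-up sample line.
line = '192.0.2.10 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326'
print(extract_fields(line))
```

Each named group corresponds to one field you might want to graph or filter on, such as the status code or the request path.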
Best Answer
The answer is simple: to save disk space and memory. If you kept everything, you would eventually run out of disk space. Every open index also requires a certain amount of memory, so keeping more indices open will eventually exhaust the cluster's RAM. This setting is simply a way to configure how much space Graylog is allowed to take. If you want to keep more indices, increase the maximum number of indices.
The indices are numbered sequentially, and you can restore an older index and access it if you really have to.
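As a rough illustration of the idea (not Graylog's actual code), the sketch below models a capped set of sequentially numbered indices: each rotation adds the next index and drops the oldest ones beyond the limit. The function name and the graylog2 prefix are assumptions for the example; what really happens to the dropped indices depends on how retention is configured.

```python
def rotate(indices, max_indices, prefix="graylog2"):
    """Add the next sequential index, then drop the oldest ones beyond max_indices.

    Returns (kept, dropped). Purely illustrative; index names are hypothetical.
    """
    if indices:
        next_number = int(indices[-1].rsplit("_", 1)[1]) + 1
    else:
        next_number = 0
    indices = indices + ["%s_%d" % (prefix, next_number)]
    dropped = indices[:-max_indices] if len(indices) > max_indices else []
    return indices[-max_indices:], dropped


kept, dropped = rotate(["graylog2_0", "graylog2_1", "graylog2_2"], max_indices=3)
print(kept, dropped)
```

Raising max_indices in this model simply keeps more of the older indices around, at the cost of the extra disk space and memory described above.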