Linux – Tuning Zabbix: What number of processes is deemed reasonable on a server?

linux, monitoring, zabbix

Yes, so I am getting to grips with (and loving) Zabbix, and have started fine-tuning the alerts.

I have an alert that is triggered on a Linux server when it has more than 300 processes.

Now, this is sort of a central server that acts as a firewall and runs a bunch of stuff, namely proxy/httpd-server/mysql/open-vpn/zabbix.

Is there anything to look out for before I bump the alert trigger up to 350 processes?

The CPU load is still quite low, so I was thinking I should check other things before raising the threshold.

Would I need to check whether the machine is bottlenecked elsewhere, i.e. I/O bound?

Any good advice for this or good documentation (hopefully well-written and easy to understand) as always would be greatly appreciated.

Best Answer

Like @sam said, it all depends on what the server is doing and how beefy the server hardware is. Running only a handful of extremely CPU-, memory- and/or I/O-intensive processes can easily overload even a powerful server. In particular, if something makes your server swap, everything will slow to a crawl.

On the other hand, something like a Postfix server can easily have a process count in the hundreds or thousands, as everything Postfix does is very lightweight.

In my opinion, monitoring (or at least alerting on) the global process count is not useful. However, if you know for sure that there should never be more than X instances of some process, then monitor that and raise an alert when there are suddenly more than X of them.
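To illustrate the per-process idea, here is a minimal Python sketch, not a Zabbix feature. It assumes a Linux /proc filesystem that exposes /proc/<pid>/comm, and the function names and the limits in the usage note are made up for the example.

```python
import os
from collections import Counter


def count_processes(proc_root="/proc"):
    """Count running processes by command name, read from <proc_root>/<pid>/comm."""
    counts = Counter()
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only the numeric directories are processes
        try:
            with open(os.path.join(proc_root, entry, "comm")) as fh:
                # Note: the kernel truncates comm to 15 characters.
                counts[fh.read().strip()] += 1
        except OSError:
            continue  # the process exited while we were scanning
    return counts


def over_limit(counts, limits):
    """Return {name: (count, limit)} for every process name above its limit.

    `counts` is expected to be a Counter, so missing names count as 0.
    """
    return {
        name: (counts[name], limit)
        for name, limit in limits.items()
        if counts[name] > limit
    }
```

Calling `over_limit(count_processes(), {"httpd": 50, "mysqld": 2})` returns only the names that exceed their (hypothetical) limit, which is the condition worth alerting on, rather than the global total.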

You can also graph the number of certain processes to see trends: for example, I tend to graph the Cyrus IMAP/POP process count so I can see whether it hovers anywhere near the current hard limits.

If some processes misbehave in predictable ways, you can use something like psmon to restart or kill them automatically (with optional logging or e-mail notification of the events psmon handled). Zabbix can be used for this too, but psmon is very easy to configure for this kind of task.

What I would graph and monitor

In general, graph (and monitor) at least the following:

load average
memory usage
disk usage
cpu usage
amount of network traffic
number of certain individual processes, if you need to
response times for your services
server uptime (can be a very useful graph; if some server starts to misbehave and needs to be rebooted often, it's easy to spot from the graphs the moment problems started)
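As a rough idea of where two of these numbers come from on Linux, here is a small Python sketch that parses the text of /proc/loadavg and /proc/meminfo. The function names are my own, and a Zabbix agent already collects these values for you; this only shows what is being measured.

```python
def parse_loadavg(text):
    """Return the 1, 5 and 15 minute load averages from /proc/loadavg content."""
    one, five, fifteen = text.split()[:3]
    return float(one), float(five), float(fifteen)


def parse_meminfo(text):
    """Return /proc/meminfo fields as a dict of integers (kB where a unit is given)."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            info[key.strip()] = int(fields[0])
    return info


def swap_used_kb(meminfo):
    """Swap currently in use, derived from SwapTotal and SwapFree."""
    return meminfo["SwapTotal"] - meminfo["SwapFree"]
```

On a real machine you would feed these with `open("/proc/loadavg").read()` and `open("/proc/meminfo").read()`. A steadily growing swap figure is usually a better warning sign than a raw process count.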

Then monitor at least the following:

whether the processes that should be up are responding correctly; in my opinion, just testing whether the port is open or the process is present is not enough. Instead, if you want to check that a web server is running, see if it returns HTTP 200 OK and preferably check that the test page contains some expected strings.
server ping. If ping fails, alert immediately.
kernel logs for severe things such as I/O errors, failed paths in SAN environment multipath configuration, kernel panics, OOM events, and so on
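The web server check described in the list above could look roughly like this in Python, using only the standard library. This is a sketch, not how Zabbix web scenarios are implemented; the function names, the URL and the expected string in the usage note are placeholders.

```python
import urllib.error
import urllib.request


def response_is_healthy(status, body, expected_text):
    """Healthy means: HTTP 200 OK and the page contains the expected text."""
    return status == 200 and expected_text in body


def check_url(url, expected_text, timeout=5.0):
    """Fetch url and apply response_is_healthy; network or HTTP errors count as failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return response_is_healthy(resp.status, body, expected_text)
    except (urllib.error.URLError, OSError):
        # Covers refused connections, timeouts and non-2xx HTTP errors.
        return False
```

For example, `check_url("http://localhost/test.html", "expected string")` returns False when the port is open but the server answers with an error page, which a plain port check would miss.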

I hope this helps you. :)

Related Solutions

Linux Server Performance – What Limits Maximum Number of Connections?

I finally found the setting that was really limiting the number of connections: net.ipv4.netfilter.ip_conntrack_max. This was set to 11,776 and whatever I set it to is the number of requests I can serve in my test before having to wait tcp_fin_timeout seconds for more connections to become available. The conntrack table is what the kernel uses to track the state of connections so once it's full, the kernel starts dropping packets and printing this in the log:

Jun  2 20:39:14 XXXX-XXX kernel: ip_conntrack: table full, dropping packet.

The next step was getting the kernel to recycle all those connections in the TIME_WAIT state rather than dropping packets. I could get that to happen either by turning on tcp_tw_recycle or increasing ip_conntrack_max to be larger than the number of local ports made available for connections by ip_local_port_range. I guess once the kernel is out of local ports it starts recycling connections. This uses more memory tracking connections but it seems like the better solution than turning on tcp_tw_recycle since the docs imply that that is dangerous.

With this configuration I can run ab all day and never run out of connections:

net.ipv4.netfilter.ip_conntrack_max = 32768
net.ipv4.tcp_tw_recycle = 0
net.ipv4.tcp_tw_reuse = 0
net.ipv4.tcp_orphan_retries = 1
net.ipv4.tcp_fin_timeout = 25
net.ipv4.tcp_max_orphans = 8192
net.ipv4.ip_local_port_range = 32768    61000
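To sanity-check the reasoning about ip_conntrack_max versus the local port range, the arithmetic can be sketched in Python. This assumes both ends of ip_local_port_range are inclusive; the helper names are made up for the example.

```python
def parse_sysctl(text):
    """Parse 'key = value' lines into a dict of strings."""
    settings = {}
    for line in text.splitlines():
        if "=" in line:
            key, value = line.split("=", 1)
            settings[key.strip()] = value.strip()
    return settings


def local_port_count(port_range):
    """Number of ports in a 'low high' range string, assuming both ends are inclusive."""
    low, high = (int(part) for part in port_range.split())
    return high - low + 1


config = """
net.ipv4.netfilter.ip_conntrack_max = 32768
net.ipv4.ip_local_port_range = 32768    61000
"""

settings = parse_sysctl(config)
ports = local_port_count(settings["net.ipv4.ip_local_port_range"])
conntrack_max = int(settings["net.ipv4.netfilter.ip_conntrack_max"])

# 61000 - 32768 + 1 = 28233 local ports, which is below the conntrack limit.
print(ports, conntrack_max, conntrack_max > ports)
```

With these values the conntrack table (32768 entries) is larger than the pool of 28233 local ports, which matches the condition described above.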

The tcp_max_orphans setting didn't have any effect on my tests and I don't know why. I would think it would close the connections in TIME_WAIT state once there were 8192 of them but it doesn't do that for me.

Linux – the maximum port number

(2^16)-1 = 65,535. A port number is a 16-bit value, so the range is 0-65,535, and port 0 is reserved and unavailable. (edited because o_O Tync reminded me that we can't use port 0, and Steve Folly reminded me that you asked for the highest port, not the number of ports)

But you're probably going about this the wrong way. There are people who argue for and against non-standard ports. I say they're irrelevant except to the most casual scanner, and the most casual scanner can be kept at bay by using up-to-date software and proper firewall techniques, along with strong passwords. In other words, security best practices.
