Inexplicable GlusterFS load and poor performance

glusterfs

We are running a replicated GlusterFS setup on 2 server nodes, with 2 clients. The self-heal daemon is enabled. Each client is connected to a different server using the Gluster client. We have a lot of very small files on the Gluster volume.

We use GlusterFS 3.9.1 from the official GlusterFS APT Repositories on a Debian Jessie system.

The issue we are facing is that one of the servers has a load average of 0.1-0.5 while the other is at 200. The server with the high load also constantly streams a huge amount of data to both client nodes.
This data stream continues even when the clients are not reading or writing data.

The following values were measured with nload and iftop:

server: outgoing 35-40 MB/s

Two clients: incoming 17-20 MB/s

Performance on the Gluster clients is very poor. An ls can take up to 10 seconds to complete, and our app runs extremely slowly.

Servers and clients are connected over an internal data-center network that can handle much more bandwidth, so this is not a limiting factor.

My two main questions are:

1: Are these differences in server load normal behavior for GlusterFS and what causes this?

2: Why is there such a high constant data stream to clients from one of the servers?

I cannot seem to find any information concerning this in the Gluster Documentation or on the Internet.

Best Answer

> 1: Are these differences in server load normal behavior for GlusterFS and what causes this?

Take a closer look at the source of the load. Where is the bottleneck: CPU, disk I/O, or something else? Tools such as top and iotop can help here.

The high load may be caused by I/O wait.
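On Linux the load average also counts processes blocked in uninterruptible (usually disk) sleep, so a load of 200 does not have to mean 200 busy CPUs. As a rough sketch of how to confirm I/O wait without extra tools, the following compares two samples of the aggregate `cpu` line from `/proc/stat`. The function names are my own, and the field order is taken from the proc(5) man page.

```python
def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into a list of counters."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(value) for value in fields[1:]]


def read_cpu_counters(path="/proc/stat"):
    """Read the current counters; the aggregate line is the first line of the file."""
    with open(path) as handle:
        return parse_cpu_line(handle.readline())


def iowait_percent(before, after):
    """Share of CPU time spent in iowait between two samples, in percent."""
    # Field order: user nice system idle iowait irq softirq steal [guest guest_nice].
    # Only the first eight are summed, because guest time is already part of user/nice.
    deltas = [b - a for a, b in zip(before[:8], after[:8])]
    total = sum(deltas)
    return 100.0 * deltas[4] / total if total else 0.0


# Usage on the loaded server (not executed here):
#   import time
#   first = read_cpu_counters()
#   time.sleep(5)
#   print(iowait_percent(first, read_cpu_counters()))
```

A high percentage here points at the disks (or at a process hammering them) rather than at CPU-bound work.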

Check whether there is enough free memory for Gluster to make use of the cache.
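A minimal sketch for the memory check, assuming a kernel that exposes `MemAvailable` in `/proc/meminfo` (3.14 or later). The helper names are hypothetical, and the output of `free -m` gives the same information.

```python
def parse_meminfo(text):
    """Turn the contents of /proc/meminfo into a dict of integer values (mostly kB)."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            values[key.strip()] = int(parts[0])
    return values


def available_fraction(info):
    """Fraction of total memory the kernel estimates is available without swapping."""
    return info["MemAvailable"] / info["MemTotal"]


# Usage on a server (not executed here):
#   with open("/proc/meminfo") as handle:
#       info = parse_meminfo(handle.read())
#   print(info["Cached"], available_fraction(info))
```

If the available fraction is close to zero, the page cache has little room and more reads will go to disk.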

> 2: Why is there such a high constant data stream to clients from one of the servers?

Verify which program is sending which data to which host. nload and iftop only give you an idea of the traffic for the whole network interface, so try nethogs, which shows the traffic (device, sent, received) per PID.

Writes from clients have to be written on the current gluster-server. The current gluster-server has to send the file to the other gluster-server. When both servers have written the file, the client gets the acknowledgement.

Maybe this procedure "doubles" the network traffic on the gluster-server? Use the network-monitoring tools to see which process is sending the traffic and where it is going.
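Before digging into replication traffic, a quick sanity check on the figures from the question may help. This sketch assumes that "17-20 MB/s" is the incoming rate of each client rather than of both together, which the question does not state explicitly.

```python
# Figures reported in the question, in MB/s, as (low, high) ranges.
server_out = (35, 40)
client_in_each = (17, 20)  # assumption: per-client rate, not the sum of both
n_clients = 2

# Outgoing rate the server would need just to feed the clients.
expected_server_out = (client_in_each[0] * n_clients, client_in_each[1] * n_clients)

# The two ranges overlap if neither lies entirely above the other.
consistent = (expected_server_out[0] <= server_out[1]
              and server_out[0] <= expected_server_out[1])

print(expected_server_out, consistent)  # prints (34, 40) True
```

Under that assumption the client-bound traffic alone accounts for nearly all of the server's outgoing traffic, which is one more reason to check with nethogs where the data is really going.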

General performance concerns:

Check the network latency and see the blog post by Joe Julian (search for "Across high-latency connections").
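If ICMP is filtered and ping is not an option, the TCP connect time is a usable stand-in for the round-trip latency. The sketch below is only an illustration: the function name and the hostname in the usage comment are made up, and port 24007 is the default glusterd management port, so adjust it if your setup differs.

```python
import socket
import time


def tcp_connect_latency_ms(host, port, samples=5, timeout=2.0):
    """Return the TCP connect time in milliseconds for each of `samples` attempts."""
    results = []
    for _ in range(samples):
        start = time.perf_counter()
        # The connection is closed again as soon as the handshake has completed.
        with socket.create_connection((host, port), timeout=timeout):
            elapsed = time.perf_counter() - start
        results.append(elapsed * 1000.0)
    return results


# Usage from a client (hypothetical hostname, not executed here):
#   print(tcp_connect_latency_ms("gluster-server-1", 24007))
```

Every metadata lookup costs at least one such round trip, so even a fraction of a millisecond adds up quickly over many small files.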

An ls taking 10 seconds could be "normal" if there are very many files in the directory. This is because the metadata for every file has to be queried from the gluster-server. See this post explaining performance nfs vs. gluster-client for a little more explanation.
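To see how much of the ls time goes into per-file metadata, you can time a plain name listing against a listing that also stats every entry, which is roughly what `ls -l` does (and often a colorized `ls` too). This is a sketch with a made-up function name; point it at a directory on the Gluster mount.

```python
import os
import time


def time_listing(path):
    """Return (entry count, seconds for names only, seconds for one lstat per entry)."""
    start = time.perf_counter()
    names = os.listdir(path)
    names_seconds = time.perf_counter() - start

    start = time.perf_counter()
    for name in names:
        # One metadata lookup per file, as `ls -l` would do.
        os.lstat(os.path.join(path, name))
    stat_seconds = time.perf_counter() - start

    return len(names), names_seconds, stat_seconds


# Usage (hypothetical mount point, not executed here):
#   print(time_listing("/mnt/gluster/some-directory"))
```

If the second number dominates, the time is spent on metadata round trips rather than on reading the directory itself.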

Maybe http://blog.gluster.org/2016/10/gluster-tiering-and-small-file-performance/ is interesting for you. For me, it helped a little with small-file performance.