CPU overuse while replicating a Gluster volume

glusterfs

I have this scenario, with three servers:

srv01
srv02
srv03

There is a Gluster volume "vol1" running on srv03, and all the servers can use it for I/O. vol1 contains a lot of images of mixed sizes, ranging from a few KB to 3-4 MB. The total amount is about 1.5 TB.

Gluster version is 3.6.2

It's not a silver bullet and needs some tuning, but it works pretty well.

Now I have to replicate srv03's brick to the other servers.

The problem is that srv03's CPU skyrockets to 100% and it cannot serve normal requests. Network traffic is low.

The volume options are:

cluster.data-self-heal-algorithm: full

cluster.self-heal-daemon: off

performance.cache-size: 1gb

I have to keep the service running while the replication is in progress. Your suggestions are welcome.

Best Answer

I am working on a similar situation. If your bottleneck is the CPU, I think that decreasing cluster.background-self-heal-count should help (the default is 16). With the default, "when your client tries to open 17 files, it'll hang on the 17th waiting for a self-heal" (https://botbot.me/freenode/gluster/msg/45681458/).
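As a rough sketch, the option can be changed with gluster volume set. The value 4 below is only an assumption for illustration, not a tested recommendation; the script just prints the command so it can be reviewed before running it on one of the servers.

```shell
#!/bin/sh
# Sketch: lower the number of background self-heals on a volume.
# VOLUME comes from the question; COUNT=4 is an illustrative assumption.
VOLUME="${VOLUME:-vol1}"
COUNT="${COUNT:-4}"

# Build the command as text (dry run) instead of executing it.
heal_count_cmd() {
    printf 'gluster volume set %s cluster.background-self-heal-count %s\n' "$1" "$2"
}

heal_count_cmd "$VOLUME" "$COUNT"
```

Run the printed line as root, then watch the CPU on srv03 and adjust the count up or down from there.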

Related Solutions

Why can’t I create this gluster volume

I was seeing an obscure error message about an unconnected socket with peer 127.0.0.1.

[2013-08-16 00:36:56.765755] W [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading from socket failed. Error (Transport endpoint is not connected), peer (127.0.0.1:1022)

It turns out the problem I was having was due to NAT. I was trying to create gluster servers that were behind a NAT device and use the public IP to resolve the names. This is just not going to work properly for the local machine.

What I had was something like the following on each node.

A hosts file containing

192.168.0.11  gluster1
192.168.0.12  gluster2
192.168.0.13  gluster3
192.168.0.14  gluster4

The fix was to remove the trusted peers first:

sudo gluster peer detach gluster2
sudo gluster peer detach gluster3
sudo gluster peer detach gluster4

Then change the hosts file on each machine to be

# Gluster1
127.0.0.1     gluster1
192.168.0.12  gluster2
192.168.0.13  gluster3
192.168.0.14  gluster4

# Gluster2
192.168.0.11  gluster1
127.0.0.1     gluster2
192.168.0.13  gluster3
192.168.0.14  gluster4

and so on for gluster3 and gluster4.
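The per-node pattern above (own name on 127.0.0.1, every other peer on its private address) could be generated with a small helper. This is a minimal sketch; the function name hosts_for is made up for illustration, and the names and addresses are the ones from the example.

```shell
#!/bin/sh
# Private address and name of every peer, as in the hosts file above.
PEERS="192.168.0.11:gluster1 192.168.0.12:gluster2 192.168.0.13:gluster3 192.168.0.14:gluster4"

# Print the hosts entries for one node: the node's own name is mapped
# to 127.0.0.1, all other peers keep their private address.
hosts_for() {
    self="$1"
    for entry in $PEERS; do
        ip="${entry%%:*}"      # part before the colon
        name="${entry##*:}"    # part after the colon
        if [ "$name" = "$self" ]; then
            ip="127.0.0.1"
        fi
        printf '%-13s %s\n' "$ip" "$name"
    done
}

# Entries for the node called gluster2
hosts_for gluster2
```

The output is meant to be reviewed and merged into /etc/hosts by hand on each node, not written there blindly.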

Then peer probe the other nodes and finally create the volume, which then succeeded.
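Seen from gluster1, those last two steps might look roughly like this. The volume name and brick path are hypothetical placeholders (the original setup does not name them), and no replica or stripe count is given, so this would create a plain distributed volume. The script only prints the commands.

```shell
#!/bin/sh
# Dry run: print the commands instead of executing them.
# VOLNAME and BRICK are hypothetical; replace them with your own values.
VOLNAME="${VOLNAME:-myvol}"
BRICK="${BRICK:-/export/brick1}"

setup_cmds() {
    # Re-add the peers that were detached earlier.
    for peer in gluster2 gluster3 gluster4; do
        echo "sudo gluster peer probe $peer"
    done
    # Create the volume with one brick per node.
    echo "sudo gluster volume create $VOLNAME gluster1:$BRICK gluster2:$BRICK gluster3:$BRICK gluster4:$BRICK"
}

setup_cmds
```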

I doubt that using IP addresses (the public ones) will work in this case. It should work if you use the private addresses behind your NAT. In my case, each server was behind a NAT in the AWS cloud.

“gluster volume status” errors out

According to the description of the network.frame-timeout option, an "operation has to be declared as dead, if the server does not respond for a particular operation" within the timeout, which is 1800 seconds (30 minutes) by default (old, but possibly still valid: http://www.gluster.org/community/documentation/index.php/Gluster_3.2:_Setting_Volume_Options#network.frame-timeout).

The logs may also contain useful information.
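To cut the logs down to the interesting lines, something like the following may help. It assumes the log format shown in the socket warning quoted earlier (timestamp in brackets, then a one-letter severity); the function name and the log path in the comment are assumptions and may differ on your installation.

```shell
#!/bin/sh
# Print only warning (W), error (E) and critical (C) lines from a
# Gluster log file. Expects lines of the form:
#   [timestamp] W [source.c:line:function] ...
gluster_log_problems() {
    grep -E '^\[[^]]+\] [WEC] \[' "$1"
}

# Example call (path is an assumption, adjust as needed):
# gluster_log_problems /var/log/glusterfs/etc-glusterfs-glusterd.vol.log
```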
