FreeBSD – Redis connection issue

Tags: connection, freebsd, redis, time-wait

We are currently experiencing a lot of Redis errors with the message

Unable to connect: read error on connection, trying next server

We run Redis on FreeBSD and access it through phpredis. We have a hard time reproducing the error on Ubuntu, so the platform might be a hint. There's a long-running issue on this topic on GitHub.

Basically, we get a socket from the operating system with a call to connect(host, port, timeout) in phpredis, but when we call select(db_index) afterwards, we get an exception.
Could there be an issue with persistence? I assume that connect does nothing in the background and that select then tries to use a connection that is actually closed.
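One way to check whether the failure happens at connect time or on the first read is to time the two phases separately. The following is a minimal Python sketch of such a probe, not the phpredis code path: the function name probe and its return shape are made up for illustration, and it assumes the server has no password set so that an inline PING is answered with +PONG.

```python
import socket
import time


def probe(host, port, timeout=1.0):
    """Time the TCP connect and one PING round trip separately.

    Returns (connect_seconds, reply_seconds, reply_bytes).
    Raises OSError (socket.timeout is a subclass) if either phase fails.
    An empty reply means the peer closed the connection before answering.
    """
    t0 = time.monotonic()
    # The timeout applies to the connect and to the later send/recv calls.
    sock = socket.create_connection((host, port), timeout=timeout)
    t1 = time.monotonic()
    try:
        sock.sendall(b"PING\r\n")   # Redis accepts inline commands
        reply = sock.recv(64)       # b"+PONG\r\n" on an unauthenticated server
        t2 = time.monotonic()
    finally:
        sock.close()
    return t1 - t0, t2 - t1, reply
```

Running this in a loop against the affected server and logging the two durations should show whether connect succeeds quickly while the first reply is slow or missing, which is the pattern described above.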

We don't run into a timeout. We tried tuning TIME_WAIT without success.

Any other ideas on where the problem might come from?
What is the best way to track the issue down? DTrace, maybe?

Update

We are currently looking into our BGSAVE settings. Interestingly, it takes half a second or more to fork the process that regularly writes the data to disk (persistence), and Redis may be unable to respond to connect() requests during that time.
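The duration of the last fork can be read from the server itself: the INFO command reports it in the latest_fork_usec field (microseconds). Below is a small Python sketch that extracts it from the raw INFO text. The helper names and the sample reply are made up for illustration, and the sample value is not a measurement from our servers.

```python
def parse_info(text):
    """Parse the text reply of Redis INFO into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):   # skip blanks and section headers
            continue
        key, sep, value = line.partition(":")
        if sep:
            fields[key] = value
    return fields


def fork_millis(info_text):
    """Return the last fork duration in milliseconds, or None if absent."""
    usec = parse_info(info_text).get("latest_fork_usec")
    return None if usec is None else int(usec) / 1000.0


# Hypothetical INFO excerpt, for illustration only.
SAMPLE = "# Stats\r\ntotal_connections_received:1024\r\nlatest_fork_usec:512000\r\n"
print(fork_millis(SAMPLE))
```

Comparing this value with the timestamps of the error blocks is a cheap way to test the fork hypothesis before reaching for heavier tracing.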

Best Answer

We reduced the error rate by 90% with the following redis command:

CONFIG SET save ""

This disables the automatic BGSAVE, which regularly writes the dataset to disk. The connect errors most likely come from the blocking fork() call that the main Redis process makes to start the BGSAVE process.

The redis.conf says:

# Redis may block too long on the fsync() call. Note that there is no fix for
# this currently, as even performing fsync in a different thread will block
# our synchronous write(2) call.

Also see how the mechanism is implemented with a simple fork() here. We are thinking about dedicating one Redis server from our pool to the BGSAVE operations and using the others only for reading and writing.

From IRC chat, it seems that a couple of other companies ran into the same error. Bump was using a master/slave system as well: the slave does not accept connections and is only there to persist the data (see the discussion on Hacker News here).

Hulu says the following: "To keep performance consistent on the shards, we disabled the writing to disk across all the shards, and we have a cron job that runs at 4am everyday doing a rolling “BGSAVE” command on each individual instance." (see here)
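A rolling BGSAVE like the one Hulu describes could be sketched roughly as follows. This is not Hulu's script. The send callable is a hypothetical hook that sends one command to one instance and returns the reply text (for example a thin wrapper around redis-cli), and the polling relies on the rdb_bgsave_in_progress field of INFO persistence, which older Redis versions name differently.

```python
import time


def rolling_bgsave(instances, send, poll_seconds=1.0, sleep=time.sleep):
    """Run BGSAVE on one instance at a time, waiting for each to finish.

    instances: iterable of opaque handles, e.g. "host:port" strings.
    send: hypothetical callable send(instance, command) -> reply text.
    """
    for inst in instances:
        send(inst, "BGSAVE")
        # rdb_bgsave_in_progress is 1 while the child is still writing the dump.
        while "rdb_bgsave_in_progress:1" in send(inst, "INFO persistence"):
            sleep(poll_seconds)
```

Saving one instance at a time means that at most one fork and one disk-heavy child process are active at any moment, so the other shards keep their normal latency.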

Edit:

It turns out that this was just a temporary fix. Load increased and we are back at the high error rates. Nevertheless, I'm quite confident that a background operation (e.g. a fork, or a short-running background process) is causing the errors, as the error messages always appear in blocks.

Edit2:

Since Redis is single-threaded, always keep an eye on long-running operations, because they block everything else. An example is the KEYS * command. Avoid it and use SCAN instead.
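For illustration, the cursor loop behind SCAN looks roughly like this in Python. The sketch takes a scan callable so it does not depend on a particular client; it assumes the callable takes (cursor, match, count) and returns (next_cursor, keys), which is the shape redis-py's scan method uses. The name iter_keys is made up.

```python
def iter_keys(scan, match="*", count=100):
    """Yield keys page by page using SCAN-style cursor iteration.

    scan: callable (cursor, match, count) -> (next_cursor, keys).
    Each call does a small, bounded amount of work on the server,
    unlike KEYS, which walks the whole keyspace in a single command.
    """
    cursor = 0
    while True:
        cursor, keys = scan(cursor, match, count)
        for key in keys:
            yield key
        if int(cursor) == 0:   # a returned cursor of 0 ends the iteration
            break
```

Note that SCAN can return the same key more than once and can return empty pages, so callers that need uniqueness have to deduplicate themselves.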
