Magento Enterprise 1.13.1 Redis cluster with failover automation

Tags: cache, magento-enterprise, performance, redis

One of the answers above mentions the use of:

Cache hosts (by parhamr)

There are two hosts running Redis in a master-slave configuration with automated failover. Three Redis instances are used to increase throughput and provide fine-tuning of persistence behaviors.

I can't figure out a way to add more than one Redis instance using the native Magento Enterprise 1.13.1 integration. How do you set this up in the config files to fail over from one Redis instance to another, or do you do read/write separation? I don't see a way short of a dynamic DNS entry or an additional load balancer.

Dynamic DNS won't do any good if one of the instances goes down; a load balancer, on the other hand, will continuously monitor the instances and route only to "active" ones. But is there a solution at the code/config level that I can use out of the box?

Thank you.

Best Answer

There are two things going on here: 1) division of Redis functionality across instances, and 2) failover of Redis through Sentinel. My team uses load balancers specifically to support item 2.

Here’s how we did this for a production 1.12 cluster in mid-2013:

Multiple Redis Instances

Edit local.xml (Example config) to point <redis_session />, <cache />, and <full_page_cache /> at three different Redis instances. My team has chosen to run sessions on port 6382 (32 GB limit), backend cache on port 6383 (48 GB limit), and full page cache on port 6384 (12 GB limit).
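A local.xml fragment along these lines would implement that split; treat this as a sketch rather than our exact file, and note that the IP address is a placeholder for whichever address your Redis instances (or, as described below, your load-balanced vservers) answer on:

```xml
<!-- Sketch: app/etc/local.xml, inside <config><global> -->
<config>
  <global>
    <!-- Sessions on port 6382 -->
    <redis_session>
      <host>10.0.1.80</host>
      <port>6382</port>
      <db>0</db>
      <timeout>2.5</timeout>
    </redis_session>
    <!-- Backend cache on port 6383 -->
    <cache>
      <backend>Cm_Cache_Backend_Redis</backend>
      <backend_options>
        <server>10.0.1.80</server>
        <port>6383</port>
        <database>0</database>
      </backend_options>
    </cache>
    <!-- Full page cache on port 6384 -->
    <full_page_cache>
      <backend>Cm_Cache_Backend_Redis</backend>
      <backend_options>
        <server>10.0.1.80</server>
        <port>6384</port>
        <database>0</database>
      </backend_options>
    </full_page_cache>
  </global>
</config>
```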

This architecture lets us fine-tune memory limits and RDB configurations for each cache type, and it also lets us scale Redis to higher aggregate throughput because Redis is single threaded. Provisioning this Ubuntu 12.04 LTS server required duplicating the /etc/redis/*.conf configuration files and the /etc/init.d/redis* scripts, then calling update-rc.d $name defaults, so that each Redis instance has its own log files, can be signaled independently, and is started on system boot.
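A rough sketch of that duplication for one extra instance follows; the instance name, paths, and config keys match the stock Debian/Ubuntu redis-server packaging, but verify them against your own files before running anything like this:

```shell
# Sketch: provision a second Redis instance on port 6383.
# "redis-cache" is a hypothetical instance name; adjust to taste.
name=redis-cache
port=6383

# Duplicate the stock config, then point it at its own port, pidfile, and logfile
sudo cp /etc/redis/redis.conf /etc/redis/${name}.conf
sudo sed -i \
  -e "s/^port .*/port ${port}/" \
  -e "s|^pidfile .*|pidfile /var/run/redis/${name}.pid|" \
  -e "s|^logfile .*|logfile /var/log/redis/${name}.log|" \
  /etc/redis/${name}.conf

# Duplicate the init script so the instance can be signaled independently,
# rewriting its references to the config and pid files
sudo cp /etc/init.d/redis-server /etc/init.d/${name}
sudo sed -i \
  -e "s|redis\.conf|${name}.conf|" \
  -e "s|redis-server\.pid|${name}.pid|" \
  /etc/init.d/${name}

# Start the instance now and on every boot
sudo update-rc.d ${name} defaults
sudo service ${name} start
```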

Load Balancing

The production cluster has one primary server (cache01) and one failover server (cache02) with identical system specifications and Redis configurations (3 instances per server; see above).

My team had two highly available NetScaler appliances on which each Redis instance was defined as a service. We defined vservers for each instance on each host, e.g. production_cache_6382_primary. Each vserver uses a virtual interface with its own allocated private IP address, and those are the addresses local.xml points to. The request path looks like this:

  1. Magento application sends request through PHP Redis client
  2. PHP Redis Client connects to IP address and port configured in local.xml
  3. Request is routed through LACP bonded switches
  4. The switches point to an HA pair of NetScalers
  5. NetScaler primary points to production_cache_6382_primary
  6. The request reaches the configured primary Redis instance

Redis Sentinel

My team has configured Redis Sentinel to use host cache02 as a secondary failover for host cache01. The NetScaler vserver production_cache_6382_primary is set to use the Redis service on port 6382 of host cache02 as its failover. Since the NetScaler TCP uptime check runs every 6 seconds, Sentinel has a generous window in which to promote the secondary service to the primary role. This environment lets Magento’s local.xml keep static, hardcoded IP addresses for the Redis instances while our systems automatically detect and switch which cache server is in service as primary. Here’s how this process works:

  1. NetScaler performs a TCP monitoring operation against the Redis service on port 6382 of host cache01, finds it working
  2. All client requests pointing to the production_cache_6382_primary vserver are sent to the above host
  3. Redis instance 6382 as primary on host cache01 is stopped
  4. Client connections to production_cache_6382_primary start experiencing connection and request failures
  5. Redis Sentinel independently detects the outage of instance 6382 on host cache01 and promotes the secondary instance on cache02 as primary
  6. Six seconds after step 1, the NetScaler performs a TCP monitoring operation against the Redis service on port 6382 of host cache01, finds it down
  7. The NetScaler performs its failover routine and starts sending client requests to the 6382 service on host cache02 instead of cache01
  8. Client connections and requests are still pointing at the same vserver IP address and start to experience successful reads and writes
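The Sentinel monitoring that drives steps 5 and onward might be configured roughly like this per instance; the master name, addresses, quorum, and timeouts here are illustrative assumptions, not our production values:

```
# Sketch: sentinel.conf entry watching the 6382 (session) instance on cache01.
# "session_6382" is a hypothetical master name; 10.0.0.11 stands in for cache01.
sentinel monitor session_6382 10.0.0.11 6382 2
sentinel down-after-milliseconds session_6382 5000
sentinel failover-timeout session_6382 60000
sentinel parallel-syncs session_6382 1
```

The down-after-milliseconds value should be comfortably shorter than the load balancer's check interval so that, as in step 5 above, Sentinel has already promoted the secondary by the time the NetScaler marks the primary service down.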