Ubuntu – Is OpenLDAP fit for large production deployments?


For about a year we've been using OpenLDAP on Ubuntu Server 10.04 LTS to authenticate about 20 IT users, and everything has been running fine (operations on the LDAP server were basically limited to creating and removing users with Apache Directory Studio).

More recently (6 months ago) we also started using OpenLDAP (openldap-2.4.21/debian) as an external authentication system for our website, which is being migrated from an external CMS to a new platform we're developing in house on Drupal. We have a 45K-user database and things haven't been going smoothly at all. The issues we've had are:
- LDAP crashing after a backup restore and needing to be recovered.
- The LDAP recovery tool being unable to recover the LDAP database on some occasions.
- slapd consuming 100% CPU while there is no authentication activity on the website.

Due to lack of resources and knowledge internally, all we've done so far is to find ways of keeping LDAP running without really investigating any of these issues (use monit to restart it when it crashes, db_recover to recover the db if needed, and slapcat to recreate the db from scratch when db_recover fails).

Recently we've had a round of interviews to hire a senior infrastructure engineer to help us with the various infrastructure issues we're running into. Several candidates confirmed they had either experienced or heard about issues with OpenLDAP in large production environments. They said they never managed to get a single standalone OpenLDAP server stable and instead had to build redundant deployments (replication, load balancing, auto-recovery/restart routines) to keep LDAP running. Some candidates even said that OpenLDAP just wasn't fit for production environments and that an alternative such as Novell eDirectory was necessary.

Q: If you have experience running LDAP in production environments with thousands of users, do you have facts to share that tend to prove OpenLDAP is indeed unstable for such setups and that using another LDAP server is indeed recommended?

Best Answer

I use OpenLDAP supporting a user-base of about 10,000 active users who rely on it throughout the day for everything. Problems are rare. Many services rely on it, for authentication and other things.

However, we have 4 read-only replicas (slaves/consumers) behind a load-balancer, a hidden master and a hot standby master. Used to be 2 front-end servers, but we had load problems during certain peak times (when 4,000 or so of those users were desperately trying to hit it at the same second). All write access to LDAP is via our code.

That equipment and OS is all old and we're working on replacing it with a new setup that will go back to only 2 replicas (that aren't doing as many other things) and "mirror mode" replication between a pair of masters in an HA configuration. Again, problems are rare.

We used to have some problems with replication failing, but that's mostly from when we were using slurpd instead of syncrepl. Also, unclean shutdowns of a server can corrupt the data.

Keys to running OpenLDAP in a large-scale production environment, in my experience:

  1. Somebody that understands LDAP and OpenLDAP well. Preferably more than one somebody.
  2. Somebody that understands all the other directly related parts of the infrastructure well.
  3. Somebody that understands how OpenLDAP replication works.
  4. A reasonable understanding of the BerkeleyDB options (or whatever backend you're using), since the defaults aren't quite right.
  5. Highly available slaves. More than 1. Better: really load-balanced.
  6. Active-passive masters (active-active master replication is inherently tricky)
  7. We back up LDAP data to LDIF every hour and keep a few days worth of those on disk. (the whole server gets backed up nightly)
  8. We have scripts to quickly bring a broken slave back to a clean current data replica
  9. We have scripts to quickly restore a broken master from the LDIF backups (via slapadd)
  10. We can quickly switch to the standby master. (scripts)
  11. We monitor that the replication connections are alive
  12. We monitor that the replication IDs are current on all slaves
  13. We monitor (less often) that the entire contents of the slaves match the master.

Basically, though, if it's a key part of your infrastructure, somebody on your team should really understand it well.
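As an illustration of the hourly LDIF backup with a few days of retention (item 7 above), here is a rough sketch. It is not our actual script: the file naming, the backup directory and the 3-day retention are assumptions chosen for the example. The only external command it relies on is slapcat with its -l option, which writes the dump to the given file.

```python
import subprocess
import time
from pathlib import Path


def dump_ldif(backup_dir, now=None):
    """Dump the directory to a timestamped LDIF file and return its path."""
    stamp = time.strftime("%Y%m%d-%H%M", time.localtime(now))
    target = Path(backup_dir) / f"ldap-{stamp}.ldif"
    # check=True raises if slapcat exits non-zero, so a failed dump
    # is not silently treated as a good backup.
    subprocess.run(["slapcat", "-l", str(target)], check=True)
    return target


def prune_old(backup_dir, keep_days=3, now=None):
    """Delete ldap-*.ldif files older than keep_days; return their names."""
    now = time.time() if now is None else now
    cutoff = now - keep_days * 86400
    removed = []
    for path in sorted(Path(backup_dir).glob("ldap-*.ldif")):
        if path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path.name)
    return removed
```

Run from cron once an hour, this would call dump_ldif() followed by prune_old() on the same directory. Restoring from one of these files is the slapadd step mentioned in item 9.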

Addendum: By request, the DB_CONFIG file from my openldap DB directory. Look at http://docs.oracle.com/cd/E17076_02/html/api_reference/C/configuration_reference.html for details.

set_cachesize 0 536870912 1
set_flags DB_TXN_NOSYNC
set_flags DB_TXN_WRITE_NOSYNC
set_lg_regionmax 268435456
set_lg_max 536870912
set_lg_bsize 134217728
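
The sizes above are given in bytes, which makes them hard to read at a glance. The short sketch below only does the arithmetic and converts each size setting to MiB; the function name is made up for the example. For set_cachesize the three arguments are gigabytes, additional bytes, and the number of cache segments, so the line above means a 512 MiB cache in a single segment. As far as I understand the Berkeley DB documentation, the two set_flags lines relax how the transaction log is flushed at commit, trading some durability of the most recent writes for speed, so check them against your own requirements before copying them.

```python
DB_CONFIG = """\
set_cachesize 0 536870912 1
set_flags DB_TXN_NOSYNC
set_flags DB_TXN_WRITE_NOSYNC
set_lg_regionmax 268435456
set_lg_max 536870912
set_lg_bsize 134217728
"""

MIB = 1024 * 1024


def sizes_in_mib(text):
    """Return {setting: size in MiB} for the size-valued DB_CONFIG lines."""
    sizes = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts or parts[0] == "set_flags":
            continue  # flags carry no size
        name, args = parts[0], [int(p) for p in parts[1:]]
        if name == "set_cachesize":
            # Arguments are: gigabytes, bytes, number of cache segments.
            gbytes, nbytes, _ncache = args
            total = gbytes * 1024 ** 3 + nbytes
        else:
            total = args[0]
        sizes[name] = total // MIB
    return sizes


print(sizes_in_mib(DB_CONFIG))
# {'set_cachesize': 512, 'set_lg_regionmax': 256, 'set_lg_max': 512, 'set_lg_bsize': 128}
```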