Web Development – Why Do Many CMS Use a Database Instead of Local Files?

data structuresdesignweb-development

Most open source Content Management Systems use a database to store field data and then use files in the server's directory for config settings. Why is this model chosen so often and why would it not be as effective to maintain a file on the server to store field data in? Would this change if the goal is a large, professional site with lots of data?

Best Answer

The filesystem is a database, specifically a kind of hierarchical key-value store with a bit of added metadata, depending on the file system.

Using a filesystem is often appropriate, especially for large records. (Filesystems often have a minimum record size on the order of kilobytes.) When used correctly, filesystems have atomicity and durability guarantees similar to a database. In particular, both databases and journaling filesystems use write-ahead logging.

But compared with other databases, filesystems have four severe limitations:

  1. A file system cannot be shared between multiple servers. But multiple servers may connect to a single database server.

  2. In a filesystem, files are only indexed by their path. You cannot add secondary indexes, unless you add another path that is a soft link to the primary key or a hard link to the contents. When you change a file, you must manually update these indices.

  3. While operations on single files can be performed atomically, it is not possible to create transactions that involve multiple files. The normal workaround is to have a lock file. When a process obtains a lock on this file, they have exclusive read and write access. Note that during the lock the files may be in an inconsistent state, so no other reads should occur. When the process crashes it might leave the file system in an inconsistent state. In contrast, RDBMS have proper transactions that can roll back changes on failure, and can allow reads of the pre-transaction state for other clients.

  4. The file system has no means to ensure consistency of the data. For example, foreign key constraints are entirely impossible. This makes it much more difficult to model your data, and much more difficult to write correct applications that read and write that data.

A less severe limitation is that file systems may be slower than databases. But this depends a lot on the use case and the file system. E.g. older file systems such as ext2 have severely degraded performance when a directory contains too many entries.

Some applications are a good fit for using the file system as a database. For example, Git does this quite effectively but also has a number of unique properties, such as using immutable records.

For most applications, a RDBMS is preferable: you get a lot of guarantees (e.g. durability and consistency) for almost free, without having to implement them in your application. You can model relations in your data, not just a key-value mapping. Of course these features have a cost, but the cost is unnoticeable for the majority of applications.

So when there are multiple processes, a RDBMS server is preferable.

If you only have a single process, you might still like the consistency guarantees of a RDBMS. But instead of using a separate server you can use an embedded database engine such as SQLite. This can be a lot faster than a database server since your queries aren't transmitted over a network. And it can be faster than using the file system: instead of opening multiple files for each data access, the database engine just skips around in a single file.

Most CMSes are implemented in PHP. The PHP request model (short-lived processes, and many processes may be running in parallel) does not lend itself to embedded databases such as SQLite and therefore also makes it difficult to use the file system correctly as a simple DB. Instead, external database servers such as MariaDB or PostgreSQL are preferable.

If a system stores its data in a database, it is not necessary to also store configuration in a database. Typically the configuration is just read once at startup and is not modified by the application. The configuration data also typically has few relations, compared with the problem domain concepts of a CMS, e.g. the relations between posts, tags, users, access rights, ….

Related Topic