Most open source content management systems use a database to store field data and files on the server for configuration settings. Why is this model chosen so often, and why would it not be as effective to store the field data in a file on the server as well? Would this change if the goal were a large, professional site with lots of data?
Web Development – Why Do Many CMS Use a Database Instead of Local Files?
data-structures, design, web-development
Related Solutions
Have you looked at other CMS and how they approach the problem?
For example, Drupal's configuration system addresses all of your current needs and shows one possible way to avoid being limited by the constraints of INI files.
You may also consider the following (a brief sketch of loading the file-based formats appears after the list):
- Ordinary PHP files (pros: simple, no parser needed; cons: the end user can corrupt the file and break the entire app),
- INI (pros: still simple; cons: the flat structure is limiting),
- XML (pros: no limits of INI; cons: not user-friendly, too much markup),
- JSON (pros: less markup compared to XML; cons: still not user-friendly),
- YAML (pros: human-readable; cons: the end user can still break the structure),
- A custom page which allows the user to configure every aspect of the application through a web interface (pros: user-friendly; cons: the excessive cost of developing the solution).
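For comparison, here is roughly what loading the simpler file-based options looks like in PHP; this is a minimal sketch, and the file names are hypothetical:

```php
<?php
// Plain PHP file: config.php just returns an array, so no parser is needed,
// but a stray syntax error in it takes the whole application down.
$config = require __DIR__ . '/config.php';

// INI file: trivially parsed, but the structure is essentially flat
// (at most one level of sections).
$config = parse_ini_file(__DIR__ . '/config.ini', true);

// JSON file: nesting is possible, but hand-editing is error-prone.
$config = json_decode(file_get_contents(__DIR__ . '/config.json'), true);

// YAML needs an extension or library, e.g. the yaml PECL extension:
// $config = yaml_parse_file(__DIR__ . '/config.yaml');
```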
If you're working with end users who don't have a technical background, the last solution is the only viable one.
Applications like phpBB show how to implement a two-step configuration: when the application runs for the first time, it executes setup.php, which asks for the connection string, the root password, and so on. Once the app is installed, the administrator can log in to access the administration panel.
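A minimal sketch of that pattern, assuming a config.php written by the installer (the file names and keys are hypothetical, not phpBB's actual code):

```php
<?php
$configFile = __DIR__ . '/config.php';

// Step 1: first run. No config exists yet, so send the visitor to the
// installer; setup.php asks for the connection string, admin password, etc.,
// then writes config.php and typically disables itself.
if (!file_exists($configFile)) {
    header('Location: /setup.php');
    exit;
}

// Step 2: normal operation. Read the config the installer wrote.
$config = require $configFile;
$pdo = new PDO($config['dsn'], $config['db_user'], $config['db_password']);
```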
Storing files in memory doesn't seem such a bad idea if there are only a few of them. 1 MB of data isn't very much on modern servers, so it all comes down to how many simultaneous users you have and what happens when you run out of memory.
In my experience I try to reduce the memory footprint of each web service request, because the server is usually a better judge of caching and of allocating memory between processes. It can also be difficult to judge, or even test, how an application scales as the number of users increases and the memory pool shrinks rapidly as a result. If you go down this route, I'd recommend very careful load testing if memory is even slightly under stress.
Also, if you keep the pre-checked file in memory only, then debugging and auditing the upload/edit/save cycle becomes harder, because you have no record of what the file looked like before it was reviewed and saved. If you ever have an issue with the editing code (or a user disputes what your editing code does), you can't trace back and look at the code's effect, even over a short period of hours or days. In-memory processing requires more effort spent on debug/audit tooling.
Alternatively, two possible suggestions:
Have you considered running a second database instance with staging/temporary working tables in that database only? This would keep your main database 'clean' and would keep (possibly expensive) deletion and post-deletion optimisation off your main database too. You could also store a 'purge' time on the table so that a scheduled job deletes old tables in the case of a no-show, i.e. a user who never clicks the 'save' button. (They'd be prompted to re-upload their file if they left the gap between upload and save beyond a sensible threshold.)
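A rough sketch of the staging-table idea; the host, credentials, table, and column names here are all made up:

```php
<?php
// Connect to the secondary "staging" database instance, not the main one.
$staging = new PDO('mysql:host=staging-db;dbname=staging', 'app', 'secret');

$userId       = 42; // whoever is currently editing
$fileContents = file_get_contents($_FILES['upload']['tmp_name']);

// Store the upload together with the time after which it may be purged.
$stmt = $staging->prepare(
    'INSERT INTO pending_uploads (user_id, contents, purge_after)
     VALUES (:user, :contents, NOW() + INTERVAL 1 HOUR)'
);
$stmt->execute([':user' => $userId, ':contents' => $fileContents]);

// A scheduled job (cron) later removes uploads whose owner never clicked "save":
// DELETE FROM pending_uploads WHERE purge_after < NOW();
```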
If you're not expecting too many concurrent users of this functionality, then you could also consider using an in-memory cache (such as memcached) to store the file whilst it is being reviewed. A 1 GB memcached instance would store up to a thousand of your files simultaneously, so unless you were under very heavy load, this would save you having to prompt the user for a re-upload because the cache had flushed out their file. Note that this solution doesn't necessarily preclude a good audit trail: if you choose to audit, you can add a simple script that reads memcached entries and writes them to disk for later audit without affecting live code.
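Sketched with PHP's Memcached extension; the key naming and the one-hour TTL are arbitrary choices, not requirements:

```php
<?php
$cache = new Memcached();
$cache->addServer('127.0.0.1', 11211);

$userId       = 42; // whoever is currently editing
$fileContents = file_get_contents($_FILES['upload']['tmp_name']);

// Keep the uploaded file in the cache while the user reviews it.
// The TTL doubles as the "please re-upload" threshold mentioned above.
$key = 'pending-upload:' . $userId;
$cache->set($key, $fileContents, 3600);

// Later, when the user clicks "save":
$reviewed = $cache->get($key);
if ($reviewed === false) {
    // The entry expired or was evicted: prompt the user to upload again.
}

// Optional audit trail: a separate script can copy entries to disk
// without touching the live code path, e.g.
// file_put_contents("/var/audit/{$userId}.bin", $cache->get($key));
```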
Best Answer
The filesystem is a database, specifically a kind of hierarchical key-value store with a bit of added metadata, depending on the file system.
Using a filesystem is often appropriate, especially for large records. (Filesystems often have a minimum record size on the order of kilobytes.) When used correctly, filesystems have atomicity and durability guarantees similar to a database. In particular, both databases and journaling filesystems use write-ahead logging.
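For example, a common way to get an atomic single-file update is to write to a temporary file and rename it over the old one; this is a sketch with error handling omitted, and the paths are made up:

```php
<?php
// Write the new contents to a temporary file in the same directory,
// then atomically replace the old file. rename() within one filesystem
// is atomic on POSIX systems, so readers see either the old or the new file.
$target = '/var/data/record.json';
$tmp    = $target . '.tmp';

$fh = fopen($tmp, 'w');
fwrite($fh, json_encode(['title' => 'Hello', 'body' => '...']));
fflush($fh);   // flush PHP's buffers; the journal handles durability from here
fclose($fh);

rename($tmp, $target);
```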
But compared with other databases, filesystems have four severe limitations:
- A file system cannot be shared between multiple servers, whereas multiple servers may connect to a single database server.
- In a filesystem, files are only indexed by their path. You cannot add secondary indexes, unless you add another path that is a soft link to the primary key or a hard link to the contents. When you change a file, you must manually update these indices.
- While operations on single files can be performed atomically, it is not possible to create transactions that involve multiple files. The normal workaround is to have a lock file (see the sketch after this list). When a process obtains a lock on this file, it has exclusive read and write access. Note that during the lock the files may be in an inconsistent state, so no other reads should occur. If the process crashes, it might leave the file system in an inconsistent state. In contrast, an RDBMS has proper transactions that can roll back changes on failure, and can let other clients read the pre-transaction state.
- The file system has no means to ensure consistency of the data. For example, foreign key constraints are entirely impossible. This makes it much more difficult to model your data, and much more difficult to write correct applications that read and write that data.
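The lock-file workaround might look like this in PHP; it is a sketch under the assumption that all writers cooperate, and the file names are arbitrary:

```php
<?php
$postsJson = json_encode([['id' => 1, 'title' => 'Hello']]);
$tagsJson  = json_encode([['id' => 1, 'name' => 'intro']]);

// Acquire an exclusive lock before touching the group of related files.
$lock = fopen('/var/data/.lock', 'c');
if (flock($lock, LOCK_EX)) {
    // Exclusive access: update the related files as one pseudo-transaction.
    // If the process crashes between these writes, the files stay
    // inconsistent; unlike a real database transaction, nothing rolls back.
    file_put_contents('/var/data/posts.json', $postsJson);
    file_put_contents('/var/data/tags.json', $tagsJson);

    flock($lock, LOCK_UN);
}
fclose($lock);
```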
A less severe limitation is that file systems may be slower than databases. But this depends a lot on the use case and the file system. E.g. older file systems such as ext2 have severely degraded performance when a directory contains too many entries.
Some applications are a good fit for using the file system as a database. For example, Git does this quite effectively but also has a number of unique properties, such as using immutable records.
For most applications, a RDBMS is preferable: you get a lot of guarantees (e.g. durability and consistency) for almost free, without having to implement them in your application. You can model relations in your data, not just a key-value mapping. Of course these features have a cost, but the cost is unnoticeable for the majority of applications.
So when there are multiple processes, a RDBMS server is preferable.
If you only have a single process, you might still like the consistency guarantees of a RDBMS. But instead of using a separate server you can use an embedded database engine such as SQLite. This can be a lot faster than a database server since your queries aren't transmitted over a network. And it can be faster than using the file system: instead of opening multiple files for each data access, the database engine just skips around in a single file.
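With PHP's PDO SQLite driver, for instance, the whole database lives in one file and queries never leave the process. A minimal sketch (the table is invented for illustration):

```php
<?php
// One file on disk, no server process, no network round trip.
$db = new PDO('sqlite:' . __DIR__ . '/content.db');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$db->exec('CREATE TABLE IF NOT EXISTS posts (
    id    INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    body  TEXT NOT NULL
)');

$stmt = $db->prepare('INSERT INTO posts (title, body) VALUES (?, ?)');
$stmt->execute(['Hello', 'First post']);

foreach ($db->query('SELECT id, title FROM posts') as $row) {
    echo $row['id'], ': ', $row['title'], PHP_EOL;
}
```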
Most CMSes are implemented in PHP. The PHP request model (short-lived processes, and many processes may be running in parallel) does not lend itself to embedded databases such as SQLite and therefore also makes it difficult to use the file system correctly as a simple DB. Instead, external database servers such as MariaDB or PostgreSQL are preferable.
If a system stores its data in a database, it is not necessary to also store configuration in a database. Typically the configuration is just read once at startup and is not modified by the application. The configuration data also typically has few relations, compared with the problem domain concepts of a CMS, e.g. the relations between posts, tags, users, access rights, ….