Reasons in favor of storing files in the database:
- ACID consistency, including the ability to roll back an update, which is complicated when the files are stored outside the database. This isn't to be glossed over. Having the files and database in sync and able to participate in transactions can be very useful.
- Files go with the database and cannot be orphaned from it.
- Backups automatically include the file binaries.
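To make the transactional point concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table names (`documents`, `audit_log`) and the `store_document` helper are hypothetical, chosen only for illustration. The file bytes and a related row are written in one transaction, so a failure in the second statement also undoes the file insert.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents ("
    "id INTEGER PRIMARY KEY, name TEXT NOT NULL, content BLOB NOT NULL)"
)
conn.execute(
    "CREATE TABLE audit_log ("
    "id INTEGER PRIMARY KEY, document_id INTEGER NOT NULL, note TEXT NOT NULL)"
)
conn.commit()


def store_document(conn, name, content, note):
    """Insert the file bytes and an audit row as one atomic unit."""
    # The connection context manager commits on success and
    # rolls back if any statement inside raises.
    with conn:
        cur = conn.execute(
            "INSERT INTO documents (name, content) VALUES (?, ?)",
            (name, content),
        )
        conn.execute(
            "INSERT INTO audit_log (document_id, note) VALUES (?, ?)",
            (cur.lastrowid, note),
        )
        return cur.lastrowid


store_document(conn, "report.pdf", b"%PDF-1.7 ...", "initial upload")

try:
    # note=None violates NOT NULL on audit_log, so the whole transaction,
    # including the BLOB insert, is rolled back.
    store_document(conn, "broken.pdf", b"data", None)
except sqlite3.IntegrityError:
    pass

document_count = conn.execute("SELECT COUNT(*) FROM documents").fetchone()[0]
audit_count = conn.execute("SELECT COUNT(*) FROM audit_log").fetchone()[0]
```

With files on disk, the equivalent would need explicit cleanup code to delete the file whenever the database write fails.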
Reasons against storing files in the database:
- The maximum size of a binary value differs between databases. On SQL Server, for example, when not using FILESTREAM, it is 2 GB. If users need to store larger files (say, a movie), you have to jump through hoops to make that happen.
- Increases the size of the database. One general concept you should take to heart: The level of knowledge required to maintain a database goes up in proportion to the size of the database. I.e., large databases are more complicated to maintain than small databases. Storing the files in the database can make the database much larger. Even if say a daily full backup would have sufficed, with a larger database size, you may no longer be able to do that. You may have to consider putting the files on a different file group (if the database supports that), tweak the backups to separate the backup of the data from the backup of the files etc. None of these things are impossible to learn, but do add complexity to maintenance which means cost to the business. Larger databases also consume more memory as they try to stuff as much data into memory as possible.
- Portability can be a concern if you use system-specific features like SQL Server's FILESTREAM and need to migrate to a different database system.
- The code that writes the files to the database can be a problem. One company for whom I consulted not so many moons ago at some point connected a Microsoft Access frontend to their database server and used Access' ability to upload "anything" using its OLE Object control. Later they changed to a different control which still relied on OLE. Much later someone changed the interface to store the raw binary. Extracting those OLE Objects was a new level of hell. When you store files on the file system, there isn't an additional layer involved to wrap/tweak/alter the source file.
- It is more complicated to serve up the files to a website. In order to do it with binary columns, you have to write a handler to stream the file binary from the database. You can also use a handler when you store file paths, but you are not forced to. Again, adding a handler is not impossible but adds complexity and is another point of failure.
- You cannot take advantage of cloud storage. Suppose one day you want to store your files in an Amazon S3 bucket. If what you store in the database are file paths, you can change those to point at S3. As far as I'm aware, there is no equivalent when the file binaries themselves live in the database, with any DBMS.
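As a rough illustration of the cloud storage point, the sketch below stores a location string per file and later rewrites a local prefix to an S3 one. Everything named here is an assumption for the example: the `files` table, the `/srv/uploads/` directory, and the `example-bucket` bucket. Copying the actual objects to S3 is a separate step that is not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE files (id INTEGER PRIMARY KEY, name TEXT, location TEXT)"
)
conn.executemany(
    "INSERT INTO files (name, location) VALUES (?, ?)",
    [
        ("intro.mp4", "file:///srv/uploads/intro.mp4"),
        ("logo.png", "file:///srv/uploads/logo.png"),
    ],
)

# Hypothetical prefixes: the old local directory and the new bucket.
OLD_PREFIX = "file:///srv/uploads/"
NEW_PREFIX = "s3://example-bucket/uploads/"

with conn:
    # Swap the prefix on every row that still points at the old location.
    # SQLite's substr() is 1-indexed, hence the + 1.
    conn.execute(
        "UPDATE files SET location = ? || substr(location, ?) "
        "WHERE substr(location, 1, ?) = ?",
        (NEW_PREFIX, len(OLD_PREFIX) + 1, len(OLD_PREFIX), OLD_PREFIX),
    )

locations = [
    row[0] for row in conn.execute("SELECT location FROM files ORDER BY id")
]
```

The application then only needs a resolver that understands both `file://` and `s3://` locations; the schema itself does not change.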
IMO, deeming the storage of files in the database or not as "bad" requires more information about the circumstances and requirements. Are the size and/or number of files always going to be small? Are there no plans to use cloud storage? Will the files be served up on a website or a binary executable like a Windows application?
In general, my experience has been that storing paths is less expensive to the business, even accounting for the lack of ACID and the possibility of orphans. That does not mean the internet isn't full of stories about the lack of ACID control going wrong with file storage, but it does mean that, in general, that solution is easier to build, understand and maintain.
Note: if you stated the purpose of your file container more clearly, describing access patterns, desired platforms, and the problem you're solving in general, the answers might be better.
Your description looks awfully similar to a game resource file, a well-known file type. These files are not intended to be updated frequently (if ever), but are optimized for fast seeking and reading.
There are several known implementations: for instance, id Software used WAD files and PAK files, but finally came to use ZIP files.
Many applications use various derivatives of the IFF format, which is built from self-describing chunks and, with some care, can be updated efficiently.
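For readers unfamiliar with chunked formats, here is a simplified sketch of the IFF chunk layout: a 4-byte ID, a 4-byte big-endian length, the payload, and a pad byte when the payload length is odd. Real IFF files wrap chunks in an outer `FORM` container, which this example leaves out. The function names and the `NAME`/`BODY` chunk IDs are made up for illustration.

```python
import io
import struct


def write_chunk(stream, chunk_id, payload):
    """Write one chunk: 4-byte ID, big-endian 32-bit size, payload, pad."""
    if len(chunk_id) != 4:
        raise ValueError("chunk ID must be exactly 4 bytes")
    stream.write(struct.pack(">4sI", chunk_id, len(payload)))
    stream.write(payload)
    if len(payload) % 2:
        # Chunks are aligned to even offsets; the pad byte is not
        # counted in the size field.
        stream.write(b"\x00")


def read_chunks(stream):
    """Yield (chunk_id, payload) pairs until the stream is exhausted."""
    while True:
        header = stream.read(8)
        if len(header) < 8:
            return
        chunk_id, size = struct.unpack(">4sI", header)
        payload = stream.read(size)
        if size % 2:
            stream.read(1)  # skip the pad byte
        yield chunk_id, payload


buffer = io.BytesIO()
write_chunk(buffer, b"NAME", b"level1")
write_chunk(buffer, b"BODY", b"abc")
total_size = buffer.tell()

buffer.seek(0)
chunks = list(read_chunks(buffer))
```

Because every chunk announces its own length, a reader can skip chunks it does not understand by seeking past them instead of reading the payload.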
In a chunked file, or a zip file (consider zero compression), you can encrypt each entry independently, before writing it into the file. Provided that you use a block cipher like AES-256, your data does not change size, except for padding it to a block boundary. You definitely want your decryption key stored somewhere else :)
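A small sketch of the container side of this idea, using Python's standard `zipfile` module with zero compression (`ZIP_STORED`). The encryption step itself is deliberately not shown, since AES is not in the Python standard library; the entry below is a placeholder standing in for ciphertext. The size helper assumes PKCS#7 padding, which is one common choice and not something the text above specifies; a cipher mode that needs an IV or nonce would add a few more bytes per entry.

```python
import io
import zipfile

BLOCK_SIZE = 16  # AES block size in bytes, independent of key length


def padded_length(plain_length):
    """Size after PKCS#7 padding (assumed scheme): always adds 1..16 bytes."""
    return (plain_length // BLOCK_SIZE + 1) * BLOCK_SIZE


def pack_entries(entries):
    """Store already-encrypted entries uncompressed in a zip container."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_STORED) as archive:
        for name, data in entries.items():
            archive.writestr(name, data)
    return buffer.getvalue()


def read_entry(container, name):
    """Fetch a single entry without touching the others."""
    with zipfile.ZipFile(io.BytesIO(container)) as archive:
        return archive.read(name)


# Placeholder bytes standing in for the ciphertext of a 10-byte resource.
fake_ciphertext = b"\x01" * padded_length(10)
container = pack_entries({"textures/wall.bin": fake_ciphertext})
entry = read_entry(container, "textures/wall.bin")
```

Storing entries uncompressed makes sense here because well-encrypted data does not compress anyway, and it keeps each entry readable with a single seek.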
Writing a file system to provide encryption is not only possible, but has actually been done many times. For instance, Linux has encfs, and Windows has encrypted folders. TrueCrypt is an advanced virtual encrypted FS available on many platforms.
Please note that while zip files provide a sort of native encryption, this encryption is relatively weak. The same applies to rar files, as far as I know.
Where to store the files?
The question of whether or not to store the files in the database has to be considered from several angles:
Consistency: storing metadata and the files (as BLOBs) together in the database ensures that what belongs together remains together. No fear of inconsistency if an insertion is interrupted, and no separate storage location to manage with an absolute or relative URL in the database.
Convenience: you can move/backup/replicate/monitor your database using just the database tools. With separate files, you have to organize all these operations yourself. It is not necessarily difficult, but you have to take care of it.
Security: most DBMSs offer some authorization mechanisms for user access, and even encryption if needed. So having the files in the DBMS ensures that nobody tampers with them, and only those having the necessary DB privileges can access them. With separate files outside the database, it's much more difficult to organize this (unless you're on a server and the clients can't directly access the folders).
Performance: this is something that you have to check carefully with the DBMS that you choose: the API for accessing BLOBs might require some overhead to transfer data from/to the database in smaller chunks. So you need to be careful to request the object only when necessary. With files in the file system, it's faster to access the raw data when it is needed. However, with so many files, you might have to distribute them across several folders, so as not to suffer from the cost of searching for a filename in a huge directory.
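One common way to distribute files across folders is to derive the subdirectories from a hash of the file name. The sketch below is only one possible scheme; the function name, the `/srv/files` root, and the choice of two levels of two hex characters each are assumptions for illustration.

```python
import hashlib
from pathlib import PurePosixPath


def sharded_path(root, filename, levels=2, width=2):
    """Map a file name to root/<h0>/<h1>/filename using a SHA-256 prefix."""
    digest = hashlib.sha256(filename.encode("utf-8")).hexdigest()
    # Each level uses the next `width` hex characters of the digest,
    # giving 16**width possible folders per level.
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return PurePosixPath(root, *parts, filename)


path = sharded_path("/srv/files", "photo.jpg")
same_path = sharded_path("/srv/files", "photo.jpg")
```

With the defaults this spreads files over up to 65,536 leaf folders, and the location can always be recomputed from the file name alone, so it does not need to be stored separately.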
Resources: if you are considering an in-memory database to accelerate your "semantic" work on metadata, it would be very costly to also store all the raw data in memory. There, separate files could be a real advantage.
Not knowing what exactly your application is doing, it wouldn't be wise to advise you firmly on one way or the other.
Real life examples
So in practice, both approaches will work. The first is based on standard software products. The second was developed in-house. From the point of view of performance, they are very similar because images are accessed from the client (i.e. the potential overhead of BLOB management on the DB side is offset by the overhead of an additional transfer through the additional web server).
Relational or not?
If you go relational, you may want to manage:
There is no doubt that performance and flexibility will be there, because searching for keys, merging several searches, etc. is the core business of an RDBMS. But you'll have to work out how best to structure the metadata.
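As one possible way to structure such metadata relationally, the sketch below keeps file paths in one table and keywords in another, then "merges several searches" by requiring every requested keyword to match. The schema, the sample rows, and the `find_files` helper are all hypothetical; a real design depends on what metadata the application actually has.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE files (
        id   INTEGER PRIMARY KEY,
        path TEXT NOT NULL UNIQUE
    );
    CREATE TABLE keywords (
        file_id INTEGER NOT NULL REFERENCES files(id),
        keyword TEXT NOT NULL,
        PRIMARY KEY (file_id, keyword)
    );
    CREATE INDEX idx_keywords_keyword ON keywords(keyword);
    """
)
conn.executemany(
    "INSERT INTO files (id, path) VALUES (?, ?)",
    [(1, "a/beach.jpg"), (2, "a/city.jpg"), (3, "b/harbor.jpg")],
)
conn.executemany(
    "INSERT INTO keywords (file_id, keyword) VALUES (?, ?)",
    [(1, "sea"), (1, "summer"), (2, "summer"), (3, "sea"), (3, "summer")],
)
conn.commit()


def find_files(conn, keywords):
    """Return paths of files tagged with ALL of the given keywords."""
    placeholders = ",".join("?" for _ in keywords)
    query = f"""
        SELECT f.path
        FROM files AS f
        JOIN keywords AS k ON k.file_id = f.id
        WHERE k.keyword IN ({placeholders})
        GROUP BY f.id
        HAVING COUNT(DISTINCT k.keyword) = ?
        ORDER BY f.path
    """
    return [row[0] for row in conn.execute(query, (*keywords, len(set(keywords))))]


matches = find_files(conn, ["sea", "summer"])
```

The index on `keyword` is what lets the database answer this kind of query without scanning every file row.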
You could also opt for a NoSQL database. They are more flexible about data structure. Intuitively, I'd suggest starting with a look at document databases. If, however, you prefer to keep the files out of the database, you might be more interested in a key-value store, or even a wide-column store if you manage different kinds of keywords for different kinds of metadata.