Database vs Text File – When to Prefer Database Over Parsing Data from a Text File?

databasedesignpythonstorage

I was making a Python program to measure the growth of codereview.SE. My approach was to get the "Site stats" shown on the front page and store them on my hard drive. I plan to do this once every day. So far I have made enough to get the stats and append them to a text file. The python script can be viewed on github. The format I am using is the following

22-08-2013

questions 9073
answers 15326
answered 88
users 26102
visitors/day 7407

22-08-2013

questions 9073
answers 15326
answered 88
users 26102
visitors/day 7407

I just ran the script twice to get the format I would be using in the file. Initially this seemed good to me because I would be storing it myself and the format would be the same so it would be easily parsed but not I am not sure. It seems that using a database should be a better here because that way retrieving data should be easier. Just a note, I have never used any database and have no knowledge of SQL, MySQL or any other variants of RDBMS.

So this brings me to the question. When should a database be preferred for storing the data over storing the data in a text file? Are there some pointers that I can look for when making decisions about whether I need a database or simple text files?

PS: If better tags can be added please do so. I had some doubts about the tags which could be added.

Best Answer

When should a database be preferred for storing the data over storing the data in a text file?

Wikipedia tells us that a database is an organized collection of data. By that measure, your text file is a database. It goes on to say:

The data are typically organized to model relevant aspects of reality in a way that supports processes requiring this information. For example, modeling the availability of rooms in hotels in a way that supports finding a hotel with vacancies.

That part is subjective -- it doesn't tell us specifically how the data should be modeled or what operations need to be optimized. Your text file consists of a number of distinct records, one for each day, so you're modeling an aspect of reality in a way that's relevant to your problem.

I realize that when you say "database" you're probably thinking of some sort of relational database management system, but thinking of your text file as a database changes your question from "when should I use a database?" to "what kind of database should I use?" Seeing things in that light makes the answer easier to see: use a better database when the one you've got no longer meets your requirements.

If your Python script and simple text file work well enough, there's no need to change. With only one new record per day and computers getting faster each year, I suspect that your current solution could be viable for a long time. A decade's worth of data would give you only 3650 records that, once parsed, would probably require less than 75 kilobytes.

Imagine that instead of one small record per day, you decided to record every question asked on CodeReview, who asked it, and when. Furthermore, you also collect all the answers and the relevant metadata. You could store all that in a text file, but a flat file would make it difficult to find information when you needed it. There'd be too much data to read the whole thing into memory, so whenever you wanted to find a question or answer, you'd have to scan through the file until you found what you were looking for. When you wanted to find all the questions asked by a given user, you'd have to scan through the entire file. If you wanted to find all the questions that have "bugs" as a tag, you'd have to scan through the file.

That'd be horribly slow, so you might decide to speed things up by building some indexes that tell you where to look in the file to find a given record. You could have an index for questions, another for users, a third for answers, and so on. When you wanted to find a question you'd search the (much smaller) question index, get the position of the question in the main data file, and jump quickly to the right spot in the file. That'd be a big performance improvement. Indeed, that's pretty much what a database management system is.

So, use a DBMS when it's what you need. Use it when you have a lot of data, when you need to be able to access that data quickly and perhaps in ways that you can't entirely predict at the outset. If you have different kinds of data -- different types of records -- that are connected to each other, use a RDBMS so that you can relate the various records appropriately.

Related Topic