When should a database be preferred for storing the data over storing the data in a text file?
Wikipedia tells us that a database is an organized collection of data. By that measure, your text file is a database. It goes on to say:
> The data are typically organized to model relevant aspects of reality in a way that supports processes requiring this information. For example, modeling the availability of rooms in hotels in a way that supports finding a hotel with vacancies.
That part is subjective -- it doesn't tell us specifically how the data should be modeled or what operations need to be optimized. Your text file consists of a number of distinct records, one for each day, so you're modeling an aspect of reality in a way that's relevant to your problem.
I realize that when you say "database" you're probably thinking of some sort of relational database management system, but thinking of your text file as a database changes your question from "when should I use a database?" to "what kind of database should I use?" Seeing things in that light makes the answer easier to see: use a better database when the one you've got no longer meets your requirements.
If your Python script and simple text file work well enough, there's no need to change. With only one new record per day and computers getting faster each year, I suspect that your current solution could be viable for a long time. A decade's worth of data would give you only 3650 records that, once parsed, would probably require less than 75 kilobytes.
Imagine that instead of one small record per day, you decided to record every question asked on CodeReview, who asked it, and when. Furthermore, you also collect all the answers and the relevant metadata. You could store all that in a text file, but a flat file would make it difficult to find information when you needed it. There'd be too much data to read the whole thing into memory, so whenever you wanted to find a question or answer, you'd have to scan through the file until you found what you were looking for. When you wanted to find all the questions asked by a given user, you'd have to scan through the entire file. If you wanted to find all the questions that have "bugs" as a tag, you'd have to scan through the file.
That'd be horribly slow, so you might decide to speed things up by building some indexes that tell you where to look in the file to find a given record. You could have an index for questions, another for users, a third for answers, and so on. When you wanted to find a question you'd search the (much smaller) question index, get the position of the question in the main data file, and jump quickly to the right spot in the file. That'd be a big performance improvement. Indeed, that's pretty much what a database management system is.
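That index-plus-seek idea can be sketched in a few lines of Python. This is only an illustration, not how a real DBMS is built; the filename and the one-record-per-line `id<TAB>body` layout are made up for the example:

```python
# Build a byte-offset index over a flat file of records, one per line,
# where each line starts with a record id followed by a tab.
# The filename and record layout here are hypothetical.

def build_index(path):
    index = {}
    offset = 0
    with open(path, "rb") as f:
        for line in f:
            record_id = line.split(b"\t", 1)[0].decode()
            index[record_id] = offset   # remember where this record starts
            offset += len(line)
    return index

def lookup(path, index, record_id):
    # Jump straight to the record instead of scanning the whole file.
    with open(path, "rb") as f:
        f.seek(index[record_id])
        return f.readline().decode().rstrip("\n")
```

The index is small enough to keep in memory even when the data file is not, which is exactly the trade a DBMS makes for you.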
So, use a DBMS when it's what you need. Use it when you have a lot of data, when you need to access that data quickly, and perhaps in ways that you can't entirely predict at the outset. If you have different kinds of data -- different types of records -- that are connected to each other, use an RDBMS so that you can relate the various records appropriately.
I don't think it is a great idea to preload 2GB of log data. That would make for an unbearable user experience, and if you run this thing on a production server you will set off a bunch of alarms in the NOC.
I would focus on keeping a small memory footprint and reading as little of the file as possible. There are ways to search the file without actually loading it all. Some common use cases:
User wants to see log data from a certain portion of the file, as indicated by dragging a scrollbar.
- Open a file stream on the file and compute the target of a Seek operation by multiplying [total file length] * [scrollbar percentage].
- Seek the target location
- Read until the next newline character; discard
- Start reading and displaying records until the screen is full
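The steps above can be sketched in Python; the filename is hypothetical and records are assumed to be newline-delimited:

```python
# Scrollbar case: seek to a fraction of the file, discard the (probably
# partial) line at the landing point, then read whole records from there.

def records_at(path, fraction, count):
    with open(path, "rb") as f:
        f.seek(0, 2)                       # seek to end to learn total length
        target = int(f.tell() * fraction)  # [total file length] * [scrollbar %]
        f.seek(target)
        if target > 0:
            f.readline()                   # read to the next newline; discard
        return [f.readline().decode().rstrip("\n") for _ in range(count)]
```

Note that nothing here depends on the file's total size: the cost is one seek plus reading only as many records as fit on screen.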
User wants to see log data for a certain time period
- Open the file stream on the file
- Divide the file in half; seek the halfway point
- Skip to the next newline character, then read the next record
- Parse the date/time stamp.
- If the date/time is just right, stop
- If the date/time is too low, repeat the above operations on the second half of the file (so in step 2 you're actually finding the 3/4th mark)
- If the date/time is too high, repeat the above operations on the first half (so in step 2 you're actually finding the 1/4th mark)
- Keep recursing until you find the desired time range. This is known as a binary search.
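Here is a sketch of that search in Python, written iteratively rather than recursively. It assumes the log lines are sorted and each begins with a sortable timestamp such as `2024-01-02 12:00:00`; the filename and stamp format are assumptions:

```python
# Binary search over a sorted log file without loading it into memory.
# Returns the first record whose leading timestamp is >= the one asked for
# (or "" if no record qualifies).

def first_record_at_or_after(path, stamp):
    with open(path, "rb") as f:
        first = f.readline().decode()
        if first[:len(stamp)] >= stamp:
            return first.rstrip("\n")      # the very first line already matches
        f.seek(0, 2)                       # learn the total file length
        lo, hi = 0, f.tell()
        # Find the smallest offset whose *next* full line has stamp >= target.
        while lo < hi:
            mid = (lo + hi) // 2
            f.seek(mid)
            f.readline()                   # skip to the next newline
            line = f.readline().decode()
            if not line or line[:len(stamp)] >= stamp:
                hi = mid                   # answer lies at or before mid
            else:
                lo = mid + 1               # answer lies after mid
        f.seek(lo)
        f.readline()
        return f.readline().decode().rstrip("\n")
```

Each probe reads at most two lines, so even a multi-gigabyte log needs only a few dozen seeks.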
Plaintext is binary.
When you write an `H` to a hard drive, the write head doesn't carve two vertical lines and a horizontal line into the platter; it magnetically encodes the bits `01001000`¹ into the platter. From there, it should be obvious that storing plain text data takes up exactly the same amount of space as storing binary data.
But plaintext is just one² particular binary format
Plaintext can be reversibly transformed into other binary formats. One common transformation is compression, which usually results in a more compact representation, meaning fewer bits are used to represent the same information.
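For instance, Python's standard `zlib` module shows both halves of that claim on some repetitive, log-like plaintext (the sample text is made up):

```python
import zlib

# Compression is a reversible transformation to a (usually) more compact
# binary form: nothing is lost, but fewer bits are stored.
text = b"2024-01-01 INFO everything is fine\n" * 1000

packed = zlib.compress(text)

assert zlib.decompress(packed) == text   # reversible: the exact bytes return
assert len(packed) < len(text)           # compact: far fewer bits on disk
```

How much smaller `packed` is depends entirely on how repetitive the input is; random bytes would barely shrink at all.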
Depending on what you're using the plaintext to represent, you may be able to use different binary formats to encode the same information. This may use more space, or it may use less.
For example, the numbers `5` and `1234567` could be represented in plaintext using digit characters, resulting in these bit sequences on disk³:

    00110101 00000000

    00110001 00110010 00110011 00110100 00110101 00110110 00110111 00000000

Alternatively, you could use 32-bit two's complement:

    00000000 00000000 00000000 00000101

    00000000 00010010 11010110 10000111

which is a less compact representation of `5`, but a more compact representation of `1234567`. And there are literally infinitely many other representations with varying levels of compactness and flexibility, although in practice far fewer than that are actually used.
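The size difference between the two representations of `5` and `1234567` can be checked with Python's standard `struct` module; the trailing null byte mirrors the length-marker convention described in footnote 3:

```python
import struct

# Plaintext digits plus a null-byte terminator, vs. a fixed-size
# 32-bit two's complement integer.
def as_text(n):
    return str(n).encode("utf-8") + b"\x00"   # digit characters + null marker

def as_int32(n):
    return struct.pack("<i", n)               # always exactly 4 bytes

assert len(as_text(5)) == 2         # '5' + null: 2 bytes
assert len(as_int32(5)) == 4        # less compact for small numbers
assert len(as_text(1234567)) == 8   # seven digits + null: 8 bytes
assert len(as_int32(1234567)) == 4  # more compact for large numbers
```

The crossover comes at five digits: any integer of five or more decimal digits (up to the 32-bit limit) is smaller as a fixed-size integer than as terminated text.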
¹ Assuming UTF-8. The exact sequence of bits for a character depends on which specific encoding you're using.
² Or really, several formats, given the various encodings.
³ If you're wondering what those eight zeros on the ends are, well, you need some way of knowing how long the data is. The options basically boil down to a marker (I used this, via a null byte), space dedicated to storing the length (Pascal used a byte to store the length of a string), or a fixed size (used in the subsequent two's complement example).