Data Storage – Does Storing Plain Text Data Take Up Less Space Than Binary?

Tags: binary, compression, data, database, storage

As a web developer I have very little understanding of binary data.

If I take the sentence "Hello world.", convert it to binary, and store it as binary in an SQL database, it seems like the 1s and 0s would take up more space than the letters. It seems to me like using letters would sort of be like using compression, where one symbol stands for multiple.

But is that really how it works?

Does storing plain text data take up less space than storing the equivalent message in binary?

Best Answer

Plaintext is binary.

When you write an H to a hard drive, the write head doesn't carve two vertical lines and a horizontal line into the platter; it magnetically encodes the bits 01001000 [1] into the platter.

From there, it should be obvious that storing plain text data takes up exactly the same amount of space as storing binary data.
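To see this for yourself, here is a minimal Python sketch that prints the bits behind a piece of text. The helper name `to_bits` is made up for this illustration, and it assumes UTF-8 unless told otherwise.

```python
def to_bits(text: str, encoding: str = "utf-8") -> str:
    """Return the encoded bytes of `text` as space-separated groups of 8 bits."""
    return " ".join(f"{byte:08b}" for byte in text.encode(encoding))


print(to_bits("H"))                # 01001000
print(to_bits("Hello world."))     # 12 characters -> 12 bytes in UTF-8
print(to_bits("H", "utf-16-be"))   # 00000000 01001000 (different encoding, different bits)
```

The last line shows why the encoding matters: the same letter becomes a different bit sequence under UTF-16.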

But plaintext is just one [2] particular binary format

Plaintext can be reversibly transformed into other binary formats. One common transformation is compression, which usually results in a more compact representation, meaning fewer bits are used to represent the same information.
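As a rough illustration of a reversible transformation, the sketch below compresses some text with Python's standard `zlib` module and then restores it. The repeated sample text is an assumption chosen so that there is redundancy to squeeze out; it is not meant as a benchmark.

```python
import zlib

# Highly repetitive sample text, so the compressor has something to work with.
text = "Hello world. " * 100

plain = text.encode("utf-8")                        # the plaintext bytes
packed = zlib.compress(plain)                       # a different binary format
restored = zlib.decompress(packed).decode("utf-8")  # and back again

print(len(plain), len(packed))  # sizes in bytes before and after compression
print(restored == text)         # True: nothing was lost
```

Note that a very short input, such as a single "Hello world.", can come out larger after compression, because the compressed format carries some fixed overhead.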

Depending on what you're using the plaintext to represent, you may be able to use different binary formats to represent the same information. This may use more space, it may use less.

For example, the numbers 5 and 1234567 could be represented in plaintext using digit characters, resulting in these bit sequences on disk [3]:

00110101 00000000
00110001 00110010 00110011 00110100 00110101 00110110 00110111 00000000

Alternatively, you could use 32-bit two's complement:

00000000 00000000 00000000 00000101
00000000 00010010 11010110 10000111

This is a less compact representation of 5 (4 bytes instead of 2), but a more compact representation of 1234567 (4 bytes instead of 8).
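The two representations above can be reproduced with a short Python sketch. The function names `as_text` and `as_int32` are hypothetical, and big-endian byte order is assumed so that the output matches the bit sequences shown.

```python
import struct


def as_text(n: int) -> bytes:
    """Digits as ASCII characters, followed by a null byte as end marker."""
    return str(n).encode("ascii") + b"\x00"


def as_int32(n: int) -> bytes:
    """32-bit two's complement, most significant byte first."""
    return struct.pack(">i", n)


def bits(data: bytes) -> str:
    """Format bytes as space-separated groups of 8 bits."""
    return " ".join(f"{byte:08b}" for byte in data)


for n in (5, 1234567):
    print(n, "as text: ", len(as_text(n)), "bytes:", bits(as_text(n)))
    print(n, "as int32:", len(as_int32(n)), "bytes:", bits(as_int32(n)))
```

The text form grows with the number of digits, while the fixed-size form is always 4 bytes but cannot hold values outside the 32-bit range.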

There are literally infinitely many other representations, with varying levels of compactness and flexibility, although in practice far fewer are actually used.


[1] Assuming UTF-8. The exact sequence of bits for a character depends on which specific encoding you're using.

[2] Or really, several formats, given the various encodings.

[3] If you're wondering what those eight zeros on the ends are, well, you need some way of knowing how long the data is. The options basically boil down to a marker (I used this, via a null byte), space dedicated to storing the length (Pascal used a byte to store the length of a string), or a fixed size (used in the subsequent two's complement example).
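The first two options in the last footnote can be sketched as a pair of readers in Python. The function names are invented for this example, and the one-byte length prefix is only modeled on the Pascal-style string mentioned above, not taken from any particular implementation.

```python
def read_null_terminated(buffer: bytes) -> bytes:
    """Marker approach: read up to (not including) the first null byte."""
    end = buffer.index(b"\x00")  # raises ValueError if there is no marker
    return buffer[:end]


def read_length_prefixed(buffer: bytes) -> bytes:
    """Length approach: the first byte says how many data bytes follow."""
    length = buffer[0]
    return buffer[1 : 1 + length]


# Both buffers hold "1234567" followed by unrelated trailing bytes.
print(read_null_terminated(b"1234567\x00junk"))  # b'1234567'
print(read_length_prefixed(b"\x071234567junk"))  # b'1234567'
```

The marker approach cannot store data that itself contains a null byte, while a one-byte length prefix caps the data at 255 bytes; the fixed-size approach avoids both issues at the cost of a limited range.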