Byte Representation – How Can the Same Sequence of Bytes Represent Different Data Types?


I have just started studying computer systems and I came across this line. How can the difference in the contexts in which we view data objects make this happen?

Bytes store numbers, and a number can in turn represent a symbol, a character, or something else entirely. So how can the same sequence of bytes represent different things in different contexts?

Best Answer

The 8-bit binary pattern 10000000 (aka 0x80) can represent:

  • in Windows Code Page 1252: the euro sign (€)
  • in Latin-1: a control character
  • in UTF-8: a continuation byte contributing 6 bits of zero to a code point

Even getting the value as a number requires an interpretation: as an unsigned 8-bit integer, 10000000 is 128, while as a signed two's-complement 8-bit integer it is -128.

Some early computers even strung the bits backward, as in the Manchester Baby experimental computer system of the late 1940s.
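To make the numeric interpretations concrete, here is a minimal C sketch (assuming a typical two's-complement machine, which the C standard does not require but which is universal in practice) that prints the single byte 0x80 both ways:

    #include <stdio.h>

    int main(void) {
        unsigned char byte = 0x80;   /* the bit pattern 10000000 */

        /* Interpreted as an unsigned 8-bit integer: 128 */
        printf("unsigned: %u\n", (unsigned)byte);

        /* Interpreted as a signed two's-complement 8-bit integer: -128.
           The conversion below is implementation-defined in C, but yields
           -128 on the two's-complement machines in common use. */
        printf("signed:   %d\n", (int)(signed char)byte);

        return 0;
    }

Handed to a text-rendering routine instead, that same byte would be looked up in whichever character encoding is in effect, as in the list above.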

So, a given bit string can have many different interpretations. Whenever we retrieve or operate on information, we consider its size (the number of bits or bytes), the actual bit pattern (the value), and its type, which describes how that bit pattern is to be interpreted. That way we know how to manipulate the information, whether that means doing arithmetic on a number or displaying a representation of the value (as a symbol or a number) on the screen.

Floating point values generally use either 4 bytes (float) or 8 bytes (double), though other sizes exist. The interpretation of those bits uses fields, dividing the bits into (1) an overall sign bit, (2) an exponent field (stored in biased form rather than with a separate exponent sign bit), and (3) mantissa (significand) bits.
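As an illustrative sketch, assuming the IEEE 754 single-precision layout used by essentially all current hardware, the three fields of a 4-byte float can be pulled apart with shifts and masks:

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        /* Single precision: 1 sign bit, 8 exponent bits, 23 mantissa bits.
           0x40490FDB is the bit pattern of the float closest to pi. */
        uint32_t bits = 0x40490FDBu;

        uint32_t sign     = bits >> 31;            /* 0   -> positive           */
        uint32_t exponent = (bits >> 23) & 0xFFu;  /* 128 -> 128 - 127 bias = 1 */
        uint32_t mantissa = bits & 0x7FFFFFu;      /* fraction bits             */

        printf("sign=%u exponent=%u mantissa=0x%06X\n",
               (unsigned)sign, (unsigned)exponent, (unsigned)mantissa);
        return 0;
    }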

Since 4 and 8 bytes are also popular sizes for integer values, the same 4-byte pattern can be interpreted as either an integer or a float.
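Here is a minimal sketch of that dual reading in C; the constant is just the example pattern from the sketch above, and memcpy is used because it is the portable way in C to reinterpret an object's bytes under another type:

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    int main(void) {
        uint32_t bits = 0x40490FDBu;   /* one particular 4-byte pattern */

        /* Read as an unsigned 32-bit integer: 1078530011 */
        printf("as integer: %u\n", (unsigned)bits);

        /* Copy the same bytes into a float object; on IEEE 754 hardware
           this prints approximately 3.1415927 */
        float f;
        memcpy(&f, &bits, sizeof f);
        printf("as float:   %.7f\n", f);

        return 0;
    }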

There is no way to examine a bit pattern or value to know what type is intended, that is, whether the value is meant to be an integer, a float, a code point or symbol, or something else. The value alone is insufficient; we need something extra to differentiate.

We generally do this the other way around: rather than looking at the bits to guess what they might mean, we, as programmers, tell the computer how to interpret the information. Thus, it is by design that we know how to interpret the value.

Another approach is to tag a value by spending some additional memory on a tag or descriptor; this allows for dynamic interpretation.
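In C this is commonly done with a tagged union; the sketch below is only illustrative, and the type and field names are my own:

    #include <stdio.h>
    #include <stdint.h>

    /* The tag records which interpretation of the union's bytes is intended. */
    enum tag { TAG_INT, TAG_FLOAT };

    struct tagged_value {
        enum tag tag;
        union {
            int32_t i;
            float   f;
        } as;
    };

    static void print_value(const struct tagged_value *v) {
        switch (v->tag) {
        case TAG_INT:   printf("int:   %d\n", (int)v->as.i); break;
        case TAG_FLOAT: printf("float: %f\n", v->as.f);      break;
        }
    }

    int main(void) {
        struct tagged_value a = { TAG_INT,   { .i = 128   } };
        struct tagged_value b = { TAG_FLOAT, { .f = 3.14f } };
        print_value(&a);
        print_value(&b);
        return 0;
    }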

It is generally considered a logic error, or a flaw in program design or programming, to interpret the same value one way at one point and a different way at another.

Programs generally decide in advance how a given 4-byte value is going to be interpreted, since we cannot tell from the value alone. This is part of the type system of programming languages, and in turn of computer instruction sets.

Good type systems prevent illegal and undesired program states and logic errors by preventing the accidental mixing of interpretations; broadly speaking, this is a very serious issue in programming, with a lot of active and ongoing research. Many type systems use a combination of fixed, design-time determination of type together with some form of tagging for more dynamic capabilities. Some type systems also work at preventing other logic errors beyond accidental misinterpretation, such as out-of-bounds array references, null pointer dereferences, memory leaks, violations of memory ownership, and race conditions.

A computer instruction set typically provides different instructions for manipulating items of different sizes under a variety of interpretations (e.g. as signed integers, as unsigned bit patterns, as float or double). A high-level programming language makes use of these varied instructions to accomplish the intent of the program.
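As a small illustration in C (again assuming a two's-complement machine), right-shifting the same 32-bit pattern is compiled as a logical shift when the operand's type is unsigned and, on mainstream compilers, as an arithmetic shift when it is signed:

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        uint32_t bits = 0xFFFFFFF0u;   /* one 32-bit pattern */

        /* Unsigned interpretation: logical shift, zeros come in from the left.
           Prints 268435455. */
        printf("unsigned >> 4: %u\n", (unsigned)(bits >> 4));

        /* Signed interpretation: the pattern is -16 on two's-complement machines.
           Shifting a negative value right is implementation-defined in C, but
           mainstream compilers emit an arithmetic shift, which prints -1. */
        int32_t s = (int32_t)bits;
        printf("signed   >> 4: %d\n", (int)(s >> 4));

        return 0;
    }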