Bytecode – How Are Literal Values Encoded?

bytecode, compilation, design, design-patterns, language-design

Note: This question is somewhat related to How exactly is bytecode "parsed"?, but it is not a duplicate of it. In this question, I'm asking about a specific part of how bytecode is generated, not how bytecode is "parsed".


As stated in the title, how are literals (such as strings, integers, etc.) encoded into bytecode files? My confusion comes from the fact that the byte representation of any given literal is of variable length, which means that a virtual machine would have no idea how many bytes it needs to gather in order to read the entire literal. If my question is still not clear, I believe a visual example will help illustrate my confusion.

Consider this example. A parser has just constructed an abstract syntax tree. It converted the expression 3 + 2 into:

   +
  / \
 3   2

Your compiler/interpreter now begins to walk the tree. It generates the following bytecode:

 PUSH          3            PUSH        2         ADD
  |            |             |          |          |
|-----| |--------------|  |-----|  |----------| |-----|
b'\x00' b'\x00\x00\x03'   b'\x00'  b'\x00\x02'  b'\x05'

Your virtual machine then begins reading in the bytecode file. It reads the first byte, and sees that it is the opcode PUSH. It now needs to read the argument to the opcode PUSH.

But here is the problem: the virtual machine has no way of knowing how many bytes it needs to read to get the entire argument to PUSH. The argument to PUSH is a variable number of bytes, so the virtual machine does not know how many bytes it needs to read for each argument. As seen in the pseudo-bytecode above, the number of bytes used to represent different values can vary, and is not consistent.

And while the above example only uses integers, the same problem applies to other things, such as strings or string representations of identifier names.
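Here is a rough sketch, in Python, of the dispatch loop I have in mind, and where it gets stuck; the opcode values are just the placeholders from the diagram above:

    PUSH, ADD = 0x00, 0x05   # placeholder opcode values from the diagram

    def run(code: bytes):
        stack, pc = [], 0
        while pc < len(code):
            op = code[pc]; pc += 1
            if op == PUSH:
                # Stuck: how many of the following bytes belong to this
                # literal? One? Two? Four? The stream itself doesn't say.
                raise NotImplementedError("operand length unknown")
            elif op == ADD:
                b, a = stack.pop(), stack.pop()
                stack.append(a + b)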


I tried searching various blogs and even the official documentation of some languages' bytecode, but I still have not found an explanation of how literals are encoded.

The closest information I have come across was a sentence from this answer given by Ratchet Freak to the question I linked to above. It reads:

To give an example that makes the bytes-per-operation very explicit there is SPIR-V. The first 4-byte word of each instruction is constructed as 2-byte length + 2-byte opcode.

It seems that what he is saying is that SPIR-V forces all opcodes and their arguments to be compressed or expanded to fill two bytes. While I suppose one could do this, I'm fairly sure this is not what he meant.

What is the common practice for encoding literal values, whose byte representations are of variable length, into bytecode files? Of course, I assume that there is a common practice, but perhaps each language does it differently?

Best Answer

Your question applies more broadly than bytecode systems: it applies to instruction set architecture in general, whether hardware or bytecode.

What is the common practice when encoding literal values, whose byte representations are of variable length?

There are about half a dozen reasonable techniques.

  • The opcode tells you the number of bytes of the literal that follow it, which usually means there are several otherwise identical opcodes, one per size. Note that the opcode must (somehow) encode the size or type of the operand manipulation (e.g. push 32-bit int), which can be done together with or separately from encoding the size/count of the literal data bytes (often called an immediate) following the opcode. When these differ (often the immediate literal in the instruction is shorter than the operand type), the byte(s) following the opcode are expanded per the definition of the opcode (e.g. using sign extension) from the size of the immediate to the size of the operand type. (A minimal decoder using this scheme is sketched after this list.)

  • There are bits after the opcode, considered separate from the opcode itself, that give the size of the literal (and/or the format of all operands). When an instruction set groups sub-opcodes together, bits beyond the major opcode sometimes describe the various operands. (See the descriptor-byte sketch after this list.)

  • A variant of the last: each operand has its own separate descriptor (possibly grouped together before the literal). These are typical of CISC-style register machines with multiple-operand instructions, such as the VAX's addl3 (three-operand add long).

  • There are bits in the literal itself that tell whether more of the literal follows; for example, one bit of each byte can be dedicated to indicating whether another byte comes next, so each literal byte yields 7 payload bits. This is somewhat hostile to (software) interpretation performance, but hardware can decode it better than the naive approach would suggest. If you are writing a JIT instead of an interpreter, it may work fine. (See the varint sketch after this list.)

  • An indirection of some sort is used, and the literal is stored elsewhere. This is the case, for example, with strings in Java/C# bytecode: in Java, the push-string opcode takes an index into the constant pool. Application binary interfaces often designate a machine register or an accessible global location for larger constants like strings, or other 32-bit, 64-bit, or larger blob constants. (Sketched after this list.)

  • Sometimes a literal is large enough, or a complex enough bit pattern, that multiple instructions are used to assemble it. Some architectures provide a load-immediate that puts its literal operand into the high bits of the register (or stack); a regular add-immediate then brings in the low bits of the literal. This is typically found in architectures that use fixed-size instructions. (Sketched after this list.)
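Here is a minimal Python sketch of the first technique, where the opcode itself fixes the width of the immediate. The opcode names and values are hypothetical:

    import struct

    PUSH_I8, PUSH_I32, ADD = 0x01, 0x02, 0x05   # hypothetical opcodes

    def run(code: bytes):
        stack, pc = [], 0
        while pc < len(code):
            op = code[pc]; pc += 1
            if op == PUSH_I8:
                # 1-byte immediate, sign-extended to the full operand type
                stack.append(struct.unpack_from('<b', code, pc)[0]); pc += 1
            elif op == PUSH_I32:
                # 4-byte little-endian immediate
                stack.append(struct.unpack_from('<i', code, pc)[0]); pc += 4
            elif op == ADD:
                b, a = stack.pop(), stack.pop()
                stack.append(a + b)
        return stack

Since 3 and 2 both fit in one byte, the compiler can emit the short form: run(bytes([PUSH_I8, 3, PUSH_I8, 2, ADD])) leaves [5] on the stack.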
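A sketch of the second (and third) technique: a descriptor byte follows the opcode, and its low two bits give the operand size. The layout here is made up for illustration:

    PUSH = 0x01                      # hypothetical opcode
    SIZES = {0: 1, 1: 2, 2: 4}       # descriptor's low 2 bits -> byte count

    def decode_push(code: bytes, pc: int):
        assert code[pc] == PUSH
        descriptor = code[pc + 1]
        size = SIZES[descriptor & 0x03]
        literal = int.from_bytes(code[pc + 2:pc + 2 + size], 'little')
        return literal, pc + 2 + size

    # decode_push(bytes([PUSH, 0x01, 0x34, 0x12]), 0) -> (0x1234, 4)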
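A sketch of the continuation-bit technique: 7 payload bits per byte, with the high bit meaning "more bytes follow". This is the unsigned LEB128 scheme, used for instance by DWARF and WebAssembly:

    def read_varint(code: bytes, pc: int):
        result, shift = 0, 0
        while True:
            byte = code[pc]; pc += 1
            result |= (byte & 0x7F) << shift   # low 7 bits are payload
            shift += 7
            if not (byte & 0x80):              # high bit clear: done
                return result, pc

    # 300 encodes as 0xAC (low 7 bits, continuation bit set) then 0x02:
    # read_varint(bytes([0xAC, 0x02]), 0) -> (300, 2)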
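A sketch of the indirection technique: the instruction carries only a fixed-width index, and the variable-length literal lives in a constant pool shipped alongside the code (as Java's ldc does). The opcode name and value are hypothetical:

    LOAD_CONST = 0x12                # hypothetical opcode

    def run(code: bytes, constants: list):
        stack, pc = [], 0
        while pc < len(code):
            op = code[pc]; pc += 1
            if op == LOAD_CONST:
                index = code[pc]; pc += 1   # index is always exactly one byte
                stack.append(constants[index])
        return stack

    # The string's length is the constant pool's problem, not the decoder's:
    # run(bytes([LOAD_CONST, 0]), ["hello, world"]) -> ["hello, world"]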
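Finally, a sketch of assembling a wide literal from two fixed-size instructions, in the spirit of RISC-V's lui/addi pair. The opcode names and values are hypothetical:

    LOAD_HIGH, ADD_LOW = 0x20, 0x21  # hypothetical; each takes a 2-byte immediate

    def run(code: bytes) -> int:
        acc, pc = 0, 0
        while pc < len(code):
            op = code[pc]; pc += 1
            imm = int.from_bytes(code[pc:pc + 2], 'little'); pc += 2
            if op == LOAD_HIGH:
                acc = imm << 16      # high half of a 32-bit literal
            elif op == ADD_LOW:
                acc += imm           # low half
        return acc

    # run(bytes([LOAD_HIGH, 0x34, 0x12, ADD_LOW, 0x78, 0x56])) -> 0x12345678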
