Bytecode Parsing – How Exactly is Bytecode Parsed?

Tags: byte, bytecode, code generation, parsing, virtual machine

How is Bytecode "parsed"?

It is my understanding that bytecode is a binary, intermediate representation of the syntax of a given programming language. Certain programming languages convert their source text into bytecode, which is then written to a file. How do the virtual machines of those languages "parse" their bytecode?

To narrow down this question, take Python's bytecode for instance. When the Python virtual machine begins reading bytecode from a *.pyc file, how does it translate the stream of bytes it is reading into specific instructions?

When the virtual machine reads bytecode from a file, it is my understanding that the bytecode is one long stream of bytes. How, then, is the bytecode broken into useful chunks? How is it transformed into an opcode and the opcode's arguments?

For example, say the virtual machine was reading in the bytecode to add two numbers. The virtual machine sees the instruction 0x05, which would mean "add two numbers".

Each number could be represented by a different number of bytes, so how would the Virtual Machine know how many bytes it would need to read ahead to gather the arguments for the op 0x05?

Best Answer

I think your confusion comes from thinking of bytecode as a language that is being interpreted by the virtual machine. While this is technically a correct way to describe it, it is leading you to some incorrect assumptions.

The first thing to understand is that bytecode is a type of machine code. The only thing that makes it different from the machine code that your CPU understands is that the machine in this case is virtual (though hardware that executes bytecode directly is possible). This might seem like a big distinction, but if you consider what emulators do, whether the target machine is virtual or not isn't really of much importance in the context of the machine language.

Machine code is easy for computers to parse because it is expressly built to be easy to parse. The main distinction between machine languages and the higher-level languages most people are familiar with is that the latter are generally built to be easy for humans to use.

This 1997 article on Java bytecode might help. Let's walk through an example from that text:

84 00 01

The first byte (called the opcode) is 84 (hexadecimal). We can look up what that opcode means and find that it's iinc (increment local variable #index by signed byte const) and that the two following bytes indicate the index of the variable and the amount, respectively. So this instruction increments local variable 0 by 1. The opcode itself tells the JVM how many operand bytes follow, which is how it knows where the next instruction begins. The JVM then takes that instruction and translates it (while following the language specification) into machine instructions that carry out the bytecode instruction.
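
To make the "how many bytes to read ahead" part concrete, here is a minimal sketch of a decoder in Python. It is not how the JVM (or CPython) is actually implemented; it only illustrates the idea that each opcode implies a fixed number of operand bytes. The table covers a handful of real JVM opcodes, while the names OPERAND_BYTES and decode are made up for this illustration.

```python
# Toy decoder for a tiny subset of JVM bytecode (illustration only).
# Each opcode maps to (mnemonic, number of operand bytes that follow it).
OPERAND_BYTES = {
    0x10: ("bipush", 1),  # push a one-byte constant
    0x11: ("sipush", 2),  # push a two-byte constant
    0x15: ("iload", 1),   # load int from local variable #index
    0x36: ("istore", 1),  # store int into local variable #index
    0x60: ("iadd", 0),    # add the two ints on top of the stack
    0x84: ("iinc", 2),    # increment local variable #index by const
    0xB1: ("return", 0),  # return void
}


def decode(code: bytes):
    """Split a byte stream into (offset, mnemonic, operand_bytes) tuples."""
    instructions = []
    pc = 0  # offset of the next opcode to read
    while pc < len(code):
        opcode = code[pc]
        if opcode not in OPERAND_BYTES:
            raise ValueError(f"unknown opcode 0x{opcode:02x} at offset {pc}")
        mnemonic, width = OPERAND_BYTES[opcode]
        operands = code[pc + 1 : pc + 1 + width]
        if len(operands) != width:
            raise ValueError(f"truncated {mnemonic} at offset {pc}")
        instructions.append((pc, mnemonic, operands))
        pc += 1 + width  # skip the opcode and its operands
    return instructions


# The iinc example from above, followed by a few more instructions.
program = bytes.fromhex("84 00 01 10 05 60 b1")
for offset, mnemonic, operands in decode(program):
    print(offset, mnemonic, list(operands))
```

Note that iadd takes no operand bytes at all: the numbers it adds are already on the operand stack, put there by earlier instructions such as bipush. A real JVM also has to handle a few instructions whose length is not fixed by the opcode alone (tableswitch, lookupswitch, and the wide prefix), which this sketch leaves out.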
