Electronic – Would semi-VLIW make sense?

cpu, instruction-set

Typical CPU instruction set: the CPU has several functional units, and when each instruction is read, some bits specify which functional unit is to be activated, while others specify the details of the operation.
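As a purely illustrative sketch (the field widths, unit names and function are all made up, not taken from any real ISA), conventional decode amounts to something like this: a couple of bits select the one functional unit, and the remaining bits describe that unit's operation.

    #include <stdint.h>

    /* Hypothetical conventional 32-bit encoding: a few bits pick ONE
       functional unit per instruction; the remaining bits describe that
       single operation (registers, immediate, etc.). */
    enum unit { UNIT_ALU, UNIT_FPU, UNIT_LOADSTORE, UNIT_BRANCH };

    typedef struct {
        enum unit which;    /* which single functional unit is activated */
        uint32_t  detail;   /* the details of the operation */
    } decoded;

    decoded decode_conventional(uint32_t insn)
    {
        decoded d;
        d.which  = (enum unit)(insn >> 30);   /* top 2 bits select the unit */
        d.detail = insn & 0x3FFFFFFFu;        /* remaining 30 bits: the operation */
        return d;
    }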

Today's CPUs tend to spend some of their large transistor budgets on out-of-order execution: they keep a bunch of instructions in flight at once, trying to keep multiple functional units busy while preserving the illusion of a serial instruction stream for backward compatibility. For the purposes of this question, to keep it simple, say we are talking about 90s technology with no more than a few million transistors.

VLIW: let's save time and transistors on decoding, and keep all the functional units busy, by having an instruction word that has a group of bits for each functional unit, specifying work for them all at the same time.
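A minimal sketch of that idea, assuming an invented 64-bit layout (none of these field positions or widths come from a real machine): decode is just field extraction, and every functional unit gets a slot every cycle.

    #include <stdint.h>

    /* Hypothetical 64-bit VLIW word carved into one fixed slot per
       functional unit, all issued on the same clock cycle. */
    #define ALU_SLOT(w)    ((uint32_t)((w) >> 44) & 0xFFFFFu) /* bits 63..44: integer op   */
    #define FPU_SLOT(w)    ((uint32_t)((w) >> 24) & 0xFFFFFu) /* bits 43..24: FP op        */
    #define MEM_SLOT(w)    ((uint32_t)((w) >> 10) & 0x3FFFu)  /* bits 23..10: load/store   */
    #define BRANCH_SLOT(w) ((uint32_t)(w)         & 0x3FFu)   /* bits  9..0 : control flow */

    /* "Decoding" is just slicing the word; each slot goes straight to its unit. */
    void issue(uint64_t w)
    {
        uint32_t alu = ALU_SLOT(w), fpu = FPU_SLOT(w);
        uint32_t mem = MEM_SLOT(w), br  = BRANCH_SLOT(w);
        (void)alu; (void)fpu; (void)mem; (void)br;  /* hand each field to its unit in parallel */
    }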

One problem with VLIW is that it exposes more hardware details, making it difficult to keep backward compatibility across subsequent CPU generations without losing the advantages. Let's say we get around that problem with bytecode and a JIT compiler.

Another problem is that not all workloads want to use all the functional units, and that's what I'm looking at now.

Let's say we're designing a CPU for an early-90s workstation, in the era of the 486 and i860. Roughly speaking, there will be two kinds of workloads: number crunching (e.g. CAD, simulations and of course 3D games) that uses the floating-point units, and other stuff (e.g. word processing, compiling) that doesn't; this roughly corresponds to SPECfp vs SPECint.

For number-crunching workloads, we could design a straight VLIW, maybe with a 64-bit instruction word that specifies integer operations, floating-point operations, control flow, etc. all at once, and that will give great performance. But for integer workloads it will have poor code density, because the bits specifying floating-point operations will always be saying 'no FP ops this clock cycle'.

Would it make sense for the CPU to have a mode bit to switch between the two kinds of workloads? In FP mode, it works like the above, but in integer mode, maybe each 64-bit word just supplies a pair of 32-bit integer instructions? Would that give most of the simplicity and performance advantages of VLIW without the potential code density disadvantage?
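To make the question concrete, here is a rough sketch of the proposed mode bit, with invented field layouts on both sides: in FP mode the 64-bit word is one full VLIW bundle, in integer mode it is simply two ordinary 32-bit integer instructions.

    #include <stdint.h>

    typedef enum { MODE_FP, MODE_INT } cpu_mode;   /* the proposed mode bit */

    void decode_word(cpu_mode mode, uint64_t word)
    {
        if (mode == MODE_FP) {
            /* One VLIW bundle: integer, FP and control-flow slots every cycle
               (field widths are invented for illustration). */
            uint32_t int_slot = (uint32_t)(word >> 40) & 0xFFFFFFu;   /* bits 63..40 */
            uint32_t fp_slot  = (uint32_t)(word >> 12) & 0x0FFFFFFFu; /* bits 39..12 */
            uint32_t cf_slot  = (uint32_t)(word)       & 0xFFFu;      /* bits 11..0  */
            (void)int_slot; (void)fp_slot; (void)cf_slot;
        } else {
            /* Two independent 32-bit integer instructions: no bits wasted
               saying "no FP op this cycle". */
            uint32_t insn0 = (uint32_t)(word >> 32);
            uint32_t insn1 = (uint32_t)(word & 0xFFFFFFFFu);
            (void)insn0; (void)insn1;
        }
    }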

Best Answer

Go read the book, "Bulldog: A Compiler for VLIW Architectures," by John R. Ellis. It is absolutely superb.

As you might suspect, one of the huge problems with VLIW is about the compilers themselves. How do you get compiler authors (I've been one of them for a short period of my life) to decide to add in all the needed stuff to make VLIW work out?

You admit that "one problem with VLIW is it exposes more hardware details," but you failed to take that in the direction you needed to go. VLIW exposes the hardware details to the compiler writers!! And they are the ones you need to support your CPU!

Frankly, that's going to be very hard. It's already been like pulling teeth to get far more "run of the mill" optimizations into their compilers. Some have happened, such as partial template specialization for C++ (very much needed to reduce code "bloat"). But even then, it took forever. And there are techniques I learned about in the 1970s that still haven't found their way into compilers today, or are only just barely finding purchase.

For example, John's Bulldog compiler performed transformations that would move code across edges (conditional boundaries generated by an "if" statement). Code that followed an "if" could be moved above the "if" and executed speculatively, tossing the results if the "if" went the other way. The coder could provide "hints" that gave the compiler some added information about which condition was more likely to occur. The compiler also included features to deal with DRAM banks, so that data could be placed in a way that minimized fetch times (cache was more expensive then), based on the order of access in loop structures.
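To give a flavor of that transformation in source terms (a toy example of my own, not one from the book): the compiler effectively rewrites the first function into the second, so the multiply can be scheduled into an otherwise idle slot before the branch resolves, and its result simply discarded if the branch goes the other way.

    /* Before: the work is hidden behind the branch. */
    int before(int cond, int a, int b)
    {
        int r = 0;
        if (cond)
            r = a * b;          /* only executed on one side of the "if" */
        return r;
    }

    /* After speculative code motion: the multiply is hoisted above the "if"
       and executed unconditionally; the result is tossed when the branch
       goes the other way. */
    int after(int cond, int a, int b)
    {
        int speculative = a * b;    /* fills an otherwise empty VLIW slot */
        return cond ? speculative : 0;
    }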

It's really a very good book to read and I think you'd enjoy it.

But the key here is getting compiler writers, those in control of compilers that people actually use a lot, to implement all this.


To give you a flavor of just how all this played out, note that the MIPS R2000 RISC processor could beat the daylights out of the Intel x86 core in 1986. (I was there, working on R2000 plug-in boards for the IBM PC and writing operating system code for all this, and I spent a while at MIPS getting 1:1 training directly from John Hennessy.) And they used FAB technology that was ages old to do it, too. Intel and Motorola then had the best FABs in the world (and would not share them, of course) and MIPS had to use hand-me-down FAB technology with larger feature sizes and a tenth or less of the transistor count. And they still slaughtered Intel's x86 by a mile.

So what did Intel do? Well, of course they started up their own RISC projects. But what really changed the game was all the things you already talked about in your question. (I was working at Intel on their chipsets while this was happening.)

Rather than throw away the x86, they leveraged their huge FAB advantage and instead added parallel instruction decoders so they could decode up to three instructions per clock. They added a RISC engine (yup, just created one and popped it into a corner of their die) and a "re-order buffer" (ROB) to hold the RISC instructions that the decoders produced from the CISC x86 instructions. They added reservation stations so that they could run functional units in parallel. They added a retire unit so that, while they executed the instructions out of order, they had a means to put everything back into order in terms of how it appeared to the outside. And the retire unit could retire three RISC instructions per clock, too. They added branch prediction, caches (several levels), and a lot more.
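Caricaturing heavily (the sizes and structure below are invented, not Intel's actual design), the retire end of that machinery behaves something like this: micro-ops sit in the re-order buffer and may finish execution in any order, but they leave it strictly in program order, at most three per clock.

    #define ROB_SIZE 40            /* invented capacity */

    typedef struct {
        int valid;                 /* slot holds a decoded micro-op      */
        int executed;              /* its functional unit has finished   */
    } rob_entry;

    typedef struct {
        rob_entry rob[ROB_SIZE];
        int head, tail;            /* retire from head, allocate at tail */
    } machine;

    /* Retire stage: pull completed micro-ops off the head, in order,
       at most three per clock, so the outside world sees a serial stream. */
    int retire_one_clock(machine *m)
    {
        int retired = 0;
        while (retired < 3 &&
               m->rob[m->head].valid &&
               m->rob[m->head].executed) {
            m->rob[m->head].valid = 0;
            m->head = (m->head + 1) % ROB_SIZE;
            retired++;
        }
        return retired;
    }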

In short, they THREW transistors at the problem. And they had them to spare, too.

In the end, they were able to hold out long enough that MIPS, the Motorola 88k, the DEC Alpha, and their own internal RISC projects all became pointless again. The parallel growth in FAB technology, going from million-transistor dice to billion-transistor dice, made the early "bare metal" advantages disappear. It didn't take that long, actually; maybe 10 years before those initiatives became far less relevant.

And so here we are.

If you could get significant compiler development effort to support VLIW, you might get some traction with the then-exposed-to-view functional units. But really? Today there isn't that much of an advantage left that cannot be addressed by just throwing some corner of the die at it. There was a day when that meant something. But today the problem isn't getting the most out of the hardware you have; it's what to do with all the transistors they now have available. For gosh sake, the best thing they can do with some of them is just turn them into cache.

And the compiler writers are going to be similarly "not interested," as well.

In my opinion the world has changed here. They have so many transistors that they have little idea what to do with them. They no longer have the problem they once had, of too few transistors that needed to be put to the best possible use.
