Why was the Itanium processor difficult to write a compiler for?

compiler, history

It's commonly stated that Intel's Itanium 64-bit processor architecture failed because the revolutionary EPIC instruction set was very difficult to write a good compiler for, which meant a lack of good developer tools for IA64, which meant a lack of developers creating programs for the architecture, and so no one wanted to use hardware without much software for it, and so the platform failed, and all for want of a horseshoe nail: good compilers.

But why was the compiler stuff such a difficult technical problem? It seems to me that if the explicit parallelism in EPIC was difficult for compiler vendors to implement… why put that burden on them in the first place? It's not like a good, well-understood solution to this problem didn't already exist: put that burden on Intel instead and give the compiler-writers a simpler target.

The IA-64 architecture was announced in 1997 (the first Itanium processors did not ship until 2001). By that point, the UCSD P-Code bytecode system was nearly 20 years old, the Z-machine just slightly younger, and the JVM was the hot new rising star in the world of programming languages. Is there any reason why Intel didn't specify a "simple Itanium bytecode" language, and provide a tool that converts this bytecode into optimized EPIC code, leveraging their expertise as the folks who designed the system in the first place?

Best Answer

The Wikipedia article on EPIC has already outlined the many perils common to VLIW and EPIC.

If anyone does not catch the sense of fatalism from that article, let me highlight this:

Load responses from a memory hierarchy which includes CPU caches and DRAM do not have a deterministic delay.

In other words, any hardware design that fails to cope with (*) the non-deterministic latency from memory access will just become a spectacular failure.

(*) By "cope with", I mean achieving reasonably good execution performance (in other words, remaining "cost-competitive"), which requires not letting the CPU fall idle for tens to hundreds of cycles every so often.

Note that the coping strategy employed by EPIC (mentioned in the Wikipedia article linked above) does not actually solve the issue. It merely says that the burden of indicating data dependency now falls on the compiler. That's fine; the compiler already has that information, so it is straightforward for the compiler to comply. The problem is that the CPU is still going to idle for tens to hundreds of cycles over a memory access. In other words, it externalizes a secondary responsibility, while still failing to cope with the primary responsibility.

The question can be rephrased as: "Given a hardware platform that is destined to be a failure, why (1) didn't, and (2) couldn't, the compiler writers make a heroic effort to redeem it?"

I hope my rephrasing will make the answer to that question obvious.


There is a second aspect of the failure which is also fatal.

The coping strategies (mentioned in the same article) assume that software-based prefetching can be used to recover at least part of the performance loss due to the non-deterministic latency of memory access.

In reality, prefetching is only profitable if you are performing streaming operations (reading memory in a sequential, or highly predictable manner).

(That said, if your code makes frequent access to some localized memory areas, caching will help.)
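To show what the profitable, streaming case looks like, here is a minimal sketch of mine (not part of the original answer) in C. It uses the GCC/Clang `__builtin_prefetch` builtin; the function name and the prefetch distance of 16 elements are illustrative assumptions, and the point is that prefetching only works here because every future address is trivially predictable.

```c
/* Streaming sum with software prefetch: each address is known far in
 * advance, so the fetch can be issued long before the data is needed.
 * The distance of 16 elements is an arbitrary illustrative value.    */
long sum_array(const long *a, long n) {
    long sum = 0;
    for (long i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16]);  /* GCC/Clang builtin */
        sum += a[i];
    }
    return sum;
}
```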

However, most general-purpose software must make plenty of random memory accesses. If we consider the following steps:

  • Calculate the address, and then
  • Read the value, and then
  • Use it in some calculations

For most general-purpose software, these three must be executed in quick succession. In other words, it is not always possible (within the confines of software logic) to calculate the address up front, or to find enough work to do to fill up the stalls between these three steps.
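To make that dependency chain concrete, here is a hedged C sketch of my own (not from the original answer): a linked-list traversal in which the address of each node is itself the result of the previous load, so the three steps above form a strictly serial chain that no static instruction schedule can overlap with other work from the same loop.

```c
#include <stddef.h>

struct node { struct node *next; long value; };

long sum_list(const struct node *head) {
    long sum = 0;
    for (const struct node *p = head; p != NULL; p = p->next) {
        /* (1) the address of the next node (p->next) only becomes
         *     known after this node has been loaded from memory;
         * (2) reading p->value is a load that may miss the cache;
         * (3) the add consumes the loaded value immediately.       */
        sum += p->value;
    }
    return sum;
}
```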

To help explain why it is not always possible to find enough work to fill up the stalls, here is how one could visualize it.

  • Let's say that, to effectively hide the stalls, we need to fill them with 100 instructions which do not depend on memory (and so will not suffer from additional latency).
  • Now, as a programmer, please load up any software of your choice into a disassembler. Choose a random function for analysis.
  • Can you identify, anywhere, a sequence of 100 instructions (*) that is entirely free of memory accesses?

(*) If we could ever make NOP do useful work ...


Modern CPUs cope with the same problem using dynamic information - by concurrently tracking the progress of each instruction as it circulates through the pipelines. As I mentioned above, part of that dynamic information is driven by non-deterministic memory latency, and therefore cannot be predicted to any degree of accuracy by compilers. In general, there is simply not enough information available at compile time to make decisions that could possibly fill up those stalls.
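As a small illustration (my sketch, with hypothetical names): in the gather loop below, whether each access to `table` hits or misses the cache depends entirely on the run-time contents of `idx`. That is exactly the kind of information an out-of-order CPU observes while executing, but which a compiler scheduling EPIC bundles can never see.

```c
/* Whether table[idx[i]] hits or misses the cache depends on the
 * run-time contents of idx[] -- information that only exists at
 * execution time, never at compile time.                          */
long gather_sum(const long *table, const int *idx, int n) {
    long sum = 0;
    for (int i = 0; i < n; i++)
        sum += table[idx[i]];
    return sum;
}
```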


In response to the answer by AProgrammer

It is not that "compiler ... extracting parallelism is hard".

The fact that modern compilers routinely reorder memory and arithmetic instructions is evidence that they have no problem identifying operations that are independent, and thus concurrently executable.
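For instance (an illustrative sketch of mine, not from the answer), any optimizing compiler can see that the four sub-expressions below depend only on the inputs and not on each other, so they may be computed in any order, or bundled together by a VLIW scheduler. Identifying that independence is the easy part.

```c
/* a, b, c and d are mutually independent, so a compiler is free to
 * reorder or bundle these operations -- yet each of the four loads
 * can still stall for an unpredictable number of cycles.           */
long combine(const long *p, const long *q) {
    long a = p[0] * 3;
    long b = p[1] + 7;
    long c = q[0] ^ q[1];
    long d = q[2] << 2;
    return a + b + c + d;
}
```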

The main problem is that non-deterministic memory latency means that whatever "instruction pairing" one has encoded for the VLIW/EPIC processor will end up being stalled by memory access.

Optimizing instructions that do not stall (register-only, arithmetic) will not help with the performance issues caused by instructions that are very likely to stall (memory access).

It is an example of failure to apply the 80-20 rule of optimization: Optimizing things that are already fast will not meaningfully improve overall performance, unless the slower things are also being optimized.
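As a rough illustration of the arithmetic (my numbers, purely hypothetical): if 80% of a loop's cycles are spent stalled on memory and only 20% on register-only arithmetic, then even eliminating the arithmetic entirely yields at most a 1 / 0.8 = 1.25x speedup.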


In response to answer by Basile Starynkevitch

It is not "... (whatever) is hard", it is that EPIC is unsuitable for any platform that has to cope with high dynamism in latency.

For example, if a processor has all of the following:

  • No direct memory access;
    • Any memory access (read or write) has to be scheduled by DMA transfer;
  • Every instruction has the same execution latency;
  • In-order execution;
  • Wide / vectorized execution units;

Then VLIW/EPIC will be a good fit.

Where does one find such processors? In DSPs. And that is where VLIW has flourished.


In hindsight, the failure of Itanium (and the continued pouring of R&D effort into a failure, despite obvious evidence) is an example of organizational failure, and deserves to be studied in depth.

Granted, the vendor's other ventures, such as hyperthreading and SIMD, appear to have been highly successful. It is possible that the investment in Itanium had an enriching effect on the skills of its engineers, enabling them to create the next generation of successful technology.