Electronic – Logic operations in modern processors

digital-logic

Why do most processors provide only a small set of logic operations (NAND/NOR/XOR) when there are 14 non-trivial two-input logic gates available? I understand that any Boolean logic can be implemented with only NAND or NOR. But in practice, providing fewer gates means operations such as NIMPLY/IMPLY take more than one cycle to finish, because we have to compose them from other gates. It seems to me that if the ALU provided the whole set of hardware-optimized logic gates, we might be able to improve software performance. Fan-out might be an issue if all the gates' outputs drive the same bus, but logic operations aren't usually the bottleneck of a modern processor: much slower operations such as multiplication limit the clock cycle. So my question is: why do we seldom see processors provide the full set of logic gates? What are the considerations involved? How do processor designers arrive at the optimal number of logic operations the ALU provides?
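For example, with only the usual AND/OR/XOR/NOT-style instructions, IMPLY and NIMPLY each cost two operations. A minimal C sketch of what a compiler has to emit (the helper names are mine, purely for illustration):

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helpers: with no single-cycle IMPLY/NIMPLY instruction, a
   compiler has to emit two bitwise operations for each of them.          */
static inline uint32_t imply(uint32_t a, uint32_t b)  { return ~a | b; } /* a -> b      */
static inline uint32_t nimply(uint32_t a, uint32_t b) { return a & ~b; } /* a AND NOT b */

int main(void) {
    /* One-bit truth-table check (operands restricted to 0/1). */
    for (uint32_t a = 0; a <= 1; a++)
        for (uint32_t b = 0; b <= 1; b++)
            printf("a=%u b=%u  imply=%u  nimply=%u\n",
                   (unsigned)a, (unsigned)b,
                   (unsigned)(imply(a, b) & 1u),
                   (unsigned)(nimply(a, b) & 1u));
    return 0;
}
```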

Best Answer

If you would like to understand how processor designers reach their 'optimal' instruction set, you might want to look at some books, for example "Computer Architecture, A Quantitative Approach" by Hennessy & Patterson.

There are quite a lot of papers freely available on the web about computer architecture, and Instruction Set Architecture (ISA).

I'd suggest reading the papers and watching the videos at http://riscv.org

As a concrete example, look at The RISC-V Compressed Instruction Set Manual.

Their process for the "RISC-V Compressed Instruction Set Manual" was to gather sets of recognised, representative 'benchmark' programs, written in high-level languages like C, and run analysis across them. For example, they might take an existing C compiler, modify its code generation to target their prototype ISA, pass their benchmark programs through it, and analyse the results. That relies to some extent on the compiler having a rich enough model to make use of the ISA's instructions.
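As a rough illustration of that kind of frequency analysis (a toy sketch, not the RISC-V team's actual tooling), the C program below tallies how often each mnemonic appears in a listing; the one-mnemonic-per-line input format is an assumption made to keep the sketch short:

```c
#include <stdio.h>
#include <string.h>

/* Toy static-frequency counter: reads one mnemonic per line on stdin
   (an assumed, pre-extracted format) and prints a histogram.           */

#define MAX_OPS 512

struct entry { char name[32]; unsigned long count; };
static struct entry table[MAX_OPS];
static int n_entries;

int main(void) {
    char line[128];
    while (fgets(line, sizeof line, stdin)) {
        char op[32];
        if (sscanf(line, "%31s", op) != 1) continue;   /* skip blank lines */
        int i;
        for (i = 0; i < n_entries; i++)
            if (strcmp(table[i].name, op) == 0) break;
        if (i == n_entries && n_entries < MAX_OPS) {
            strcpy(table[n_entries].name, op);
            n_entries++;
        }
        if (i < n_entries) table[i].count++;
    }
    for (int i = 0; i < n_entries; i++)
        printf("%-12s %lu\n", table[i].name, table[i].count);
    return 0;
}
```

Real studies run this kind of count over large benchmark suites, and look at dynamic (executed) frequency as well as static frequency.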

In the specific case of that compressed instruction set analysis, they wanted to identify which instructions were sufficiently common that providing a compressed encoding would reduce code size significantly. However, that analysis uses techniques which have been applied many times before.

The compressed instruction set analysis showed that only 31 instructions accounted for the vast majority of code, so compressing those by 50% reduced total program size by about 25%. Given that quite a lot of a program is data rather than instructions, that suggests the remaining instructions are rarely used.
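The arithmetic behind that is simple: if a fraction of the program's bytes belongs to instructions that gain a half-size encoding, the overall saving is half that fraction. A small sketch of the model (the coverage figures are illustrative, not the published RISC-V numbers):

```c
#include <stdio.h>

/* Back-of-the-envelope model of the compressed-ISA saving: a fraction
   `covered` of the program bytes belongs to instructions that gain a
   16-bit (half-size) encoding; the remaining bytes are unchanged.       */
int main(void) {
    const double per_instr_saving = 0.50;   /* 32-bit -> 16-bit encoding */
    for (double covered = 0.1; covered <= 0.91; covered += 0.2) {
        double total_saving = covered * per_instr_saving;
        printf("coverage %.0f%%  ->  overall size reduction %.0f%%\n",
               covered * 100.0, total_saving * 100.0);
    }
    return 0;
}
```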

Remember, adding more instructions and expanding the ALU is not necessarily free. To maximise throughput (i.e. useful work done per unit time) we try to strike a balance between latency (delay) through the ALU, power consumption, and overall instruction time. Adding an instruction which makes the CPU run 5% slower on every instruction so that 0.1% of the code runs 2x faster makes no sense.

This is a version of Amdahl's Law, which essentially says that no matter how much one part of a system is improved, if that part only accounts for 1% of a program's resources (time or space), then the improvement can only benefit the system's overall performance by at most 1%.
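To make that concrete, here is the earlier 5% / 0.1% / 2x example pushed through Amdahl's Law (a sketch using the illustrative figures from above, not measured data):

```c
#include <stdio.h>

/* Amdahl's Law: overall speedup when a fraction p of the run time is
   accelerated by a factor s, i.e.  speedup = 1 / ((1 - p) + p / s).
   Applied to the example above: the new instruction doubles the speed of
   0.1% of the work, but lengthens every cycle by 5%.                     */
int main(void) {
    const double p = 0.001;             /* fraction of time that benefits */
    const double s = 2.0;               /* speedup of that fraction       */
    const double cycle_penalty = 1.05;  /* every instruction 5% slower    */

    double amdahl   = 1.0 / ((1.0 - p) + p / s);  /* gain from the new op  */
    double combined = amdahl / cycle_penalty;     /* after the global cost */

    printf("gain from the new instruction alone: %.4fx\n", amdahl);
    printf("net effect with the 5%% cycle-time penalty: %.4fx\n", combined);
    return 0;
}
```

The new instruction buys about 0.05%, while the cycle-time penalty costs 5%, so the net effect is roughly a 4.7% slowdown.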

The vast majority of instructions executed by a program are loads, stores, simple arithmetic, jumps and conditional branches. As long as those execute well, adding a few specialised instructions to make rarely used operations faster will make virtually no difference. Worse, if adding the specialised instructions slows the common instructions by even a tiny amount, then it's likely better to leave them out.

To learn more, look for older (1980s) RISC research papers. Maybe start with John L. Hennessy and David Patterson, as they were quite prolific publishers; Hennessy founded MIPS, and Patterson's Berkeley RISC work became Sun's SPARC.