Is it realistic to expect a full Spectre fix in the branch predictors of future CPUs?

computer-architecture, microprocessor

Recently, it has been observed that two branches sharing the same branch predictor state, within one process or even across processes, allow certain side-channel exploits (Spectre).

Let's consider a basic branch predictor. It has 2^14 slots, and each slot consists of a two-bit saturating counter (strongly taken, weakly taken, weakly not taken, strongly not taken). The 14 index bits are the low-order bits of the branch instruction's address. Thus, the branch predictor holds 2*2^14 = 32768 bits, or 4 kilobytes.
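A minimal software sketch of this baseline predictor (the naming and layout are my own; real hardware stores the two counter bits directly rather than a byte per slot) might look like this:

```c
#include <stdint.h>

#define INDEX_BITS 14
#define NUM_SLOTS  (1u << INDEX_BITS)   /* 2^14 = 16384 slots */

/* One 2-bit saturating counter per slot:
 * 0 = strongly not taken, 1 = weakly not taken,
 * 2 = weakly taken,       3 = strongly taken.      */
static uint8_t counters[NUM_SLOTS];

/* Index = the low-order 14 bits of the branch instruction's address. */
static inline uint32_t slot_index(uint64_t branch_addr)
{
    return (uint32_t)(branch_addr & (NUM_SLOTS - 1));
}

static inline int predict_taken(uint64_t branch_addr)
{
    return counters[slot_index(branch_addr)] >= 2;
}

/* Train the counter toward the actual outcome, saturating at 0 and 3. */
static inline void train(uint64_t branch_addr, int taken)
{
    uint8_t *c = &counters[slot_index(branch_addr)];
    if (taken  && *c < 3) (*c)++;
    if (!taken && *c > 0) (*c)--;
}
```

Conceptually the state is 2 bits × 16384 slots = 32768 bits, i.e. the 4 kilobytes above.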

Now, there's a problem: when switching between user space processes, the branch predictor state is not flushed. This can be fixed by adding a branch predictor flush instruction that the operating system executes on every context switch.

There's another problem: two branches in the same process address space can share the same slot. If one of these branches is untrusted JITed code and the other is branching on one bit of a secret password, we're in trouble. This could presumably be fixed by flushing the branch predictor state before running untrusted JITed code. However, if the process flushes the branch predictor too often, its benefits are lost.
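Concretely, with the 14-bit index above, any two branch addresses that differ only in their upper bits collide in the same slot (the addresses here are made up purely for illustration):

```c
/* Both addresses end in the same low 14 bits (0x1A2C), so slot_index()
 * from the sketch above maps them to the same counter: code at one
 * address can train the prediction used by the other.                 */
uint64_t jit_branch    = 0x00007f3500401A2CULL;  /* hypothetical JITed branch    */
uint64_t secret_branch = 0x000055d12345DA2CULL;  /* hypothetical password branch */
/* slot_index(jit_branch) == slot_index(secret_branch) == 0x1A2C */
```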

However, there's a more general solution (proposed in https://security.stackexchange.com/questions/176678/is-branch-predictor-flush-instruction-a-complete-spectre-fix): store, in addition to the 2 counter bits, the remaining address bits as a tag. On a 32-bit architecture, that would be 32 - 14 = 18 tag bits; with a 48-bit address space on a 64-bit architecture, it would be 48 - 14 = 34 bits. This would make the per-slot state 20 or 36 bits instead of 2 bits, multiplying the predictor's size by 10 or 18.
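A sketch of that tagged variant, extending the predictor above (again my own naming; it assumes the 48-bit virtual address case, so a 34-bit tag, and falls back to a static not-taken prediction on a tag mismatch instead of reusing another branch's state):

```c
#include <stdint.h>

#define INDEX_BITS 14
#define NUM_SLOTS  (1u << INDEX_BITS)
#define ADDR_BITS  48                        /* assumed virtual address width  */
#define TAG_BITS   (ADDR_BITS - INDEX_BITS)  /* 48 - 14 = 34 tag bits per slot */

struct slot {
    uint64_t tag;      /* upper 34 bits of the owning branch's address */
    uint8_t  counter;  /* the usual 2-bit saturating counter           */
    uint8_t  valid;
};

static struct slot slots[NUM_SLOTS];

static inline uint32_t tagged_index(uint64_t addr) { return (uint32_t)(addr & (NUM_SLOTS - 1)); }
static inline uint64_t tagged_tag(uint64_t addr)   { return (addr >> INDEX_BITS) & ((1ull << TAG_BITS) - 1); }

/* Predict taken only if this slot was trained by the same branch;
 * on a tag mismatch, fall back to a static "not taken" prediction. */
static inline int tagged_predict(uint64_t addr)
{
    const struct slot *s = &slots[tagged_index(addr)];
    return s->valid && s->tag == tagged_tag(addr) && s->counter >= 2;
}

static inline void tagged_train(uint64_t addr, int taken)
{
    struct slot *s = &slots[tagged_index(addr)];
    if (!s->valid || s->tag != tagged_tag(addr)) {
        /* A different branch owned the slot: reclaim it, discarding its state. */
        s->valid   = 1;
        s->tag     = tagged_tag(addr);
        s->counter = taken ? 2 : 1;          /* restart in a weak state */
    } else if (taken && s->counter < 3) {
        s->counter++;
    } else if (!taken && s->counter > 0) {
        s->counter--;
    }
}
```

Per slot this is 34 + 2 = 36 bits of state, matching the factor-of-18 growth mentioned above.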

Is this kind of general solution realistic? It would certainly require more transistors in the branch predictor. But is the branch predictor already too large, or is it small enough that its size can easily be multiplied by the required factor? There could be other benefits as well, such as the possibility of making the branch predictor 2-way, 4-way or 8-way set associative, like CPU caches are these days. Perhaps the benefits of set associativity are great enough that this kind of more general solution could be deployed?

P.S. There really should be a CPU Microarchitecture Stack Exchange. I considered Information Security Stack Exchange, Stack Overflow, Computer Science Stack Exchange and Electronics Stack Exchange and decided to post here.

Best Answer

Die space is abundant. The L1 cache is (or used to be) located on the same die as the CPU, and it takes up a lot of space, so I think the roughly 20× expansion of the branch predictor could be afforded. (The L2 in my day, now L2 plus L3, was on a separate die attached via the backside bus.) But I believe it is still the case that they could trade off a little L1 for a large expansion of branch prediction and keep the die size the same. This doesn't even count the space for the ROB and the functional units attached to the reservation stations. I think it could be done. – jonk Jan 5 at 16:48