Is it realistic to expect a full Spectre fix in the branch predictors of future CPUs?

computer-architecture, microprocessor

Recently, it has been observed that two branches sharing the same branch predictor state, within one process or even across processes, allow certain side-channel exploits (Spectre).

Let's consider a basic branch predictor. It has 2^14 slots, and each slot consists of a two-bit saturating counter (strongly taken, weakly taken, weakly not taken, strongly not taken). The 14 index bits are the low-order bits of the branch instruction's address. Thus, the branch predictor holds 2*2^14 = 32768 bits, or 4 kilobytes.
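A minimal software sketch of this baseline predictor (the naming and layout are my own; real hardware stores the two counter bits directly rather than a byte per slot) might look like this:

```c
#include <stdint.h>

#define INDEX_BITS 14
#define NUM_SLOTS  (1u << INDEX_BITS)   /* 2^14 = 16384 slots */

/* One 2-bit saturating counter per slot:
 * 0 = strongly not taken, 1 = weakly not taken,
 * 2 = weakly taken,       3 = strongly taken.      */
static uint8_t counters[NUM_SLOTS];

/* Index = the low-order 14 bits of the branch instruction's address. */
static inline uint32_t slot_index(uint64_t branch_addr)
{
    return (uint32_t)(branch_addr & (NUM_SLOTS - 1));
}

static inline int predict_taken(uint64_t branch_addr)
{
    return counters[slot_index(branch_addr)] >= 2;
}

/* Train the counter toward the actual outcome, saturating at 0 and 3. */
static inline void train(uint64_t branch_addr, int taken)
{
    uint8_t *c = &counters[slot_index(branch_addr)];
    if (taken  && *c < 3) (*c)++;
    if (!taken && *c > 0) (*c)--;
}
```

Conceptually the state is 2 bits × 16384 slots = 32768 bits, i.e. the 4 kilobytes above.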

Now, there's a problem: when switching between user space processes, the branch predictor state is not flushed. This can be fixed by adding a branch predictor flush instruction that the operating system executes on every context switch.

There's another problem: two branches in the same process address space can share the same slot. If one of these branches is untrusted JITed code and the other is branching on one bit of a secret password, we're in trouble. This could presumably be fixed by flushing the branch predictor state before running untrusted JITed code. However, if the process flushes the branch predictor too often, its benefits are lost.
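Concretely, with the 14-bit index above, any two branch addresses that differ only in their upper bits collide in the same slot (the addresses here are made up purely for illustration):

```c
/* Both addresses end in the same low 14 bits (0x1A2C), so slot_index()
 * from the sketch above maps them to the same counter: code at one
 * address can train the prediction used by the other.                 */
uint64_t jit_branch    = 0x00007f3500401A2CULL;  /* hypothetical JITed branch    */
uint64_t secret_branch = 0x000055d12345DA2CULL;  /* hypothetical password branch */
/* slot_index(jit_branch) == slot_index(secret_branch) == 0x1A2C */
```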

However, there's a more general solution (proposed in https://security.stackexchange.com/questions/176678/is-branch-predictor-flush-instruction-a-complete-spectre-fix): store, in addition to the 2 counter bits, the remaining address bits as a tag. On a 32-bit architecture, that would be 32 - 14 = 18 tag bits; with a 48-bit address space on a 64-bit architecture, it would be 48 - 14 = 34 bits. This would make the per-slot state 20 or 36 bits instead of 2 bits, multiplying the predictor's size by 10 or 18.
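A sketch of that tagged variant, extending the predictor above (again my own naming; it assumes the 48-bit virtual address case, so a 34-bit tag, and falls back to a static not-taken prediction on a tag mismatch instead of reusing another branch's state):

```c
#include <stdint.h>

#define INDEX_BITS 14
#define NUM_SLOTS  (1u << INDEX_BITS)
#define ADDR_BITS  48                        /* assumed virtual address width  */
#define TAG_BITS   (ADDR_BITS - INDEX_BITS)  /* 48 - 14 = 34 tag bits per slot */

struct slot {
    uint64_t tag;      /* upper 34 bits of the owning branch's address */
    uint8_t  counter;  /* the usual 2-bit saturating counter           */
    uint8_t  valid;
};

static struct slot slots[NUM_SLOTS];

static inline uint32_t tagged_index(uint64_t addr) { return (uint32_t)(addr & (NUM_SLOTS - 1)); }
static inline uint64_t tagged_tag(uint64_t addr)   { return (addr >> INDEX_BITS) & ((1ull << TAG_BITS) - 1); }

/* Predict taken only if this slot was trained by the same branch;
 * on a tag mismatch, fall back to a static "not taken" prediction. */
static inline int tagged_predict(uint64_t addr)
{
    const struct slot *s = &slots[tagged_index(addr)];
    return s->valid && s->tag == tagged_tag(addr) && s->counter >= 2;
}

static inline void tagged_train(uint64_t addr, int taken)
{
    struct slot *s = &slots[tagged_index(addr)];
    if (!s->valid || s->tag != tagged_tag(addr)) {
        /* A different branch owned the slot: reclaim it, discarding its state. */
        s->valid   = 1;
        s->tag     = tagged_tag(addr);
        s->counter = taken ? 2 : 1;          /* restart in a weak state */
    } else if (taken && s->counter < 3) {
        s->counter++;
    } else if (!taken && s->counter > 0) {
        s->counter--;
    }
}
```

Per slot this is 34 + 2 = 36 bits of state, matching the factor-of-18 growth mentioned above.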

Is this kind of general solution realistic? It would certainly require more transistors in the branch predictor. But is the branch predictor already too large, or is it small enough that its size can easily be multiplied by the required factor? There could be other benefits as well, such as the possibility of making the branch predictor 2-way, 4-way or 8-way set associative, like CPU caches are these days. Perhaps the benefits of set associativity are great enough that this kind of more general solution could be deployed?

P.S. There really should be a CPU Microarchitecture Stack Exchange. I considered Information Security Stack Exchange, Stack Overflow, Computer Science Stack Exchange and Electronics Stack Exchange and decided to post here.

Best Answer

Die space is abundant. The L1 cache is (or used to be) located on the same die as the CPU, and it takes up a lot of space, so I think the roughly 20× expansion of the branch predictor could be afforded. (The L2 in my day, now L2 plus L3, was on a separate die attached via the backside bus.) But I believe it is still the case that they could trade off a little L1 for a large expansion of branch prediction and keep the die size the same. This doesn't even count the space for the ROB and the functional units attached to the reservation stations. I think it could be done. – jonk Jan 5 at 16:48