None of those options are particularly better or worse than the others, because they're all very insecure. I'm going with option 4.
SRAM is the most secure place to store keys, but you must never inject them from the outside world. They must ALWAYS be generated within the processor, during boot. Doing anything else instantly invalidates everything else: it's automatically insecure.
Don't store keys in nonvolatile memory; you are correct on this. It doesn't matter if you protect the EEPROM or flash memory from being read: that code read protection (CRP) fuse is easily reversed. An attacker need only decap the chip (remove or chemically etch away the black epoxy packaging to expose the silicon die inside). At that point, they can mask off the part of the die that holds the nonvolatile memory cells with a small piece of something opaque to UV (these sections are very regular, and while individual memory cells are much too small to be seen, the larger structure can be). Then the attacker just shines a UV light on the chip for 5-10 minutes and resets all the fuses, including the CRP fuse. The OTP memory can now be read by any standard programmer.
Or, if they're well funded (say, if getting those keys is worth more than $1000 to someone), they can just read the memory cells directly with any of several types of electron microscope.
To be secure, keys must be erased, not concealed.
- No, for the same reasons above.
Now, on to option 4:
- Just use encryption. Key distribution is a solved problem, so use that readily available solution. The chip should use its RNG (with care taken to ensure it has a sufficient supply of entropy), and the boot loader should boot directly into the program that generates the needed secret key(s). Those keys should be held in general-purpose registers and moved directly into SRAM, where they stay until erased.
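As a rough sketch of that key lifecycle (generate on-device, keep it only in RAM, erase it after use), here in Python rather than firmware C, with `secrets` standing in for the chip's hardware RNG; the names here are illustrative, not from any particular SDK:

```python
import secrets

def with_ephemeral_key(use):
    """Generate a key locally, hand it to `use`, then erase our copy."""
    key = bytearray(secrets.token_bytes(32))  # stand-in for the on-chip RNG
    try:
        return use(bytes(key))
    finally:
        # Keys must be erased, not concealed: overwrite before letting go.
        # (In firmware C this would be an explicit, non-optimized-out memset.)
        for i in range(len(key)):
            key[i] = 0
```

Note that a garbage-collected language can't truly guarantee no copies survive; on the microcontroller, the equivalent C code zeroizing a fixed SRAM buffer is what actually delivers the "erased, not concealed" property.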
There is a problem, however: nothing except the CPU has any idea what the secret key is. No problem: use public key cryptography. What you DO have stored in the OTP memory is your public key. This key can be read by anyone; you can post it on Stack Exchange, you can paint it on the side of an oil tanker in 5-foot-high letters, it doesn't matter. The wonderful thing about public key cryptography is that it is asymmetric: the key that encrypts something cannot decrypt it, which requires the private key, and the private key cannot feasibly be derived from the public key.

So the CPU generates the secret keys, uses your stored public key to ENCRYPT them, and simply sends the result out over USB or RS-232 or whatever you want. Reading the secret key requires your private key, which need not be stored on, sent to, or ever involved at all with the chip. Once you have decrypted the secret key with your private key (elsewhere, outside the chip), you're set. You have a securely transmitted secret key that was GENERATED entirely within the chip, without having to store anything except a public key, which, as stated earlier, need not be protected from being read at all.
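A toy illustration of that asymmetry, using textbook RSA with tiny primes (utterly insecure at this size; it only exists to show that the public, on-chip exponent encrypts while only the off-chip private exponent decrypts):

```python
# Textbook RSA with toy primes. Real keys use 2048+ bit moduli and padding;
# this just demonstrates the asymmetry. Requires Python 3.8+ for pow(e, -1, phi).
p, q = 61, 53
n = p * q                  # public modulus (3233)
phi = (p - 1) * (q - 1)    # 3120
e = 17                     # public exponent: store in OTP, let anyone read it
d = pow(e, -1, phi)        # private exponent: stays with you, never on the chip

secret_key = 42                 # pretend the CPU's RNG generated this in SRAM
cipher = pow(secret_key, e, n)  # chip encrypts with the PUBLIC key, sends it out
recovered = pow(cipher, d, n)   # you decrypt elsewhere with the PRIVATE key
```

Anyone sniffing the USB/RS-232 link sees only `cipher`; without `d`, it tells them nothing useful about `secret_key`.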
This process is formally called key negotiation, and everything uses it; you've used it several times today. There are many resources and libraries available to handle it. Please, do not ever 'inject' keys into anything.
One last thing to mention: all of this is moot, because the AES key can be recovered using side-channel attacks, which sit on the power supply and measure minute changes in current draw, and the timing between those changes, caused by bits flipping in the CPU's registers. This, combined with knowledge of how AES (or whichever of the very small set of plausible encryption algorithms is in use) works, makes it relatively easy and inexpensive to recover the key. It won't read the key out directly, but it can narrow the key space down to something ridiculously small, like 255 possible keys. The chip can't even detect the attack, since it sits upstream on the power supply.
This has defeated AES-256 encrypted boot loaders on 'secure' crypto processors, and it's not even that hard. As far as I know, there are no true hardware countermeasures to this attack. However, it is the encryption algorithms themselves, and how they require a CPU to flip bits, that cause this vulnerability. I suspect that side-channel-resistant or side-channel-proof algorithms will need to be developed (and hopefully are being).
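The software countermeasure family that does exist is constant-time (and, for power analysis, masked) implementations, which try to make the work the CPU does independent of the secret bits. A minimal Python sketch of the timing flavor of the idea; power analysis needs masking inside the cipher itself, which this does not show:

```python
import hmac

def leaky_compare(a: bytes, b: bytes) -> bool:
    # Returns at the first mismatch, so its run time leaks how many
    # leading bytes of an attacker's guess are correct.
    if len(a) != len(b):
        return False
    for x, y in zip(a, b):
        if x != y:
            return False
    return True

def safe_compare(a: bytes, b: bytes) -> bool:
    # hmac.compare_digest examines every byte regardless of where
    # mismatches occur, so run time does not depend on the secret.
    return hmac.compare_digest(a, b)
```

The same principle (no secret-dependent branches or memory accesses) is what side-channel-hardened cipher implementations apply throughout.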
So as it stands right now, the real answer to how to store a key (or even just use a temporary key) on an embedded device securely is: you can't.
But at least if you generate a new key every time using key negotiation as in option 4, then a side-channel attack can only compromise the key of an in-use channel, and only if the attacker can monitor the power supply for a while as the chip encrypts data. If you frequently negotiate new, internally generated keys, this can afford useful amounts of security.
Generate keys, and store them for as short a time as possible.
Overhead does not relate to preemption. Preemption stops your process and runs another process. If you disable it for, say, one CPU core, that core belongs to your process alone.
Still, there's overhead if you do I/O through the Linux I/O functions instead of, e.g., controlling the I/O lines directly from your process. But for any reasonably complicated I/O, you would have to implement a function with similar overhead yourself, and you can assume the Linux folks are better at that than you: more experience and more test cases.
So the only way you could win the overhead race is with very basic I/O (e.g. bit-banging an exotic, fast protocol) that you implement yourself in "your kernel" instead of running it through the various abstraction layers of the Linux GPIO subsystem, and do the same in "your userspace process".
Best Answer
Funny, I use both at work :)
The Cortex-M3 (we use STM32s) is a general purpose MCU that is fast and big (flash storage) enough for most complex embedded applications.
However, the R4 is a different beast entirely, at least the Texas Instruments version I use: the RM42, similar to the TMS570. The RM42 is a Cortex-R4 with two cores running in "lock-step" for redundancy, which means that one core runs 2 instructions ahead of the other; the pair is used for error checking and correction. Also, one of the cores is (physically) mirrored/flipped and turned 90 degrees to improve radiation/noise resilience :)
The RM42 runs at a higher clock speed than the STM32 (100 MHz vs. 72 MHz), has a slightly different instruction set, and performs some instructions faster than the M3 (e.g. division executes in one cycle on the R4; I'm not sure it does on the M3).
HW timers are VERY precise compared to the Cortex-M3's. Usually we need a static offset to correct for drift on the M3s; not so with the R4 :)
Where I'd call a Cortex-M3 a general purpose MCU, I'd call the Cortex-R4 a complex real-time/safety MCU. If I am not mistaken, the RM42 is SIL3-compliant...
IMO the R4 is a big step up in complexity even if you're not planning to actually use the real-time/safety features.
A really nice example of the complexity difference: The SPI peripheral has 9 control and status registers on the STM32 whereas the RM42 has 42. It's like this with all the peripherals :)
EDIT:
For what it's worth, in my use cases the Cortex-R4 @ 100MHz is usually 50-100% faster than the Cortex-M3 @ 72MHz when performing the exact same tasks. Maybe because the R4 has data and instruction caches?
Another comparison: a few thousand lines of C and ASM code are executed on reset before reaching the call to main(), and that's with just the subset of the safety features I currently use :D Not peripheral initialization or anything, just startup and self-test (CPU, RAM, flash ECC, etc.). This page has more details.
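For a flavor of what that startup self-test code does, here is a much-simplified RAM pattern test sketched in Python; the real thing runs in C/ASM on the physical SRAM before the C runtime is even up, and uses proper march algorithms (e.g. March C-) rather than this two-pattern toy:

```python
def ram_pattern_test(mem: bytearray) -> bool:
    """Write complementary bit patterns to every cell and read them back.
    Destructive: the buffer's prior contents are lost."""
    for pattern in (0x55, 0xAA):  # alternating patterns catch stuck-at bits
        for i in range(len(mem)):
            mem[i] = pattern
        for i in range(len(mem)):
            if mem[i] != pattern:
                return False  # a cell failed to hold the pattern
    return True
```

Because it is destructive, firmware runs tests like this before the RAM holds anything (i.e. before `main()`), which is exactly why so much code executes on reset.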