I've used both PICkit 3s and ICD 3s. Never had a problem so far with the PICkits, but have fried a couple of ICD 3s.
The ICD 3s are of course more expensive (and much faster). The good thing, though, is that the ICD 3s come with a lifetime warranty: they include a little test board so you can verify whether the problem is in the ICD 3 or in your circuit. If the test results in an error message, you can send the ICD 3 in and they will replace it free of charge. I have done this twice in the last year and a half, no questions asked.
Funny, I use both at work :)
The Cortex-M3 (we use STM32s) is a general-purpose MCU that is fast enough, and has enough flash storage, for most complex embedded applications.
However, the R4 is a different beast entirely - at least the Texas Instruments version I use: the RM42, similar to the TMS570. The RM42 is a Cortex-R4 with two cores running in "lock-step" for redundancy: both cores execute the same instruction stream, with one delayed by a couple of cycles behind the other, and their outputs are compared to detect and flag faults.
Also, one of the cores is (physically) mirrored/flipped and rotated 90 degrees to improve radiation/noise resilience :)
The RM42 runs at a higher clock speed than the STM32 (100MHz vs 72MHz), has a slightly different instruction set, and executes some instructions faster than the M3 (e.g. division seems to take fewer cycles on the R4 - not sure of the exact cycle counts on the M3).
The HW timers are VERY precise compared to the Cortex-M3's. Usually we need a static offset to correct for drift on the M3s - not so with the R4 :)
Where I'd call a Cortex-M3 a general purpose MCU, I'd call the Cortex-R4 a complex real-time/safety MCU. If I am not mistaken, the RM42 is SIL3-compliant...
IMO the R4 is a big step up in complexity even if you're not planning to actually use the real-time/safety features.
A really nice example of the complexity difference: The SPI peripheral has 9 control and status registers on the STM32 whereas the RM42 has 42. It's like this with all the peripherals :)
EDIT:
For what it's worth, in my use cases the Cortex-R4 @ 100MHz is usually 50-100% faster than the Cortex-M3 @ 72MHz when performing the exact same tasks. Maybe because the R4 has data and instruction caches?
Another comparison: a few thousand lines of C and ASM code are executed on reset before the call to main() is even reached - and that's with only the subset of the safety features I currently use :D No peripheral initialization or anything, just startup and self-test (CPU, RAM, flash ECC, etc.).
This page has more details
Best Answer
This is pretty standard in software engineering as a whole - when you optimize code, the compiler is allowed to re-arrange things pretty much however it wants, as long as you can't tell any difference in operation. So, for instance, if you initialize a variable inside every iteration of a loop, and never change the variable inside the loop, the optimizer is allowed to move that initialization out of the loop, so that you're not wasting time with it.
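A minimal C sketch of that loop-invariant case (the names here are made up for illustration):

```c
/* `scale` is re-initialized on every iteration but never changes inside
 * the loop, so an optimizer may hoist the assignment above the loop.
 * The observable result is identical either way. */
int sum_scaled(const int *data, int n)
{
    int total = 0;
    for (int i = 0; i < n; i++) {
        int scale = 3;            /* loop-invariant: candidate for hoisting */
        total += data[i] * scale;
    }
    return total;
}
```

Whether the assignment happens once or n times, the function returns the same value, which is exactly why the compiler is allowed to move it.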
It might also realize that you compute a number which you then don't do anything with before over-writing. In that case, it might eliminate the useless computation.
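A sketch of that dead-store case (again, hypothetical names):

```c
/* `tmp` is computed and then overwritten before it is ever read, so an
 * optimizer may delete the first (useless) multiply entirely. */
int next_value(int x)
{
    int tmp = x * x;  /* dead store: this result is never used */
    tmp = x + 1;      /* only this assignment affects the return value */
    return tmp;
}
```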
The problem with optimization is that you may want to put a breakpoint on some piece of code which the optimizer has moved or eliminated. In that case, the debugger can't do what you want (generally, it will put the breakpoint somewhere close). So, to make the generated code more closely resemble what you wrote, you turn off optimizations during debugging; this ensures that the code you want to break on is really there.
You need to be careful with this, however, because depending on your code, turning optimization back on can break things! In general, code that is broken by a correctly functioning optimizer is really just buggy code that was getting away with something, so you usually want to figure out why the optimizer breaks it rather than just ship the unoptimized build.
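A classic instance of "buggy code getting away with something" is polling a flag set by an interrupt handler without declaring it volatile. A minimal sketch (the ISR name and setup here are invented for illustration):

```c
/* Without `volatile`, the optimizer is allowed to read `ready` once,
 * cache it in a register, and compile the wait loop into an infinite
 * spin - such code only "worked" at -O0 by accident. With `volatile`,
 * every loop iteration re-reads the variable from memory. */
static volatile int ready = 0;

/* Pretend this is called from a UART interrupt handler. */
void uart_isr(void)
{
    ready = 1;
}

/* Busy-wait until the ISR signals completion. */
void wait_for_ready(void)
{
    while (!ready) {
        /* spin */
    }
}
```

The proper fix is not to leave optimization off, but to tell the compiler (via `volatile`) that the variable can change outside the normal flow of the program.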