Electronic – Silicon bugs, errata sheets


Many (most? all?) of the microcontrollers I have used over the past years have had some silicon-level bugs, and the manufacturers provide engineers with errata sheets describing the unexpected behaviour they may encounter.

Why don't they ever fix these "bugs"? Since the product is still in production, and in most cases fixing the problem won't affect existing implementations, why don't they just revise it? In many cases the product has stabilized, most bugs have been found, and a significant part of its product lifetime is still ahead of it.

Is it that difficult technically? Or that expensive?

Best Answer

Critical bugs do get fixed. Usually they're fixed before the product enters production. Unless you're using early samples, you might never see the worst bugs.

Fixing bugs is difficult and expensive. It's not just changing one line of RTL code. If you did that, you'd have to resynthesize, redo the physical layout, tweak the layout to fix any timing problems, buy a whole new mask set, produce new wafers, test the wafers (normally), validate the new fixes, and possibly characterize or qualify the product again. This takes months and costs a distressing amount of money. For that reason, we try to fix bugs directly in the layout (preferably on a single metal layer). This is faster and cheaper than starting over from RTL synthesis, but it's still not good.

If we're fixing a critical bug anyway, why not fix all the other bugs too? Again, this takes time -- time to figure out and implement a fix, time to rerun the design verification tests. That time means it will take longer to get the next product to market. And in the meantime, you'll almost certainly find more bugs in your current product if you look hard enough. It's a losing battle. Fixing bugs is even harder on a product that's been out for a long time, since people have to dive into the old design to figure out what's going on. As Null says, customers may have to requalify your product in their system. If your product is still in development, delaying the production release may cause customer schedules to slip, which makes customers very unhappy.

Normally, the bugs that get left in only happen in weird configurations, cause very minor problems, have easy workarounds, or all of the above. They're just not bad enough to be worth the trouble. And if you reuse a hardware module on the next product, your existing customers will already have the workaround in their software anyway.
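To make "easy workaround" concrete, here is a minimal sketch in C of what an erratum workaround often looks like in customer firmware. Everything in it is hypothetical and invented for illustration: the revision names, the `fake_timer` struct that stands in for a real peripheral, and the erratum itself (on revision A, the first read of a counter after wake-up returns a stale zero). It is not taken from any real errata sheet.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical silicon revisions. A real part would typically report
 * its revision in a device-ID register. */
enum silicon_rev { REV_A = 0, REV_B = 1 };

/* Simulated peripheral (assumption: stands in for memory-mapped registers). */
struct fake_timer {
    enum silicon_rev rev;
    uint32_t counter; /* the value the hardware should return */
    bool stale;       /* models the invented erratum: next read is stale */
};

/* Raw register read, including the simulated bug: on REV_A, the first
 * read after 'stale' is set returns 0 instead of the counter. */
static uint32_t raw_read_counter(struct fake_timer *t)
{
    if (t->rev == REV_A && t->stale) {
        t->stale = false;
        return 0u;
    }
    return t->counter;
}

/* Driver-level read with the workaround: on affected silicon, issue one
 * dummy read and discard it, then return the second read. */
static uint32_t read_counter(struct fake_timer *t)
{
    if (t->rev == REV_A) {
        (void)raw_read_counter(t); /* dummy read flushes the stale value */
    }
    return raw_read_counter(t);
}
```

In this sketch, calling `read_counter` on a revision A part hides the stale read, and on revision B it behaves like a plain read. The dummy read would also be harmless on fixed silicon, which is why a workaround like this can simply stay in the code when the same module is reused on a later part.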

Software toolchains are another factor. If a module sticks around long enough, your toolchain might change enough that redoing the old validation tests becomes a major project in itself. And you probably can't just load up the old tools, because you're not paying for the site license anymore. But as long as you don't change the module, you can keep copying and pasting it into new MCUs.

Software is also an issue on the customer side. If your bugfix breaks backwards compatibility in any way, all of your customers will have to update their code, which they may not even have the tools for anymore.

As someone who works in microcontroller development, I can tell you that we would all love to fix every bug. But trying to do so would delay development unpredictably, annoy customers, cost a ton of money, and at the end of it all, we'd still probably fail.