I haven't had much personal experience with RTOSes other than QNX, which is too large for PICs and MSP430s. (QNX is great on the whole, but it's not cheap, and I have had a really bad experience with a particular board vendor and QNX's we-don't-care attitude toward systems other than their most common ones.)
Where you will benefit from an RTOS is in areas such as
- thread management/scheduling
- inter-thread communications + synchronization
- I/O on systems with stdin/stdout/stderr or serial ports or ethernet support or a filesystem (not an MSP430 or PIC for the most part, except for the serial ports)
For the peripherals of a PIC or MSP430: for serial ports I'd use a ring buffer + interrupts, something I write once per system and just reuse. For other peripherals I don't think you'd find much support from an RTOS, as they are so vendor-specific.
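A minimal sketch of the ring buffer idea for a serial receive path, assuming a single producer (the receive interrupt) and a single consumer (the main loop). The class name `RingBuffer` and its methods are made up for illustration, and the vendor-specific part (reading the UART data register inside the ISR) is deliberately left out:

```cpp
#include <cstddef>
#include <cstdint>

// Byte ring buffer for one producer (ISR) and one consumer (main loop).
// N must be a power of two so wrapping is a cheap mask; usable capacity is N - 1.
// Indices are 8-bit so that reads and writes of them are single accesses
// even on an 8-bit micro (an assumption about the target, check yours).
template <std::size_t N>
class RingBuffer {
    static_assert(N >= 2 && N <= 256 && (N & (N - 1)) == 0,
                  "N must be a power of two between 2 and 256");
public:
    // Call from the receive interrupt. Returns false (byte dropped) when full.
    bool push(std::uint8_t byte) {
        const std::uint8_t next =
            static_cast<std::uint8_t>((head_ + 1u) & (N - 1));
        if (next == tail_) return false;
        data_[head_] = byte;
        head_ = next;              // publish only after the data is stored
        return true;
    }

    // Call from the main loop. Returns false when there is nothing to read.
    bool pop(std::uint8_t& byte) {
        if (tail_ == head_) return false;
        byte = data_[tail_];
        tail_ = static_cast<std::uint8_t>((tail_ + 1u) & (N - 1));
        return true;
    }

    bool empty() const { return head_ == tail_; }

private:
    volatile std::uint8_t head_ = 0;   // written only by the producer
    volatile std::uint8_t tail_ = 0;   // written only by the consumer
    std::uint8_t data_[N];
};
```

In use, the ISR would call `push()` with the byte it just read from the hardware, and the main loop would drain the buffer with `pop()` until it returns false.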
If you need timing that is rock-solid to the microsecond, an RTOS probably won't help. RTOSes have bounded timing, but typically do have timing jitter in their scheduling due to context-switching delays. QNX running on a PXA270 had jitter in the tens of microseconds typical, 100-200 µs maximum, so I wouldn't use it for stuff that has to run faster than about 100 Hz or which needs timing much more accurate than about 500 µs. For that kind of stuff you probably will have to implement your own interrupt handling. Some RTOSes will play nicely with that, and others will make it a royal pain: your timing and their timing may not be able to coexist well.
If the timing/scheduling is not too complex, you may be better off using a well-designed state machine. I would highly recommend reading Practical Statecharts in C/C++ if you haven't already. We've used this approach in some of our projects where I work, and it's got some real advantages over traditional state machines for managing complexity, which is really the only reason you need an RTOS.
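For a feel of the state machine approach, here is a minimal flat, event-driven sketch. It is not the hierarchical statechart framework from the book; the states, events, and the heater scenario are invented for illustration only:

```cpp
#include <cstdint>

// Hypothetical states and events for a small heater controller.
enum class State : std::uint8_t { Idle, Heating, Fault };
enum class Event : std::uint8_t { Start, Stop, OverTemp, Reset };

// Pure transition function: no hidden state, so it is easy to unit test.
// Events that make no sense in the current state are ignored.
inline State next_state(State s, Event e) {
    switch (s) {
    case State::Idle:
        return (e == Event::Start) ? State::Heating : s;
    case State::Heating:
        if (e == Event::Stop)     return State::Idle;
        if (e == Event::OverTemp) return State::Fault;
        return s;
    case State::Fault:
        // Only an explicit reset leaves the fault state.
        return (e == Event::Reset) ? State::Idle : s;
    }
    return s;
}
```

The main loop (or a timer tick) would pull events from a queue and call `next_state()` for each one, running any entry actions when the returned state differs from the current one.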
Yes, C++ is still useful in embedded systems. As everyone else has said, it still depends on the system itself; an 8-bit uC would probably be a no-no in my book, even though there is a compiler out there and some people do it (shudder). There's still an advantage to using C++ even when you scale it down to something like "C+", even in an 8-bit micro world. What do I mean by "C+"? I mean don't use new/delete, avoid exceptions, avoid virtual classes with inheritance, possibly avoid inheritance altogether, be very careful with templates, use inline functions instead of macros, and use const variables instead of #defines.
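To make the last two points of "C+" concrete, here is a small sketch comparing a function-like macro with an inline function, and a #define with a typed constant. All names here (`kTimerReload`, `read_sensor`, and so on) are invented for the example:

```cpp
#include <cstdint>

// Typed constant instead of:  #define TIMER_RELOAD 249
// The compiler knows its type and it respects scope.
const std::uint16_t kTimerReload = 249;

// A function-like macro pastes its argument text in twice...
#define SQUARE_MACRO(x) ((x) * (x))

// ...while an inline function evaluates its argument exactly once
// and type-checks it.
inline int square(int x) { return x * x; }

inline std::int16_t clamp_i16(std::int16_t x, std::int16_t lo, std::int16_t hi) {
    return x < lo ? lo : (x > hi ? hi : x);
}

// Stand-in for a read with a side effect (e.g. a register that clears on read).
int g_reads = 0;
int read_sensor() { ++g_reads; return 3; }
```

With these definitions, `SQUARE_MACRO(read_sensor())` calls `read_sensor()` twice while `square(read_sensor())` calls it once, which is exactly the kind of surprise the inline function avoids.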
I've been working in both C and C++ in embedded systems for well over a decade now, and some of my youthful enthusiasm for C++ has definitely worn off due to some real-world problems that shake one's naivete. I have seen the worst of C++ in embedded systems, which I would like to refer to as "CS programmers gone wild in an EE world." In fact, improving one such codebase is something I'm working on with my client right now.
The danger of C++ is that it's a very, very powerful tool, much like a two-edged sword that can cut both your arm and leg off if you are not educated and disciplined in its language and in general programming itself. C is more like a single-edged sword, but still just as sharp. With C++ it's too easy to reach very high levels of abstraction and create obfuscated interfaces that become meaningless in the long term, and that's partly due to C++'s flexibility in solving the same problem with many different language features (templates, OOP, procedural, RTTI, OOP+templates, overloading, inlining).
I finished two 4-hour seminars on Embedded Software in C++ by the C++ guru, Scott Meyers. He pointed out some things about templates that I had never considered before, and how much more they can help in creating safety-critical code. The gist of it is, you can't have dead code in software that has to meet stringent safety-critical code requirements. Templates can help you accomplish this, since the compiler only creates the code it needs when instantiating templates. However, one must become more thoroughly educated in their use to design correctly for this feature, which is harder to accomplish in C because linkers don't always remove dead code. He also demonstrated a feature of templates that could only be accomplished in C++ and would have kept the Mars Climate Orbiter from crashing had NASA implemented a similar system to protect units of measurement in the calculations.
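I don't have the seminar's code, so the following is only my own minimal sketch of the general idea: tag each quantity with its unit as a template parameter so that mixing units fails to compile. The type names are invented, and a real units library would track full dimensions rather than a single tag:

```cpp
#include <type_traits>

// Empty tag types: they exist only to make the two unit types distinct.
struct NewtonSeconds {};
struct PoundForceSeconds {};

template <typename Unit>
class Quantity {
public:
    explicit constexpr Quantity(double v) : value_(v) {}
    constexpr double value() const { return value_; }

    // Addition is only defined between quantities of the same unit.
    friend constexpr Quantity operator+(Quantity a, Quantity b) {
        return Quantity(a.value_ + b.value_);
    }

private:
    double value_;
};

using ImpulseNs   = Quantity<NewtonSeconds>;
using ImpulseLbfs = Quantity<PoundForceSeconds>;

// The only way across units is an explicit, named conversion.
// 1 lbf = 4.4482216152605 N by definition.
constexpr ImpulseNs to_newton_seconds(ImpulseLbfs i) {
    return ImpulseNs(i.value() * 4.4482216152605);
}

// Passing lbf*s where N*s is expected is a compile error, not a runtime bug.
static_assert(!std::is_convertible<ImpulseLbfs, ImpulseNs>::value,
              "units must not convert implicitly");
```

Because the tags are empty and everything is inlined, a `Quantity` should compile down to a plain `double`, so the checking is free at run time on any reasonable optimizing compiler.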
Scott Meyers is a very big proponent of templates and judicious use of inlining, and I must say I'm still skeptical about being gung ho about templates. I tend to shy away from them, even though he says they should only be applied where they are the best tool. He also makes the point that C++ gives you the tools to make really good interfaces that are easy to use right and hard to use wrong. Again, that's the hard part. You must reach a level of mastery in C++ before you can know how to apply these features in the most efficient way to arrive at the best design solution.
The same goes for OOP. In the embedded world, you must familiarize yourself with what kind of code the compiler is going to spit out to know if you can handle the run-time costs of run-time polymorphism. You need to be willing to make measurements as well to prove your design is going to meet your deadline requirements. Is that new InterruptManager class going to make my interrupt latency too long? There are other forms of polymorphism that may fit your problem better, such as link-time polymorphism, which C can do as well and which C++ can do through the Pimpl design pattern (opaque pointer).
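A rough sketch of link-time polymorphism with an opaque type, squeezed into one listing for brevity. In a real project the two marked sections would live in separate files, and the build would link exactly one implementation file (simulator, board A, board B, ...). The `Led` names are hypothetical:

```cpp
// ---- led.h: everything callers are allowed to see ----
class Led;                              // opaque: layout is hidden from callers
Led& board_led();
void led_set(Led& led, bool on);
bool led_is_on(const Led& led);

// ---- led_sim.cpp: one implementation, selected by what gets linked ----
class Led {
public:
    bool on = false;                    // a hardware version would hold a port/pin here
};

static Led g_led;                       // statically allocated, no new/delete

Led& board_led() { return g_led; }
void led_set(Led& led, bool on) { led.on = on; }
bool led_is_on(const Led& led) { return led.on; }
```

Calls through this interface are ordinary direct calls, so there is no vtable lookup; the price is that the choice of implementation is fixed when the image is built.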
I say all that to say that C++ has its place in the embedded world. You can hate it all you want, but it's not going away. It can be written in a very efficient manner, but it's harder to learn how to do it correctly than with C. It can sometimes work better than C at solving a problem and sometimes at expressing a better interface, but again, you've got to educate yourself and not be afraid to learn how.
There are far too many degrees-of-freedom to understand "all" the possible faults. There are, however, techniques to identify and mitigate faults early in the design cycle (i.e. before wide release).
Design-time activities (pre-hardware)
Peer review is always a great way to find bugs. Have someone else analyze your design, and be prepared to defend against their questions (or acknowledge that they found a bug, and fix it!) There's no substitute for scrutiny, and fresh eyes often see things that are missed by tired ones. This works for both hardware and software - schematics can be reviewed just as easily as source code.
For the hardware, as others have said, a DFMEA (Design Failure Mode and Effects Analysis) is a good recommendation. For each component, ask yourself "what happens if this shorts out" and "what happens if this goes open-circuit", and make a record of your analysis. For ICs, also imagine what happens if adjacent pins are shorted to each other (solder bridges, etc.)
For the firmware, static code analysis tools (lint, MISRA checkers, etc.) can be used to reveal hidden bugs in the code. Things like dangling or uninitialized pointers and assignment-instead-of-compare (= vs ==) are common 'oopsies' that these tools are good at catching.
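As a tiny illustration of the = vs == class of bug, with invented names; the buggy form is kept in a comment so the listing itself stays correct:

```cpp
#include <cstdint>

const std::uint8_t kStatusFault = 0x01;   // hypothetical status code

// Buggy form that lint-style tools flag (GCC and Clang also warn with -Wall):
//
//     if (status = kStatusFault) { ... }
//
// It assigns kStatusFault to status and then tests the nonzero result,
// so the branch is always taken.
inline bool is_fault(std::uint8_t status) {
    return status == kStatusFault;        // the intended comparison
}
```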
A written theory of operation is also very helpful, for both hardware and software. A theory of operation should describe at a fairly high level how the system works, how the protections work, sequencing, etc. Simply putting into words how the logic should flow often leads to one realizing that some cases may have been missed ("Um, waitasec, what about this condition?")
Prototype level testing
Once you get hardware in hand, it's time to get to "work".
After all of the theoretical analysis is done, it is crucial to accurately characterize how the device operates within spec. This is commonly referred to as validation testing or qualification. All of the allowable extremes need to be tested.
Another important qualification activity is component stress analysis. Every part is evaluated against its maximum voltage/current/temperature, in a defined operating condition. In order to ensure robustness, an appropriate derating guideline should be applied (don't exceed 80% of voltage, 70% of power, etc.)
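The derating check itself is simple arithmetic, and it is easy to script over a whole bill of materials. A minimal sketch, with made-up function names and example numbers (the 80% figure is just the voltage example from above, not a rule for every part):

```cpp
// Fraction of the absolute maximum rating that the part actually sees.
inline double stress_ratio(double applied, double rated_max) {
    return applied / rated_max;
}

// True when the applied stress stays inside the derated limit,
// e.g. derating_factor = 0.80 means "use at most 80% of the rating".
inline bool within_derating(double applied, double rated_max,
                            double derating_factor) {
    return applied <= rated_max * derating_factor;
}
```

For example, a capacitor rated for 50 V with an 80% voltage guideline has a 40 V limit: `within_derating(35.0, 50.0, 0.80)` passes, while `within_derating(45.0, 50.0, 0.80)` fails and the part should be reviewed.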
Only once you know how things are under normal conditions can you start to speculate about external abnormals, or multiple abnormals like you're describing. Again, the DFMEA model (what happens if X happens) is a good approach. Think of any possible thing a user could do to the unit - short outputs, tie signals together, spill water on it - try them, and see what happens.
A HALT test (highly accelerated life test) is also useful for these types of systems. The unit is put into an environmental chamber and exercised from minimum to maximum temperature, minimum and maximum input and output, with vibration. This will find all sorts of issues, both electrical and mechanical.
This is also a good time to do some embedded fuzz testing - exercise all of the inputs well beyond their expected ranges, send gibberish in through UARTs / I2C, etc. to find holes in the logic. (Bit-banged I2C routines are notorious for locking up the bus, for instance.)
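A sketch of what a small fuzz harness can look like when run on a PC against the firmware's parsing code. Everything here is an assumption for illustration: the frame format (0x7E, length, payload, 8-bit sum), the `FrameParser` class, and the simple pseudo-random generator used so that a failing seed can be replayed:

```cpp
#include <cstddef>
#include <cstdint>

const std::size_t kMaxPayload = 16;

// Hypothetical byte-at-a-time parser: 0x7E, length, payload bytes, checksum.
class FrameParser {
public:
    // Feed one received byte; returns true when a complete valid frame arrived.
    bool feed(std::uint8_t byte) {
        switch (state_) {
        case State::WaitStart:
            if (byte == 0x7E) state_ = State::WaitLength;
            return false;
        case State::WaitLength:
            if (byte > kMaxPayload) {          // reject oversize frames up front
                state_ = State::WaitStart;
                return false;
            }
            length_ = byte;
            count_ = 0;
            sum_ = 0;
            state_ = (length_ == 0) ? State::WaitChecksum : State::ReadPayload;
            return false;
        case State::ReadPayload:
            payload_[count_++] = byte;         // count_ < length_ <= kMaxPayload here
            sum_ = static_cast<std::uint8_t>(sum_ + byte);
            if (count_ == length_) state_ = State::WaitChecksum;
            return false;
        case State::WaitChecksum:
            state_ = State::WaitStart;
            return byte == sum_;
        }
        return false;
    }

    std::size_t length() const { return length_; }

private:
    enum class State : std::uint8_t { WaitStart, WaitLength, ReadPayload, WaitChecksum };
    State state_ = State::WaitStart;
    std::uint8_t payload_[kMaxPayload];
    std::size_t length_ = 0;
    std::size_t count_ = 0;
    std::uint8_t sum_ = 0;
};

// Deterministic pseudo-random bytes (a plain LCG), so a bad seed is repeatable.
inline std::uint8_t next_fuzz_byte(std::uint32_t& seed) {
    seed = seed * 1664525u + 1013904223u;
    return static_cast<std::uint8_t>(seed >> 24);
}

// Feeds `count` garbage bytes; returns false if an invariant is ever violated.
inline bool fuzz_parser(std::uint32_t seed, std::size_t count) {
    FrameParser parser;
    for (std::size_t i = 0; i < count; ++i) {
        parser.feed(next_fuzz_byte(seed));
        if (parser.length() > kMaxPayload) return false;
    }
    return true;
}
```

The invariant checked here is deliberately simple. Running the same harness with the compiler's address sanitizer enabled on the host would also catch out-of-bounds writes that a plain length check cannot see.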
Strife testing is a good way to demonstrate robustness. Disable any protection features like overtemperature, overload, etc. and apply stress until something breaks. Take the unit up as high in temperature as it can go until something fails or some erratic behaviour occurs. Overload the unit until the powertrain fails. If some parameter fails only slightly above worst-case conditions, it's an indication of marginality and some design considerations may have to be revisited.
You can also take the next-level approach and physically test some of your DFMEA conclusions - actually do the shorts and opens and pin-shorts and see what blows up.
Further reading
My background is in power conversion. We have an industry standard called IPC-9592A, which is an effort to standardize how products should be qualified: which tests should be run and how they should be done. Many of the types of tests and methodologies referred to by this document could easily be used in other electrical disciplines.