I'm assuming that you're using the same debugging process I would use in this case -- one of the first instructions in the interrupt routine turns on an LED, and one of the last instructions in the interrupt routine turns that LED off. Then you used a dual-trace oscilloscope with one probe clipped to the appropriate pin to watch the bytes going into the UART, and the other probe clipped to the pin driving the LED.

I'm assuming your UART-handler interrupt routine ends with the return-from-interrupt instruction (rather than the return-from-subroutine instruction used by normal subroutine calls).
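For concreteness, here's a minimal sketch of that instrumentation in C. The register names (LED_PORT, UART_DATA) and buffer details are made-up placeholders -- substitute your MCU's actual definitions and your toolchain's ISR syntax:

    #include <stdint.h>

    /* Hypothetical memory-mapped registers -- replace with the real GPIO
       and UART definitions from your MCU's header. */
    extern volatile uint8_t LED_PORT;    /* GPIO output register for the LED */
    extern volatile uint8_t UART_DATA;   /* UART receive data register       */
    #define DEBUG_LED 0x01

    static volatile uint8_t rx_buffer[64];
    static volatile uint8_t rx_head;

    void uart_rx_isr(void)   /* attach to the UART receive vector with your
                                toolchain's ISR syntax, so it ends with a
                                return-from-interrupt instruction */
    {
        LED_PORT |= DEBUG_LED;                    /* first: LED on  */
        rx_buffer[rx_head++ & 63u] = UART_DATA;   /* the real work  */
        LED_PORT &= ~DEBUG_LED;                   /* last: LED off  */
    }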
There are 4 things that can cause a long latency between the end of the last byte of a message and the start of the UART handler:

- Some previous byte in the message triggering the UART handler, and somehow it takes a long time before interrupts are re-enabled. Some people structure their interrupt routines so that after the UART handler finishes storing a byte in the appropriate buffer, it checks a bunch of other stuff before executing the return-from-interrupt instruction -- this increases jitter and latency, but sometimes those people do it anyway because it improves throughput.
- Some other interrupt taking a long time to execute before it re-enables interrupts by executing the return-from-interrupt instruction. (If you can make each and every interrupt turn on and off some other LED, it's pretty easy to see on the o'scope whether this is the problem, or to rule it out.)
- Some non-interrupt code "temporarily" turning off interrupts. (This increases jitter and latency, but people do it anyway, because it's often the easiest way to prevent data corruption when an interrupt and a main-loop background task both work with the same piece of data.) (If you can make every bit of code that does this turn on and off some other LED -- see the sketch after this list -- it's pretty easy to see on the o'scope whether this is the problem, or to rule it out.)
- Instructions that take a long time to execute.
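For the third item, here's a minimal sketch of the same LED trick wrapped around a critical section. Again the names are hypothetical -- LED_PORT as before, and disable_interrupts()/enable_interrupts() standing in for your MCU's intrinsics (e.g. cli()/sei() on AVR):

    #include <stdint.h>

    extern volatile uint8_t LED_PORT;         /* same hypothetical GPIO register */
    #define CRIT_LED 0x02                     /* a second debug LED              */

    extern void disable_interrupts(void);     /* map to your MCU's intrinsics    */
    extern void enable_interrupts(void);

    extern volatile uint16_t shared_counter;  /* data shared with an ISR         */

    void bump_shared_counter(void)
    {
        LED_PORT |= CRIT_LED;        /* LED on: interrupts about to go off */
        disable_interrupts();

        shared_counter++;            /* read-modify-write must be atomic   */

        enable_interrupts();
        LED_PORT &= ~CRIT_LED;       /* LED off: interrupts running again  */
    }

Any pulse on that LED lasting more than a bit time or two at your baud rate is a latency suspect.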
The traditional way to figure out exactly what is causing the problem is to save the current version of your code (you're using TortoiseHg or some other version control system, right?), and then deliberately hack and slash at a temporary copy of your code, stubbing out and completely removing code a few subroutines at a time, re-testing after each round of deletions, until you have a tiny -- yet technically "complete" and runnable -- program that exhibits the same problem.

Far too often people show us bits and pieces of a complete program -- the parts those people think are relevant -- and we can't help them because one of the pieces they omitted is causing the problem. The process of reducing a program to a small test case is a very useful skill, because often while going through that process, you quickly discover what the real problem is.

Once you have such a tiny -- yet runnable -- program, please post it here. If you figure out what the problem is during that process, please tell us that as well, so the rest of us can avoid that problem.
There are several potential sources for noise in any circuit. Some of the most common include:
- Poorly regulated power supplies;
- Switching power supplies;
- Insufficient capacitive decoupling of the power rails near the MCU;
- Inductive coupling from nearby electromagnetic sources (including 50 or 60 Hz hum from the mains power; even if the circuit is battery powered, it will experience this interference when close enough to a mains source);
- RF sources near the resonant frequency of a trace on the circuit board, or one of its harmonics;
- Routing of high-current traces on the circuit board near signal lines.
In addition (as @jippie mentioned), clock skew is a very common cause of errors in any type of serial communication that uses a predetermined data rate. If you're using an external crystal and interfacing to another system that can reasonably be expected to be accurate, it's less likely to cause problems. Internal oscillators, however, can have tolerances several orders of magnitude worse than crystals (tens of ppm for a typical crystal, versus 1% or more for a factory-calibrated internal RC oscillator), and tend to vary more over temperature.
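To put rough numbers on it: with 8N1 framing the receiver samples the last data bit about 9.5 bit times after the start edge, so the combined clock error of both ends has to stay under roughly ±5%, and a common rule of thumb is about ±2% per end. Divisor rounding eats into that budget before oscillator tolerance even enters. Here's a back-of-the-envelope check, assuming the common divisor = clock / (16 * baud) formula (check your UART's datasheet for its actual one):

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        double f_clk   = 8000000.0;    /* e.g. an 8 MHz internal RC oscillator */
        double desired = 115200.0;

        long   divisor = lround(f_clk / (16.0 * desired));
        double actual  = f_clk / (16.0 * divisor);
        double err_pct = 100.0 * (actual - desired) / desired;

        printf("divisor=%ld  actual=%.0f baud  error=%+.2f%%\n",
               divisor, actual, err_pct);
        /* Prints: divisor=4  actual=125000 baud  error=+8.51%
           -- broken framing from rounding alone, before any drift. */
        return 0;
    }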
There are several simple tests that can be performed on a running system to determine the basic noise (and skew) immunity of your interface, including:
- Freezing (cool the circuit to the minimum rating of its components);
- Baking (heat to the maximum rating);
- Exposure to EMI:
  - Set the board on top of the power cord of a running space heater;
  - Key a CB radio in the near vicinity of the board;
  - Put the board next to your wireless router;
- Use long hookup wire (instead of a properly constructed serial cable) for the UART connection.
There are many others -- in fact, there are large testing labs dedicated to EMC qualification.
In general, unless some minimal level of data loss is acceptable, it is always prudent to include some sort of error checking in your communications code. Even a simple checksum is better than nothing.
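As an illustration, here's a minimal additive-checksum sketch in C (a CRC catches more error patterns, but even this catches most single-byte corruption). The function names are my own:

    #include <stdint.h>
    #include <stddef.h>

    /* Sender appends the two's complement of the byte sum, so summing
       payload + checksum on the receive side yields 0 for a good packet. */
    uint8_t checksum8(const uint8_t *buf, size_t len)
    {
        uint8_t sum = 0;
        while (len--)
            sum += *buf++;
        return (uint8_t)(0u - sum);
    }

    /* Verify a received packet: payload plus the appended checksum byte. */
    int packet_ok(const uint8_t *buf, size_t len_with_checksum)
    {
        uint8_t sum = 0;
        for (size_t i = 0; i < len_with_checksum; i++)
            sum += buf[i];
        return sum == 0;
    }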
To actually answer your question: I usually discard anything received with an error. That may include re-initializing the UART hardware, depending on which error it is and the details of the UART hardware.
The only exception is if you want to deliberately receive breaks. Those show up as framing errors, so in that case you pass framing errors up to the higher levels as special conditions. However, that requires out-of-band information to be passed to the higher levels, and therefore the UART receiver interface can't be seen as something quite as simple as a stream of bytes. I think I've done this exactly once in many microcontroller projects, because it had to be compatible with an old system where breaks were used deliberately.
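A minimal sketch of that policy at the driver level, with hypothetical register and flag names (UART_STATUS, FRAMING_ERR, and so on) standing in for your UART's actual ones:

    #include <stdint.h>

    /* Hypothetical UART registers and error flags -- replace with the
       definitions from your MCU's header. */
    extern volatile uint8_t UART_STATUS;
    extern volatile uint8_t UART_DATA;
    #define FRAMING_ERR 0x04
    #define OVERRUN_ERR 0x02
    #define PARITY_ERR  0x01

    extern void buffer_put(uint8_t b);   /* push into the receive FIFO         */
    extern void signal_break(void);      /* out-of-band "break" to upper level */
    extern void uart_reinit(void);       /* recover the hardware if needed     */

    void uart_rx_handler(void)
    {
        uint8_t status = UART_STATUS;    /* read status before the data     */
        uint8_t data   = UART_DATA;      /* reading usually clears flags    */

        if (status & FRAMING_ERR) {
            if (data == 0x00)            /* all-zero byte + framing error   */
                signal_break();          /* is how a break typically shows  */
            return;                      /* either way, discard the byte    */
        }
        if (status & (OVERRUN_ERR | PARITY_ERR)) {
            uart_reinit();               /* discard, and recover if needed  */
            return;
        }
        buffer_put(data);                /* clean byte: pass it up          */
    }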
Steven has given you some good ideas about what to do at the higher levels. When you think there is a real chance of errors and data integrity is important, you usually encapsulate chunks of data into packets with checksums. The receiver sends an ACK for every packet received with a valid checksum.
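The receive side of that scheme might look like the sketch below, reusing the packet_ok() checksum verifier sketched above (ACK/NAK here are just the ASCII control codes; any agreed-upon bytes work):

    #include <stdint.h>
    #include <stddef.h>

    #define ACK 0x06
    #define NAK 0x15

    extern int  packet_ok(const uint8_t *buf, size_t len_with_checksum);
    extern void uart_send(uint8_t b);                 /* your transmit routine */
    extern void deliver(const uint8_t *payload, size_t len);

    /* Called once a whole packet (payload + checksum byte) is assembled. */
    void handle_packet(const uint8_t *pkt, size_t len)
    {
        if (packet_ok(pkt, len)) {
            uart_send(ACK);               /* good packet: acknowledge it   */
            deliver(pkt, len - 1u);       /* pass payload up, minus sum    */
        } else {
            uart_send(NAK);               /* bad checksum: request resend  */
        }
    }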
However, the vast majority of the time UART errors are so unlikely, and the data not so critical, that you can just ignore them at the high level. The kinds of errors the UART hardware can catch are usually due to operator stupidity, not line noise. More likely, noise will cause bad data, which the UART won't detect. So the low-level UART driver throws out anything immediately associated with a UART error, but otherwise continues to pass the stream of received bytes up to the next level. In fact, it does this even if you are using packets and checksums, since those are handled at a higher level than where individual bytes are received.