As others have mentioned, you should consider an IIR (infinite impulse response) filter rather than the FIR (finite impulse response) filter you are using now. There is more to it, but at first glance FIR filters are implemented as explicit convolutions, and IIR filters with recursive equations.
The particular IIR filter I use a lot in microcontrollers is a single pole low pass filter. This is the digital equivalent of a simple R-C analog filter. For most applications, these will have better characteristics than the box filter that you are using. Most uses of a box filter that I have encountered are a result of someone not paying attention in digital signal processing class, not as a result of needing their particular characteristics. If you just want to attenuate high frequencies that you know are noise, a single pole low pass filter is better. The best way to implement one digitally in a microcontroller is usually:
FILT <-- FILT + FF(NEW - FILT)
FILT is a piece of persistent state. This is the only persistent variable you need to compute this filter. NEW is the new value that the filter is being updated with this iteration. FF is the filter fraction, which adjusts the "heaviness" of the filter. Look at this algorithm and see that for FF = 0 the filter is infinitely heavy, since the output never changes. For FF = 1, it's really no filter at all, since the output just follows the input. Useful values are in between. On small systems you pick FF to be 1/2^N so that the multiply by FF can be accomplished as a right shift of N bits. For example, FF might be 1/16 and the multiply by FF therefore a right shift of 4 bits. Other than the shift, this filter needs only one subtract and one add, although the numbers usually need to be wider than the input value (more on numerical precision in a separate section below).
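For concreteness, here is a minimal C sketch of one pole of this filter. The FILT_BITS value of 4 (giving FF = 1/16) is an arbitrary choice for illustration:

```c
#include <stdint.h>

#define FILT_BITS 4  // FF = 1/2^4 = 1/16; illustrative choice

static int32_t filt;  // the single piece of persistent filter state

// One low pass filter pole: FILT <-- FILT + FF*(NEW - FILT), with the
// multiply by FF done as a right shift of FILT_BITS bits.
// Assumes arithmetic right shift of negative values, which virtually
// all compilers provide.
int32_t filter_update(int32_t new_val) {
    filt += (new_val - filt) >> FILT_BITS;
    return filt;
}
```

Starting from a state of 0 and feeding a constant input, the output creeps toward the input a fraction FF of the remaining distance per iteration.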
I usually take A/D readings significantly faster than they are needed and apply two of these filters cascaded. This is the digital equivalent of two R-C filters in series, and attenuates by 12 dB/octave above the rolloff frequency. However, for A/D readings it's usually more relevant to look at the filter in the time domain by considering its step response. This tells you how fast your system will see a change when the thing you are measuring changes.
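The cascade is just the same update applied once per pole, with each pole's output becoming the next pole's input. A sketch with two poles of 4 shift bits each (matching the "PLOTFILT 4 4" case):

```c
#include <stdint.h>

#define FILT_BITS 4  // shift bits per pole; illustrative
#define NPOLES 2     // two poles cascaded, -12 dB/octave above rolloff

static int32_t filt[NPOLES];  // persistent state, one word per pole

// Run one new sample through the cascade. The updated output of each
// pole is the NEW value of the next pole.
int32_t filter2_update(int32_t new_val) {
    for (int i = 0; i < NPOLES; i++) {
        filt[i] += (new_val - filt[i]) >> FILT_BITS;  // one pole update
        new_val = filt[i];
    }
    return new_val;
}
```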
To facilitate designing these filters (which only means picking FF and deciding how many of them to cascade), I use my program FILTBITS. You specify the number of shift bits for each FF in the cascaded series of filters, and it computes the step response and other values. Actually I usually run this via my wrapper script PLOTFILT. This runs FILTBITS, which makes a CSV file, then plots the CSV file. For example, here is the result of "PLOTFILT 4 4":
The two parameters to PLOTFILT mean there will be two filters cascaded of the type described above. The values of 4 indicate the number of shift bits to realize the multiply by FF. The two FF values are therefore 1/16 in this case.
The red trace is the unit step response, and is the main thing to look at. For example, this tells you that if the input changes instantaneously, the output of the combined filter will settle to 90% of the new value in 60 iterations. If you care about 95% settling time then you have to wait about 73 iterations, and for 50% settling time only 26 iterations.
The green trace shows you the output from a single full amplitude spike. This gives you some idea of the random noise suppression. It looks like no single sample will cause more than a 2.5% change in the output.
The blue trace is to give a subjective feeling of what this filter does with white noise. This is not a rigorous test since there is no guarantee what exactly the content was of the random numbers picked as the white noise input for this run of PLOTFILT. It's only to give you a rough feeling of how much it will be squashed and how smooth it is.
PLOTFILT, maybe FILTBITS, and lots of other useful stuff, especially for PIC firmware development, are available in the PIC Development Tools software release on my Software downloads page.
Added about numerical precision
I see from the comments and now a new answer that there is interest in discussing the number of bits needed to implement this filter. Note that the multiply by FF will create Log2(1/FF) new bits below the binary point. On small systems, FF is usually chosen to be 1/2^N so that this multiply is actually realized by a right shift of N bits.
FILT is therefore usually a fixed point integer. Note that this doesn't change any of the math from the processor's point of view. For example, if you are filtering 10 bit A/D readings and N = 4 (FF = 1/16), then you need 4 fraction bits below the 10 bit integer A/D readings. On most processors, you'd be doing 16 bit integer operations due to the 10 bit A/D readings. In this case, you can still do exactly the same 16 bit integer operations, but start with the A/D readings left shifted by 4 bits. The processor doesn't know the difference and doesn't need to. Doing the math on whole 16 bit integers works whether you consider them to be 12.4 fixed point or true 16 bit integers (16.0 fixed point).
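As a sketch of that arrangement in C: everything below is plain 16 bit integer arithmetic, and the 12.4 interpretation exists only in the programmer's head. The function returns the 12.4 fixed point state, which callers can shift right by 4 to get back to the 10 bit scale.

```c
#include <stdint.h>

#define FILT_BITS 4  // FF = 1/16

static uint16_t filt;  // 12.4 fixed point: 10 bit reading + 4 fraction bits

// Filter one raw 10 bit A/D reading. The reading is first shifted left
// FILT_BITS bits so the fraction bits created by the >> are retained in
// FILT instead of being thrown away.
uint16_t filter_adc(uint16_t adc10) {
    uint16_t new_fp = adc10 << FILT_BITS;        // reading as 12.4 fixed point
    int16_t diff = (int16_t)(new_fp - filt);     // NEW - FILT, signed
    filt += diff >> FILT_BITS;                   // assumes arithmetic right shift
    return filt;                                 // 12.4 result; >> 4 for 10 bit scale
}
```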
In general, you need to add N bits for each filter pole if you don't want to add noise due to the numerical representation. In the example above, the second filter of two would have to have 10+4+4 = 18 bits to not lose information. In practice on an 8 bit machine that means you'd use 24 bit values. Technically only the second pole of two would need the wider value, but for firmware simplicity I usually use the same representation, and thereby the same code, for all poles of a filter.
Usually I write a subroutine or macro to perform one filter pole operation, then apply that to each pole. Whether a subroutine or macro depends on whether cycles or program memory are more important in that particular project. Either way, I use some scratch state to pass NEW into the subroutine/macro, which updates FILT, but also loads that into the same scratch state NEW was in. This makes it easy to apply multiple poles since the updated FILT of one pole is the NEW of the next one. When a subroutine, it's useful to have a pointer point to FILT on the way in, which is updated to just after FILT on the way out. That way the subroutine automatically operates on consecutive filters in memory if called multiple times. With a macro you don't need a pointer since you pass in the address to operate on each iteration.
Code Examples
Here is an example of a macro as described above for a PIC 18:
////////////////////////////////////////////////////////////////////////////////
//
// Macro FILTER filt
//
// Update one filter pole with the new value in NEWVAL. NEWVAL is updated to
// contain the new filtered value.
//
// FILT is the name of the filter state variable. It is assumed to be 24 bits
// wide and in the local bank.
//
// The formula for updating the filter is:
//
// FILT <-- FILT + FF(NEWVAL - FILT)
//
// The multiply by FF is accomplished by a right shift of FILTBITS bits.
//
/macro filter
/write
dbankif lbankadr
movf [arg 1]+0, w ;NEWVAL <-- NEWVAL - FILT
subwf newval+0
movf [arg 1]+1, w
subwfb newval+1
movf [arg 1]+2, w
subwfb newval+2
/write
/loop n filtbits ;once for each bit to shift NEWVAL right
rlcf newval+2, w ;shift NEWVAL right one bit
rrcf newval+2
rrcf newval+1
rrcf newval+0
/endloop
/write
movf newval+0, w ;add shifted value into filter and save in NEWVAL
addwf [arg 1]+0, w
movwf [arg 1]+0
movwf newval+0
movf newval+1, w
addwfc [arg 1]+1, w
movwf [arg 1]+1
movwf newval+1
movf newval+2, w
addwfc [arg 1]+2, w
movwf [arg 1]+2
movwf newval+2
/endmac
And here is a similar macro for a PIC 24 or dsPIC 30 or 33:
////////////////////////////////////////////////////////////////////////////////
//
// Macro FILTER ffbits
//
// Update the state of one low pass filter. The new input value is in W1:W0
// and the filter state to be updated is pointed to by W2.
//
// The updated filter value will also be returned in W1:W0 and W2 will point
// to the first memory past the filter state. This macro can therefore be
// invoked in succession to update a series of cascaded low pass filters.
//
// The filter formula is:
//
// FILT <-- FILT + FF(NEW - FILT)
//
// where the multiply by FF is performed by an arithmetic right shift of
// FFBITS.
//
// WARNING: W3 is trashed.
//
/macro filter
/var new ffbits integer = [arg 1] ;get number of bits to shift
/write
/write " ; Perform one pole low pass filtering, shift bits = " ffbits
/write " ;"
sub w0, [w2++], w0 ;NEW - FILT --> W1:W0
subb w1, [w2--], w1
lsr w0, #[v ffbits], w0 ;shift the result in W1:W0 right
sl w1, #[- 16 ffbits], w3
ior w0, w3, w0
asr w1, #[v ffbits], w1
add w0, [w2++], w0 ;add FILT to make final result in W1:W0
addc w1, [w2--], w1
mov w0, [w2++] ;write result to the filter state, advance pointer
mov w1, [w2++]
/write
/endmac
Both these examples are implemented as macros using my PIC assembler preprocessor, which is more capable than either of the built-in macro facilities.
The problem with your approach is that the EEPROM data is permanent, whereas the variable is volatile. Once the device is power cycled, how will you check the data in the future? The variable is gone, and all you have is the EEPROM copy.
What you have so far is not a data integrity strategy, but only a write verification step (which is worthwhile; don't misunderstand).
You could store checksums along with the data; however, checksums can only tell you that the data is corrupted. This is better than proceeding with corrupted data; however, if the data is critical, it means that the device has failed.
A more robust solution is to store the data in such a way that not only error detection is possible, but also error correction.
You can implement a Hamming code for individual words of the data. A Hamming code can recover from a single-bit error; an extended Hamming code can additionally detect (but not correct) two-bit errors.
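As a minimal illustration (my own sketch, not from any particular library), here is a Hamming(7,4) code protecting one nibble. Encoding adds three parity bits; decoding computes a syndrome whose value is directly the position of a single flipped bit:

```c
#include <stdint.h>

// Encode 4 data bits into a 7-bit Hamming(7,4) codeword.
// Codeword bit i (0-based) holds Hamming position i+1; parity bits sit
// at positions 1, 2, and 4.
uint8_t hamming74_encode(uint8_t d) {
    uint8_t d0 = d & 1, d1 = (d >> 1) & 1, d2 = (d >> 2) & 1, d3 = (d >> 3) & 1;
    uint8_t p1 = d0 ^ d1 ^ d3;   // covers positions 1,3,5,7
    uint8_t p2 = d0 ^ d2 ^ d3;   // covers positions 2,3,6,7
    uint8_t p3 = d1 ^ d2 ^ d3;   // covers positions 4,5,6,7
    return p1 | (p2 << 1) | (d0 << 2) | (p3 << 3) |
           (d1 << 4) | (d2 << 5) | (d3 << 6);
}

// Decode a 7-bit codeword, correcting up to one flipped bit.
// The syndrome (s3 s2 s1) is 0 for no error, else the error position.
uint8_t hamming74_decode(uint8_t c) {
    uint8_t b[8];
    for (int i = 1; i <= 7; i++) b[i] = (c >> (i - 1)) & 1;
    uint8_t s1 = b[1] ^ b[3] ^ b[5] ^ b[7];
    uint8_t s2 = b[2] ^ b[3] ^ b[6] ^ b[7];
    uint8_t s3 = b[4] ^ b[5] ^ b[6] ^ b[7];
    uint8_t pos = s1 | (s2 << 1) | (s3 << 2);
    if (pos) b[pos] ^= 1;        // correct the single-bit error
    return b[3] | (b[5] << 1) | (b[6] << 2) | (b[7] << 3);
}
```

For EEPROM words you would apply the same idea with a wider code, e.g. protecting 8 data bits with 4 parity bits per byte.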
If you have lots of space on the EEPROM, you can implement redundancy. For instance, you can split the EEPROM into two halves which mirror each other, similar to a RAID 1 mirroring scheme used for hard disks. Write every unit of data to both partitions, with block checksums. When reading data, if a checksum is bad, you can try the copy in the mirrored partition. Chances are its checksum is good. (And if so, you can overwrite the bad copy with the good one to repair it.)
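A sketch of the mirror-with-checksums idea follows. All names, sizes, and the simple additive checksum are my own illustrative choices, and a RAM array stands in for the EEPROM; on real hardware the accesses would go through your EEPROM driver:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define BLOCK_DATA    15      // data bytes per block
#define BLOCK_SIZE    16      // data bytes + 1 checksum byte
#define MIRROR_OFFSET 1024    // second half mirrors the first

static uint8_t eeprom[2048];  // stand-in for the physical EEPROM

static uint8_t checksum(const uint8_t *p, size_t n) {
    uint8_t s = 0;            // simple additive checksum for illustration
    while (n--) s += *p++;
    return s;
}

// Write one block of data to both halves, each with its checksum.
void block_write(size_t addr, const uint8_t *data) {
    for (int copy = 0; copy < 2; copy++) {
        uint8_t *blk = &eeprom[addr + copy * MIRROR_OFFSET];
        memcpy(blk, data, BLOCK_DATA);
        blk[BLOCK_DATA] = checksum(data, BLOCK_DATA);
    }
}

// Read one block. If the primary copy fails its checksum, fall back to
// the mirror and repair the primary. Returns 0 on success, -1 if both
// copies are bad.
int block_read(size_t addr, uint8_t *data) {
    for (int copy = 0; copy < 2; copy++) {
        uint8_t *blk = &eeprom[addr + copy * MIRROR_OFFSET];
        if (checksum(blk, BLOCK_DATA) == blk[BLOCK_DATA]) {
            memcpy(data, blk, BLOCK_DATA);
            if (copy == 1)    // primary was bad: overwrite it with the good copy
                memcpy(&eeprom[addr], blk, BLOCK_SIZE);
            return 0;
        }
    }
    return -1;
}
```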
To process the bits as efficiently as possible, you're going to want to keep them packed into 32-bit words wherever it makes sense. 816 bits is 25.5 words, which really isn't bad at all.
To search for ones efficiently, break the task into two steps: Check entire words for non-zero in the outer loop, then search for individual bits in the words that aren't all-zero in the inner loop.
One trick that can be used to isolate individual bits in a word that has multiple bits set is to AND the word with its negation (2's complement). The result has just a single bit set — the rightmost 'one' in the original word. You can then use this result to clear that bit and look for the next 'one'.
In C:
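The original code was not preserved here; below is a reconstruction of the two-step scan as described above. Counting the 'one's stands in for whatever per-bit processing you actually need:

```c
#include <stdint.h>
#include <stddef.h>

// Two-step scan: the outer loop rejects all-zero words cheaply; the
// inner loop then runs once per set bit, isolating each with w & -w
// and clearing it before looking for the next.
size_t count_set_bits(const uint32_t *words, size_t nwords) {
    size_t count = 0;
    for (size_t i = 0; i < nwords; i++) {
        uint32_t w = words[i];
        while (w != 0) {            // iterates once per 'one', not per bit
            uint32_t bit = w & -w;  // isolate the rightmost set bit
            // ... process 'bit' here ...
            w ^= bit;               // clear it and look for the next 'one'
            count++;
        }
    }
    return count;
}
```

Note that `-w` on an unsigned word is well-defined in C (modulo 2^32) and yields the 2's complement.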
For example, suppose your 32-bit word contains 0x40010080:
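Tracing the isolate-and-clear loop on this word by hand, wrapped as a small demo function: three bits are set, so there are exactly three iterations, least significant bit first.

```c
#include <stdint.h>

// Walk the isolate-and-clear loop on 0x40010080 and record each
// isolated bit. Returns the number of iterations.
int trace_0x40010080(uint32_t out[3]) {
    uint32_t w = 0x40010080;
    int n = 0;
    while (w != 0) {
        uint32_t bit = w & -w;  // 0x00000080, then 0x00010000, then 0x40000000
        out[n++] = bit;
        w ^= bit;               // clear that bit
    }
    return n;
}
```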
EDIT:
The efficiency of this algorithm comes from the fact that you only iterate once for each 'one' (three of them in this example), rather than once for each of the 32 bits in the word. This is a huge advantage if you're doing something like simply counting the 'one's. The downside is that it doesn't give you the bit index directly, but there are ways to accelerate that as well. For example, to get the bit index of a single bit set in a word, you could use a binary encoding algorithm:
The overall advantage of the two algorithms above, as compared to a brute-force iteration through the bits, depends on how many bits you expect to find in each word. If they're rare (less than about 4 per word), then these algorithms should be faster. If they're more common than that, just go with the iterative loop:
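A sketch of that brute-force loop (the `found` output array is my own addition, to give the collected bit indices something to land in):

```c
#include <stdint.h>
#include <stddef.h>

// Brute-force alternative: test all 32 bits of every word. Simpler, and
// faster when set bits are common. Stores up to max_found global bit
// indices in found[] and returns the total count of set bits.
size_t find_ones_iterative(const uint32_t *words, size_t nwords,
                           size_t *found, size_t max_found) {
    size_t count = 0;
    for (size_t i = 0; i < nwords; i++) {
        uint32_t w = words[i];
        for (int b = 0; b < 32; b++) {
            if ((w >> b) & 1) {
                if (count < max_found)
                    found[count] = i * 32 + b;  // global bit index
                count++;
            }
        }
    }
    return count;
}
```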