Electronic – 8GB SD cards intermittently overwriting data

embedded renesas sd spi

I have inherited a custom embedded device which uses a Renesas H8S MCU to store data on an SD card. It does this directly, with no filesystem, by implementing SPI at a low level in firmware written in assembler. The original (MMC, pre-SPI) version worked for many years without a problem on 2GB SD cards.

A while back we started installing 8GB Kingston cards into the devices because our traditional 2GB cards were becoming scarce, so we had some modifications made to the firmware read/write routines to support the "new" SPI modes and timings. The developer got this working and thought it was all fine.

Now we have found that some of our data is being intermittently overwritten with blocks of 0x55, i.e. 55555555555555555… (this is 85 decimal or, interestingly, 01010101 in binary).

During normal operation, our firmware buffers some incoming data in RAM and NVRAM until it reaches a full 512-byte block, then it writes a single block to the SDC, keeping track of the next address. It keeps writing to subsequent 512-byte blocks, never backtracking, as it's adding to a log.

However when we read the data back, sometimes the older data has been overwritten with 0x55 in chunks of some currently unknown page size (at least 128 bytes, so probably 512). It seems these 55's are being written 'below' our data and sometimes overlap it.

We can debug the firmware (with HEW and the USB hardware emulator) but cannot see a problem in our code. I'm a bit blind to what's going on in the SDC as the only way I know to look at it, is to remove and image it (using dd on my mac) – which is very slow. (In fact I've had trouble seeing anything sensible on the 8GB SD images.)

I'm pretty sure the data is corrupted on the card during write or shortly after, and not being read back wrongly. We have two fairly different read routines and they both return the same data. Debugging shows that the 55 data is coming off the card, and not being broken later in the processing.

So I can only guess that it's some race-condition, some internal caching or buffering in the SDC, or possibly the erase-block-size which might be overwriting larger areas than the 512 blocks we are writing.

The card is used for some other purposes in other, lower reserved areas, with various duty cycles in our firmware and also interrupts so it's possible there's a conflict there, but I would have thought this would have happened years ago.

It only happens with these 8GB cards. We haven't managed to try any other card variants/brands yet but do plan to.

So my questions are:

  1. are there known risks and common problems like this writing directly to the card and avoiding an FS? (like block alignment)
  2. do the larger cards internally buffer or cache data?
  3. is 0x55 a known "filler" value (as opposed to the normal 0xFF erase)
  4. Can anyone think of what is causing the intermittent overwrites?

UPDATE

Today I tested as many cards as I could find, and found all but one fail in similar, but sometimes different ways.

Result  Card
FAIL    8GB Kingston Ultra microSDHC Class 4 (CO8G Taiwan), production.
PASS    4GB Verbatim microSD HC Class 4 - brand new
FAIL    16GB SanDisk Ultra microSDHC 80mb/s 533x - new
FAIL    8GB SanDisk microSDHC I Class 4 "BI Made in China", reused.
FAIL    8GB SanDisk ULTRA microSD HC I - UHS Class 1, production.
FAIL    4GB SanDisk microSDHC class4 - new.
FAIL*   ADATA 4GB microSDHC class4 - new.
PASS    Transcend 2GB microSD (SC?) 
ON ORDER    Transcend 8GB microSD   
PASS    Verbatim 8GB microSDHC Class 10 
PASS    Kingston 8GB microSDHC Class 10 / UHC 1
FAIL    Un-branded Taiwan 8GB SDHC Class 10

Of these cards, the "worst" is the 16GB SanDisk, which seemed unpredictable and even appeared to move data around after it was written, as subsequent views of the data are different. The ADATA card also behaved differently: it needed two reads to return the true data. On the first read the data was only partially visible, but on the second read it looked correct.

Most of them behaved in the same way, as described above, with old blocks looking like they were overwritten when subsequent writes move into new blocks (probably). Some of them failed almost immediately after one or two 512 block writes, much worse than our primary Kingston 8GB cards.

After watching the data strangely move around on these disks, I currently feel like the larger disks are "doing clever stuff" to the data (like buffering or caching) which we are only experiencing because we are writing/reading directly to memory addresses rather than respecting the filesystem layer (even if there isn't a specific bug in our code).

UPDATE2

I have clarified and simplified the problem scenario, showing that it's more simply:

  • I have some A-logs stored in one area of the SD card
  • A routine writes some B-logs to another area in one 512 block
  • When I read the A-logs they look corrupted
  • When I re-read the A-logs, they "fix" themselves, on the third read, and return the true data.
  • Most of the cards I've tested show 00000000 in the data when it's corrupted.

From this and much reading, I am now thinking the corruption is false, the data on the card is fine, and the data returned by our log reader is perhaps the "stream of busy-tokens" or DO still being held low from the write process. It feels like the process of reading the A-logs is finishing off clocking the previous write of the B-logs.

However looking at the code, it seems like the protocol is being respected, and the write macro doesn't exit until DO goes high, after sending clocks in a loop (i.e. not 8 or 32, just however many it needs!)

This is the assembly.

The part I've been experimenting with is the _MMCSPIWrite512ByteBlockFromer2Accept end loop, which I tried adding various things to:

  • Added read and/or write cycles instead of just clocks in the loop
  • I added 8 dummy clocks just before lowering CS
  • I added 32 dummy clocks just before lowering CS

The _MMCSPISendByter0l macro sends 8 clocks with the bits.

On stepping through it, it seems to be doing roughly the right process. It's hard to count the exact iterations in the various loops though (in HEW emulator, bit slow), and any debug logging is just too hard in asm! And also stepping seems to affect the timing, and sometimes it locks up.

This macro _MMCSPIWrite512ByteBlockFromer2 is the primary function to write a 512 block to the SD card, from a ram buffer pointed to from er2.

I've checked the simplified SPI spec, and the error/accept token part seems to be right.

As I said before, this write algorithm has many years of successful practice in thousands of devices – on some cards. So I don't expect it to be such a basic error in the protocol. I just can't see the problem which seems to leave the read state reading 000000's – for possibly around the first 512 bytes read (or occasionally h'55s – which might actually be coming from elsewhere in our application).

    _MMCSPIWrite512ByteBlockFromer2 
        bset    H_MMCSPIDI                   ; raise DI
        bclr    H_MMCSPInCS                  ; lower CS
        mov.b   #h'ff,r0l                               ;8 clock cycles Ncs
        bsr     _MMCSPISendByter0l
        mov.b   #C_MMCSPICMD24,r0l                      ;cmd24 is the write block command
        bclr    #7,r0l                                  ;b'01 mask
        bset    #6,r0l
        bsr     _MMCSPISendByter0l
        bsr     _MMCSPISendArguments                    ;send four arguments/address
        mov.b   #h'ff,r0l                               ;h'ff is the unused dummy crc
        bsr     _MMCSPISendByter0l
        bset    H_MMCSPIDI                              ;send hi data with clocks till response obtained

    _MMCSPIWrite512ByteBlockFromer2WaitForResponse
        btst    H_MMCSPIDO                              ;check for response, send clocks till response zero bit arrives
        beq     _MMCSPIWrite512ByteBlockFromer2Responded
        bsr     _MMCSPIClock
        bra     _MMCSPIWrite512ByteBlockFromer2WaitForResponse
    _MMCSPIWrite512ByteBlockFromer2Responded
        bsr     _MMCSPIReceive7Bitsr0l                  ;response should be h'00 else an error
        bne     _MMCSPIWrite512ByteBlockFromer2Error
        mov.b   #h'ff,r0l                               ;send Nwr
        bsr     _MMCSPISendByter0l
        mov.b   #h'fe,r0l                               ;send start of data header
        bsr     _MMCSPISendByter0l
    _MMCSPIWrite512ByteBlockFromer2DataStart
        mov.w   #d'512,e0                               ;send 512 byte block of data
    _MMCSPIWrite512ByteBlockFromer2DataLoop
        mov.b   @er2,r0l                                ;write data from er2 pointer, byte at a time, 512 times
        bsr _MMCSPISendByter0l
        inc.l   #1,er2
        dec.w   #1,e0
        bne     _MMCSPIWrite512ByteBlockFromer2DataLoop
    _MMCSPIWrite512ByteBlockFromer2CRC
        mov.b   #h'ff,r0l                               ;2 dummy crc bytes
        bsr _MMCSPISendByter0l
        mov.b   #h'ff,r0l
        bsr _MMCSPISendByter0l
    _MMCSPIWrite512ByteBlockFromer2CheckResponse
        rotl.b  r0l                                     ;get next lsb of response byte ready
        bsr     _MMCSPIClock
        bld     H_MMCSPIDO                              ;place mmc data output into response byte lsb
        bst     #0,r0l
        mov.b   r0l,r0h
        and.b   #b'00011111,r0h                         ;mask inappropriate bits of response and check
        cmp     #b'00001011,r0h                         ;check if a reject token?
        beq     _MMCSPIWrite512ByteBlockFromer2Reject
        cmp     #b'00000101,r0h                         ;check if an accept token?
        beq     _MMCSPIWrite512ByteBlockFromer2Accept
        bra     _MMCSPIWrite512ByteBlockFromer2CheckResponse    ;loop till reject or accept received

    _MMCSPIWrite512ByteBlockFromer2Accept
        bsr     _MMCSPIClock                            ;provide clocks for mmc writing clock requirements
        btst    H_MMCSPIDO
        beq     _MMCSPIWrite512ByteBlockFromer2Accept   ;keep clocking till write complete ack'd with a high DO

        bset    H_MMCSPInCS
        mov.b   #h'00,r0l
        rts
    _MMCSPIWrite512ByteBlockFromer2Error
    _MMCSPIWrite512ByteBlockFromer2Reject
        mov.b   #h'ff,r0l                               ;8 clock Nec
        bsr     _MMCSPISendByter0l
        bset    H_MMCSPInCS
        mov.b   #h'ff,r0l
        rts
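
For reference, the accept/reject check in the listing above can be sketched in C (used here only because it is easier to test on a host than the assembler; the names `sd_data_resp_t` and `sd_decode_data_response` are hypothetical and not part of the firmware). The data response token has the form xxx0sss1, and besides the accept (0x05) and CRC-error (0x0B) patterns that the loop tests for, the SD specification also defines a write-error pattern (0x0D). A loop that only exits on the first two would presumably keep clocking if the card ever returned the third.

```c
#include <stdint.h>

/* Possible outcomes of the data response token that follows a written block. */
typedef enum {
    SD_DATA_ACCEPTED,
    SD_DATA_CRC_ERROR,
    SD_DATA_WRITE_ERROR,
    SD_DATA_NOT_A_TOKEN
} sd_data_resp_t;

/* Token layout is xxx0sss1: bit 4 is 0, bit 0 is 1, bits 3..1 hold the status.
 * Masking with 0x1F discards the three undefined upper bits. */
sd_data_resp_t sd_decode_data_response(uint8_t byte)
{
    switch (byte & 0x1F) {
    case 0x05: return SD_DATA_ACCEPTED;     /* status 010: data accepted */
    case 0x0B: return SD_DATA_CRC_ERROR;    /* status 101: rejected, CRC error */
    case 0x0D: return SD_DATA_WRITE_ERROR;  /* status 110: rejected, write error */
    default:   return SD_DATA_NOT_A_TOKEN;  /* e.g. 0xFF while the line idles high */
    }
}
```

Treating the write-error case as a third exit from the loop, and bounding the number of retries, would at least turn a silent hang into a reportable failure.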

Update 3

Today I focused on two things: the block size suggestion, and examining the READ routine – on thinking more about the fact the data seems written correctly, but read intermittently.

Firstly, I have tried sending CMD16 to set the block size to 512, but

a) it didn't have any effect on the failures (if it worked)
b) I'm not 100% sure I got it right in assembler – although while debugging I did get a "parameter error" bit set in the response, which I resolved and then got a 0x00 response, so I think I'm making the command call right…
c) On reading the simplified SPI spec, it clearly says the block size is 512 for SDHC and this command is not used. So many people say you should set the block size – is this for MMC cards? or is the spec wrong?

In the case of a Standard Capacity SD Memory Card, this command sets the block length (in bytes) for all following block commands (read, write, lock). Default block length is fixed to 512 Bytes. Set length is valid for memory access commands only if partial block read operation are allowed in CSD.
In the case of SDHC and SDXC Cards, block length set by CMD16 command does not affect memory read and write commands. Always 512 Bytes fixed block length is used. This command is effective for LOCK_UNLOCK command.
In both cases, if block length is set larger than 512Bytes, the card sets the BLOCK_LEN_ERROR bit.
In DDR50 mode, data is sampled on both edges of the clock.
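
As a cross-check for the hand-assembled CMD16 call, here is a minimal C sketch of how a 6-byte SPI command frame is laid out (the function name `sd_build_command` is hypothetical). It assumes, as the listings in this post do, that CRC checking is off and a dummy 0xFF is sent as the last byte.

```c
#include <stdint.h>

/* Build a 6-byte SPI-mode command frame:
 *   byte 0    : start bit 0, transmission bit 1, then the 6-bit command index
 *   bytes 1-4 : 32-bit argument, most significant byte first
 *   byte 5    : CRC7 plus end bit (a dummy 0xFF when CRC checking is off) */
void sd_build_command(uint8_t cmd, uint32_t arg, uint8_t crc, uint8_t frame[6])
{
    frame[0] = (uint8_t)(0x40 | (cmd & 0x3F));
    frame[1] = (uint8_t)(arg >> 24);
    frame[2] = (uint8_t)(arg >> 16);
    frame[3] = (uint8_t)(arg >> 8);
    frame[4] = (uint8_t)arg;
    frame[5] = crc;
}
```

For CMD16 with a block length of 512 this gives the bytes 50 00 00 02 00 FF, which can be compared against what the assembler routine actually clocks out.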

Secondly I examined the read routine and as before, it seems generally correct. I previously tried adding more "dummy clocks" with no effect.

The routine that reads from SPI is as follows.

_MMCSPIRead512ByteBlockToer2
        bset    H_MMCSPIDI
        bclr    H_MMCSPInCS
        mov.b   #h'ff,r0l                               ;8 clock cycles Ncs
        bsr     _MMCSPISendByter0l
        mov.b   #C_MMCSPICMD17,r0l                      ;cmd17 is the read block command
        bclr    #7,r0l                                  ;b'01 mask
        bset    #6,r0l
        bsr     _MMCSPISendByter0l
        bsr     _MMCSPISendArguments                    ;send four arguments/address
        mov.b   #h'ff,r0l                               ;h'ff is the unused dummy crc
        bsr     _MMCSPISendByter0l
        bset    H_MMCSPIDI                              ;send hi data with clocks till response obtained
_MMCSPIRead512ByteBlockToer2WaitForResponse
        btst    H_MMCSPIDO                              ;check for response, send clocks till response zero bit arrives
        beq     _MMCSPIRead512ByteBlockToer2Responded
        bsr     _MMCSPIClock
        bra     _MMCSPIRead512ByteBlockToer2WaitForResponse     
_MMCSPIRead512ByteBlockToer2Responded
        bsr     _MMCSPIReceive7Bitsr0l                  ;response should be h'00 else an error
        bne     _MMCSPIRead512ByteBlockToer2Error
        bset    H_MMCSPIDI                              ;send hi data with clocks till response obtained
        bsr     _MMCSPIClock                            ;clock ensures previous zero on DO is removed
_MMCSPIRead512ByteBlockToer2WaitForData
        btst    H_MMCSPIDO                              ;check for response, send clocks till response zero bit arrives
        beq     _MMCSPIRead512ByteBlockToer2DataStart   ; because the response desired is h'fe , so last bit of this
        bsr     _MMCSPIClock                            ; response signals the start of data
        bra     _MMCSPIRead512ByteBlockToer2WaitForData     
_MMCSPIRead512ByteBlockToer2DataStart
        mov.w   #d'512,e0                               ;get 512 byte block of data
_MMCSPIRead512ByteBlockToer2DataLoop
        bsr     _MMCSPIReceiver0l                       ;get data and store into er2 pointer, 512 times
        mov.b   r0l,@er2
        inc.l   #1,er2
        dec.w   #1,e0
        bne     _MMCSPIRead512ByteBlockToer2DataLoop
_MMCSPIRead512ByteBlockToer2End
        bsr     _MMCSPIReceiver0l                       ;get 2 crc bytes
        bsr     _MMCSPIReceiver0l
        bsr     _MMCSPIReceiver0l                       ;a few more reads to ensure data is completely flushed
        bsr     _MMCSPIReceiver0l
        bsr     _MMCSPIReceiver0l                       
        bset    H_MMCSPInCS
        mov.b   #h'00,r0l
        rts
_MMCSPIRead512ByteBlockToer2Error
        mov.b   #h'ff,r0l                               ;8 clock Nec
        bsr     _MMCSPISendByter0l
        bset    H_MMCSPInCS
        mov.b   #h'ff,r0l
        rts

Update 4

After writing some routines to manually write to the various logs, I can now reliably reproduce the problem with the simplest possible set of steps. I have also become more familiar with debugging in HEW and inspecting various memory locations.

It turns out that most of the effects I was seeing were due to old data.

After writing to the B-logs, the SPI command for reading from the A-logs returns an error from the card. However, our firmware doesn't report this fact, nor does it return an error result to the higher functions. So the higher log-processing functions continue to read the data from a memory buffer which hasn't been changed. The buffer holds various data from previous log reads/writes, and when it's further decoded (from hex in binary to decimal in ASCII) all sorts of false data appears. The 5555's are actually coming mostly from 0xFF's in RAM.
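
The failure mode described here (a read error that is swallowed, leaving stale buffer contents to be decoded) can be illustrated with a small C sketch. All names are hypothetical: `sd_read_fn` stands in for the low-level read routine, and the two `demo_` functions are stubs that exist only so the behaviour can be tried on a host.

```c
#include <stdint.h>
#include <string.h>

#define SD_BLOCK_SIZE 512

/* Stand-in for the low-level block read: returns 0 on success, nonzero on a card error. */
typedef int (*sd_read_fn)(uint32_t block, uint8_t *buf);

/* Read one log block and pass any card error up to the caller.
 * On failure the buffer is cleared so that stale data from an
 * earlier read cannot be decoded as if it were fresh. */
int log_read_block(sd_read_fn read_block, uint32_t block, uint8_t *buf)
{
    int rc = read_block(block, buf);
    if (rc != 0) {
        memset(buf, 0, SD_BLOCK_SIZE);
        return rc;
    }
    return 0;
}

/* Demo stub: a card that always reports an error and leaves the buffer untouched. */
int demo_read_fails(uint32_t block, uint8_t *buf)
{
    (void)block;
    (void)buf;
    return 0xFF;
}

/* Demo stub: a card that returns a block filled with 0xAA. */
int demo_read_ok(uint32_t block, uint8_t *buf)
{
    (void)block;
    memset(buf, 0xAA, SD_BLOCK_SIZE);
    return 0;
}
```

The important part is the return code: the higher log-processing functions would only decode the buffer when `log_read_block` returns 0.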

So while I still don't know why the SPI read command is erroring, I can close this question as it's mostly answered and was probably too broad.

Best Answer

are there known risks and common problems like this writing directly to the card and avoiding an FS? (like block alignment)

Generally no, using the card as a simple block device is perfectly safe. However, without an FS layer, you're responsible for what you do yourself. So if you have a glitch in your code, it may affect the SD card more.

do the larger cards internally buffer or cache data?

No. They write the block, and once they no longer indicate "busy" (R1b) the data is persistent. You do poll the busy state before operating any further on the card, don't you?
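
A minimal C sketch of such a busy poll, with a timeout so a stuck card cannot hang the firmware. The names are hypothetical: `spi_xfer_fn` stands in for whatever routine exchanges one byte over SPI, and `fake_card_xfer` with `fake_busy_bytes` is a host-side stub, not a real card.

```c
#include <stdint.h>

/* Stand-in for the routine that clocks one byte out and returns the byte read back. */
typedef uint8_t (*spi_xfer_fn)(uint8_t out);

/* Clock 0xFF until the card stops holding DO low (a full 0xFF is read back).
 * Returns 0 when the card is ready, -1 if it is still busy after max_bytes bytes. */
int sd_wait_ready(spi_xfer_fn xfer, uint32_t max_bytes)
{
    uint32_t i;
    for (i = 0; i < max_bytes; i++) {
        if (xfer(0xFF) == 0xFF) {
            return 0;
        }
    }
    return -1;
}

/* Host-side stub: pretends to be a card that stays busy for fake_busy_bytes transfers. */
uint32_t fake_busy_bytes = 0;

uint8_t fake_card_xfer(uint8_t out)
{
    (void)out;
    if (fake_busy_bytes > 0) {
        fake_busy_bytes--;
        return 0x00;   /* DO held low: still busy */
    }
    return 0xFF;       /* DO released: ready */
}
```

Calling something like this before every new command, not only at the end of the write, would also cover the case where the previous operation had not finished when CS was raised.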

is 0x55 a known "filler" value (as opposed to the normal 0xFF erase)

No. Erased blocks read as 0xFF or 0x00 (depending on card).

Can anyone think of what is causing the intermittent overwrites?

I find it interesting that the corrupt data is consistently 0x55.

Thinking about it, are you aware that SDSC cards (<= 2GB) are byte-addressable whereas SDHC and SDXC cards are block-addressable only? Hence, sending address 0x0200 to an SDSC card will address the beginning of the second block of 512 bytes, whereas the same address on an SDHC/XC card will refer to the beginning of block number 512 (the 513th block of the card), i.e. byte address 262144. If you don't check and respect the type of card, you may be heading for trouble, or at least not writing data to where you think it should be.
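
As a minimal sketch of that difference in C (the names `sd_card_type_t` and `sd_block_to_arg` are made up for illustration), the argument sent with CMD17/CMD24 could be derived from a logical block number like this. It assumes the card type has already been determined during initialisation, for example from the CCS bit of the OCR.

```c
#include <stdint.h>

typedef enum {
    SD_CARD_SDSC,       /* standard capacity: command argument is a byte address */
    SD_CARD_SDHC_SDXC   /* high/extended capacity: argument is a 512-byte block number */
} sd_card_type_t;

/* Convert a logical 512-byte block number into the 32-bit read/write command argument. */
uint32_t sd_block_to_arg(sd_card_type_t type, uint32_t block)
{
    if (type == SD_CARD_SDSC) {
        return block * 512u;   /* byte address of the block */
    }
    return block;              /* block number used directly */
}
```

With this, logical block 1 becomes argument 0x200 on an SDSC card but argument 1 on an SDHC/SDXC card.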

Also, be sure to await the end of each operation carried out by the card.

One 8GB card I tried did not work with only the standard 8 "dummy" SPI-clocks after each command. Now, to be safe, I always send 32 clocks (4 bytes) and everything works like a charm.

You may want to check the block size of your cards, or, to be safe, set it to 512 bytes (CMD16) for all cards during init. Otherwise, the card may be expecting more data while you think you already sent enough and bad things will happen to your data.