Electrical – How to implement a precise delay function in Keil C51 (for the C8051) that waits an exact number of CPU clocks, ranging from 1 to 255

c51 delay keil timer timing

As the title says, I need exact, fairly short delays, ideally in the range of 0 – 350 CPU clocks; if only a narrower range is achievable, the absolute minimum I need is 20 – 127 CPU clocks.
With a 50 MHz CPU clock, these delays are below or just above one microsecond: from several clocks to several tens of clocks.
The problem with polling a timer is that the resolution is a step of up to 7 clocks, depending on the implementation. Here is what I have tried or considered:

  1. while(!TF0) {}
    The while, the negation and the bit test together take 7 clocks. So if I request anything between 15 and 21 clocks, I get a flat delay of 21 clocks …
  2. Using a timer interrupt and CPU stop mode – Gives good results for delays over 50 clocks. It probably depends on the current CPU state, so the delay sometimes goes well beyond 50 clocks, into the 100-clock range, due to interrupt and wake-up latency. Anything below that again comes out as a flat 50 (or 100) CPU clocks.
  3. Using a switch-case with, for example, 30 entries for 30 delays in 1-clock increments, each with a different number of NOPs as the delay – The compiler optimization makes the timing unpredictable and mostly far too long again, over 100 clocks. This renders the approach unusable.
  4. I am planning to try a table of pointers to functions with different numbers of NOPs. But before I try, I already see two problems with that approach: a. it will require a lot of memory and I have 1k left; b. the call-and-return latency of a void function(void) is around 18 clocks, so it is very tight to meet the absolute minimum of 20 clocks I need …
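
To make the granularity problem in item 1 concrete, here is a minimal host-side sketch (plain C, not C51 target code). It assumes the polling loop can only exit on 7-clock boundaries, as quoted above; the names POLL_STEP_CLOCKS and polled_delay_clocks are made up for this illustration.

```c
#include <assert.h>

/* Clocks consumed by one pass of the polling loop (figure quoted in item 1). */
#define POLL_STEP_CLOCKS 7u

/* Hypothetical model: the delay actually obtained when the loop can only
   exit on POLL_STEP_CLOCKS boundaries, i.e. the request rounded up. */
unsigned polled_delay_clocks(unsigned requested)
{
    assert(requested > 0u);
    /* round up to the next multiple of the loop period */
    return ((requested + POLL_STEP_CLOCKS - 1u) / POLL_STEP_CLOCKS)
           * POLL_STEP_CLOCKS;
}
```

Under this model every request from 15 to 21 clocks comes out as 21 clocks, which is the flat step described in item 1.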

How should I approach this type of problem? Any ideas are more than welcome.

By the way, I run it on a C8051F38x microcontroller from Silicon Labs, using Keil C51 to code and compile, if that matters.

The code below emerged as a partial solution. It seems to follow the same timing as the while loop in C does: the "djnz" instruction takes 5-6 CPU cycles instead of the 2/4 stated in the datasheet.

ACC_save=   ACC;
ACC     =   counter102;
P0b3    =   1;  //             Start the Pulse
#pragma ASM     //             Precise DELAY using assembler
clr C           //  ; 1       Clear Carry
rrc A           //  ; 1       C = 1 if odd
jnc even        //  ; 2 or 4  extra 2 cycles if branch taken (spoils cache)
nop             //  ; 1
nop             //  ; 1
clr   C         //  ; 1
even:
subb  A,#4      //  ; 1
mov   R7,A      //  ; 1
loop:
djnz  R7, loop  //  ; supposed to be 2, but practically takes 5 to 6 cycles!
#pragma ENDASM
P0b3    =   0;  //             Stop the Pulse

EDIT

Thank you very much, everyone, for the great input. I couldn't imagine that the flow of ideas could be so positive and, most importantly, productive. My deep appreciation to all who contributed and will contribute in the future.
After your valuable input and great ideas, I came up with something that works for me, to some extent. The code is below:

void    delay(unsigned char delay_time) {
switch (delay_time)
{case     8:    goto     Q08;
case      9:    goto     Q09;
case     10:    goto     Q10;
case     11:    goto     Q11;
case     12:    goto     Q12;
case     13:    goto     Q13;
case     14:    goto     Q14;
case     15:    goto     Q15;
case     16:    goto     Q16;
case     17:    goto     Q17;
case     18:    goto     Q18;
case     19:    goto     Q19;
case     20:    goto     Q20;
default      :  goto     Q00;   }

Q19:    PORT_ACTIVE(1); //  2clk
Q17:    PORT_ACTIVE(1); //  2clk
Q15:    PORT_ACTIVE(1); //  2clk
Q13:    PORT_ACTIVE(1); //  2clk
Q11:    PORT_ACTIVE(1); //  2clk
Q09:    PORT_ACTIVE(1); //  2clk
    _nop_();            //  1clk
goto EXIT1;             //  Skip the Even delay part

Q20:    PORT_ACTIVE(1); //  2clk
Q18:    PORT_ACTIVE(1); //  2clk
Q16:    PORT_ACTIVE(1); //  2clk
Q14:    PORT_ACTIVE(1); //  2clk
Q12:    PORT_ACTIVE(1); //  2clk
Q10:    PORT_ACTIVE(1); //  2clk
Q08:    PORT_ACTIVE(1); //  2clk
Q00:                    //  0clk
EXIT1:
return;                 //  Exit from the function takes 7 clocks
}   //  END of function delay

// Continued execution after the delay function
PORT_ACTIVE(0);     //  2clk

PORT_ACTIVE(x) is a #define macro that drives the pulsing port. Since I have all the time I need before the pulse starts, I was able to squeeze most of the decision-making overhead in before the port is actually activated. The return instruction then takes pretty much the same amount of time on every call, so I am now able to generate a pulse with a minimum width of 8 clock cycles and up to 20 cycles. I am now extending it up to 100 clocks, at the expense of the available code memory, of course. This solution came from JimmyB's idea to move the pulse activation into the function instead of doing it before the call, and from TCROSLEY's great ideas on how to handle the odd and even delays. It's just that switching to assembly is not really friendly to the debugging experience, and the code does much more than simple delays, so I prefer to stay in C.
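
The fall-through ladder can be checked on a PC before trying it on the target. The sketch below is a host-side model only, not C51 code: PORT_ACTIVE(1) is replaced by "clk += 2" and _nop_() by "clk += 1", using the per-statement costs from the comments in the function. The goto EXIT1 is counted as free and the fixed part of 6 clocks is inferred from the stated 8-clock minimum; both are assumptions, not measurements. The names ladder_clocks, pulse_width_clocks and FIXED_CLOCKS are made up for this illustration.

```c
#include <assert.h>

/* ASSUMPTION: part of the pulse not spent in the ladder, inferred from the
   stated 8-clock minimum (one 2-clock port write in the ladder). */
#define FIXED_CLOCKS 6u

/* Clocks spent in the fall-through ladder for a given delay_time. */
unsigned ladder_clocks(unsigned char delay_time)
{
    unsigned clk = 0u;

    switch (delay_time) {
    case  8: goto Q08;
    case  9: goto Q09;
    case 10: goto Q10;
    case 11: goto Q11;
    case 12: goto Q12;
    case 13: goto Q13;
    case 14: goto Q14;
    case 15: goto Q15;
    case 16: goto Q16;
    case 17: goto Q17;
    case 18: goto Q18;
    case 19: goto Q19;
    case 20: goto Q20;
    default: goto Q00;
    }

Q19: clk += 2u;         /* PORT_ACTIVE(1) */
Q17: clk += 2u;
Q15: clk += 2u;
Q13: clk += 2u;
Q11: clk += 2u;
Q09: clk += 2u;
     clk += 1u;         /* _nop_() adds the odd clock */
     goto EXIT1;        /* counted as free in this model */

Q20: clk += 2u;
Q18: clk += 2u;
Q16: clk += 2u;
Q14: clk += 2u;
Q12: clk += 2u;
Q10: clk += 2u;
Q08: clk += 2u;
Q00:
EXIT1:
    return clk;
}

/* Modelled pulse width for the supported range 8..20. */
unsigned pulse_width_clocks(unsigned char delay_time)
{
    assert(delay_time >= 8u && delay_time <= 20u);
    return ladder_clocks(delay_time) + FIXED_CLOCKS;
}
```

If the model holds, pulse_width_clocks(n) equals n for every n from 8 to 20, so extending the ladder to 100 clocks only means adding more labels in the same odd/even pattern.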

One more note: as soon as I finished celebrating a working solution, I hit the next problem.

SECOND PROBLEM

I need to generate a second pulse back to back with the first one, with an independent width. That means no overhead is allowed for the second pulse; otherwise it will end up with a varying width. This pretty much puts me back where I was before, since the second pulse is again limited by the 6-cycle bottleneck of the while loop, unless there is a way to put the branching overhead for the second pulse before the first pulse …
Any ideas on that?

Best Answer

As others have mentioned, this is best done in assembly. Here is my original attempt at coding this, when I thought the jump instructions took either 2 or 4 cycles (see Edit below for the revised version).

void delay_sub(unsigned char i)
{
// convert i = 20/21, 22/23, 24/25 ... to a count in R7 of 1, 2, 3 ... (extra cycle added if i is odd)
// assumes the carry flag is clear on entry; otherwise rrc shifts a 1 into bit 7 of A
                    ; cycles
    rrc A           ; 1            c = 1 if odd
    jnc even        ; 2 or 4       extra 2 cycles if branch taken (spoils cache)
    nop             ; 1            delete if using lcall's instead of acall's
    nop             ; 1            same
    clr C           ; 1            in either case carry is clear prior to subb
even:
    subb A,#9       ; 1

    mov R7,A        ; 1            R7 now = (i / 2) - 9
    //while (i--);
loop:
    djnz R7, loop   ; 2     loop address should be in cache, so no extra cycles needed
    ret             ; 6
}

timing calculation (assuming acall's): call + routine overhead + loop + ret
if i even:
    5+7+R7*2+6 = 20, 22, 24 ... for R7 = 1, 2, 3 ...
if i odd:
    5+8+R7*2+6 = 21, 23, 25 ... for R7 = 1, 2, 3 ...

This assumes the routine is called with nn as a constant or a byte variable, so that the parameter can be passed with a one-cycle instruction such as MOV A,#n, followed by an ACALL. The minimum delay you can time is 20 clocks, as you asked for.

mov  A,#n        ; 1   
acall delay_sub  ; 4

There is no check that the parameter is greater than or equal to 20; any value less than 20 will give incorrect timing.

The mov instruction and acall will take 5 cycles. First off, the count (i) is divided by two to account for the DJNZ instruction taking two cycles. Then the count is adjusted to add a cycle if i is odd. Finally a fixed value is subtracted so the value in the register to be decremented (R7) is in the range 1, 2, 3 ... R7 is then decremented in a tight loop (two cycles per count). There is a fixed cycle count of 6 for the return.
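
The arithmetic in this paragraph can be checked on a host PC. The sketch below is plain C, not target code, and uses only the cycle counts assumed in the listing (5 for mov + acall, 7 or 8 for the routine overhead, 2 per DJNZ, 6 for ret). The names djnz2_loop_count and djnz2_total_cycles are made up for this illustration.

```c
#include <assert.h>

/* Cycle counts assumed in the listing (datasheet figures, not measured). */
enum {
    CALL_CYCLES   = 5,  /* mov A,#n + acall                     */
    EVEN_OVERHEAD = 7,  /* rrc + taken jnc + subb + mov         */
    ODD_OVERHEAD  = 8,  /* rrc + jnc + 2 nops + clr + subb + mov */
    DJNZ_CYCLES   = 2,
    RET_CYCLES    = 6
};

/* Value left in R7 by the rrc/subb sequence: (i / 2) - 9. */
unsigned djnz2_loop_count(unsigned char i)
{
    assert(i >= 20);                 /* smaller values are not supported */
    return (unsigned)(i / 2) - 9u;
}

/* Total modelled delay in clocks, including call and return. */
unsigned djnz2_total_cycles(unsigned char i)
{
    unsigned overhead = (i & 1u) ? ODD_OVERHEAD : EVEN_OVERHEAD;
    return CALL_CYCLES + overhead
           + DJNZ_CYCLES * djnz2_loop_count(i) + RET_CYCLES;
}
```

With these assumed counts the modelled total equals i for every i from 20 to 255, which is the property the routine relies on.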

If you have to use an LCALL instead of an ACALL, the minimum timing you can do will be 21 clocks instead of 20, and you will need to delete the two nop's after the jnc instruction. You have to use either all ACALL's or all LCALL's; you can't mix them.

I would avoid using C to call the function unless you can guarantee the compiler doesn't add extra overhead. Also, I'm using R7 as a scratch register; your compiler manual will tell you what registers can be used inside an assembler function without having to save them (if any).

This also doesn't account for disabling and re-enabling interrupts, if necessary, to guarantee the timing routine will not be interrupted.

The behavior of the jump instructions is based on the datasheet for the C8051F38x as I understand it (in terms of when the instruction cache is spoiled or not). This may be different for other versions of the 8051.

Finally, I haven't shown the syntax for jumping into in-line assembly and back out again. The subroutine could also be put into a separate file and assembled.

Edit

Since I wrote the original code, the OP has informed me that a jump on his 8051 takes 5 or 6 clock cycles, not the 2 or 4 stated in the datasheet I read. So I have rewritten the routine to take this into account. Unfortunately, this bumps the minimum cycle count that can be timed from 20 to 32. So if counts between 20 and 31 absolutely must be handled, some special-purpose code will need to be written for that case (see below).

void delay_sub(unsigned char i)
{
// minimum value of i is 32 
                    ; cycles
    clr C           ; 1
    subb A,#32      ; 1   adjust for overhead of call and this routine
                    ; a branch could be added here in case the result is negative
    mov B,#6        ; 1
    div AB          ; 4            quotient in A, remainder in B
    mov DPTR,#adjustcycles   ; 1
    mov R7,B        ; 1
    mov B,A         ; 1   save quotient in B as temp
    mov A,#6        ; 1
    clr C           ; 1
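    subb A,R7       ; 1   A now has 6 - B (remainder)
    mov R7,#0       ; 1
    jmp @A+DPTR     ; 6   jump into table to add clocks based on remainder

adjustcycles:       ; execute additional cycles based on remainder (A = 6 - remainder)
    inc R7          ; 1   offset 0, never reached since the remainder is at most 5
    inc R7          ; 1   entry point for remainder of 5
    inc R7          ; 1   entry point for remainder of 4
    inc R7          ; 1   entry point for remainder of 3
    inc R7          ; 1   entry point for remainder of 2
    nop             ; 1   entry point for remainder of 1
                    ;     a remainder of 0 lands directly on the next instruction

    mov A,B         ; 1   now has i / 6, have already adjusted for remainder
loop:
    djnz ACC, loop  ; 6   DJNZ needs a register or direct operand; ACC is the accumulator's direct address
                    ;     caution: DJNZ decrements before testing, so a zero quotient would run 256 passes
    ret             ; 6
}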

timing in clock cycles is: 5 (call) + 21 (fixed overhead) + 6*(i/6) + (i%6) + 6 (ret), where i is the value left after the subb A,#32

if i = 0 after the subtraction, 5 + 21 + 6 = 32, therefore that is the minimum count

Instead of dividing the parameter i by 2 as in the previous example, I now have to divide it by 6 because I am assuming the DJNZ instruction takes 6 cycles. So we need to loop i / 6 times, and also add 0 to 5 cycles for the remainder (i % 6).
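
The split into quotient and remainder can be checked on a host PC. The sketch below is plain C, not target code, and models only the formula given above with its assumed cycle counts (5 for the call, 21 fixed, 6 per DJNZ, 6 for ret). It does not model the fact that a real DJNZ decrements before testing, so a zero quotient would not skip the loop on hardware. The names split_count and div6_total_cycles are made up for this illustration.

```c
#include <assert.h>

/* Cycle counts assumed by the formula above (not measured). */
enum {
    CALL6_CYCLES  = 5,
    FIXED6_CYCLES = 21,
    DJNZ6_CYCLES  = 6,
    RET6_CYCLES   = 6,
    MIN_COUNT     = 32
};

struct split {
    unsigned char quotient;     /* number of 6-cycle loop passes         */
    unsigned char remainder;    /* 0..5 extra single cycles              */
    unsigned char table_offset; /* value in A for jmp @A+DPTR: 6 - rem   */
};

/* Mirrors subb A,#32 / div AB / subb A,R7 from the listing. */
struct split split_count(unsigned char i)
{
    struct split s;
    unsigned char n;

    assert(i >= MIN_COUNT);
    n = (unsigned char)(i - MIN_COUNT);
    s.quotient = (unsigned char)(n / 6);
    s.remainder = (unsigned char)(n % 6);
    s.table_offset = (unsigned char)(6 - s.remainder);
    return s;
}

/* Total modelled delay in clocks according to the formula. */
unsigned div6_total_cycles(unsigned char i)
{
    struct split s = split_count(i);
    return CALL6_CYCLES + FIXED6_CYCLES
           + DJNZ6_CYCLES * s.quotient + s.remainder + RET6_CYCLES;
}
```

For example, a request of 43 clocks leaves 11 after the subtraction, which splits into one loop pass plus 5 single cycles, and the jump lands at offset 1 of the table.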

The remainder of my comments above pretty well apply to this example. I am leaving the original code, in case anyone actually has an 8051 with a two-cycle DJNZ instruction.

For counts of 20-31, you could create a subroutine with just one nop, that takes 12 cycles including the call and return:

void delay12(void)
{
    nop
}

For counts of 20-23, you would call it once and add 8 to 11 nops after the call (or a dummy jump to the next instruction, which eats up 6 cycles, plus 2 to 5 nops; delaying 20 cycles would then cost just four instructions plus the subroutine, which is assumed to be used more than once). For counts of 24-31, you would call delay12 twice and add 0 to 5 nops and/or a jump instruction as needed.

So to delay 20 cycles:

    acall delay12   ; 12  call + nop + ret
    sjmp next       ; 6   dummy jump to the next instruction
next:
    nop
    nop
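
The same bookkeeping works for any count from 20 to 31. Here is a minimal host-side sketch in plain C that picks the number of delay12 calls, dummy jumps and nops, assuming 12 cycles per delay12 call and 6 per jump as stated above. The names short_delay_plan and plan_short_delay are made up for this illustration, and the resulting instruction sequence would still have to be written out by hand for each count.

```c
#include <assert.h>

/* Cycle costs assumed in the text above (not measured). */
enum { DELAY12_CYCLES = 12, JUMP_CYCLES = 6 };

struct short_delay_plan {
    unsigned calls;  /* number of "acall delay12"                 */
    unsigned jumps;  /* number of dummy jumps to the next address */
    unsigned nops;   /* number of single-cycle nops               */
};

/* Decompose a delay of 20..31 cycles into calls, at most one jump, and nops. */
struct short_delay_plan plan_short_delay(unsigned cycles)
{
    struct short_delay_plan p;
    unsigned rest;

    assert(cycles >= 20u && cycles <= 31u);
    p.calls = (cycles >= 24u) ? 2u : 1u;
    rest = cycles - p.calls * DELAY12_CYCLES;       /* 0..11 cycles left */
    p.jumps = (rest >= JUMP_CYCLES) ? 1u : 0u;
    p.nops = rest - p.jumps * JUMP_CYCLES;          /* 0..5 nops */
    return p;
}
```

For 20 cycles this yields one call, one jump and two nops, matching the sequence shown above.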