Electrical – How to implement precise delay function in Keil c51 (for c8051) that waits exact amount of CPU clocks, the number of clocks ranges from 1 to 255

c51 delay keil timer timing

As the title says, I need exact, short delays, ideally anywhere in the range of 0 – 350 CPU clocks; if a solution only works over a narrower range, the absolute minimum I need is 20 – 127 CPU clocks.
These delays are below or just above one microsecond (50 MHz CPU clock), i.e. from several clocks to several tens of clocks.
The problem with polling a timer is that the resolution is limited to a step of up to 7 clocks, depending on the implementation. For example:

while(!TF0) {}
The while, the negation, and the bit test together take 7 clocks. So if I ask for anything between 15 and 21 clocks, the result is a flat delay of 21 clocks …
Using a timer interrupt together with the CPU stop mode gives good results above 50 clocks. It probably depends on the current CPU state, because the interrupt and wake-up latency sometimes pushes it well beyond 50 clocks, into the 100-clock range. Anything shorter again comes out as a flat 50 (or 100) CPU clocks.
Using a switch-case with, for example, 30 entries for 30 delays in 1-clock increments, each with a different number of NOPs, results in compiler optimization that makes the timing unpredictable and mostly too long again, over 100 clocks. This renders the approach unusable.
I am planning to try a table of pointers to functions with different numbers of NOPs. But before I try, I already see two problems with that approach: a. it will require a lot of memory, and I have 1k left; b. the latency of entering and leaving a void function(void) is around 18 clocks, so it is very tight against the absolute minimum of 20 clocks I need …
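The granularity problem of the polling approach described above can be illustrated with a small host-side C model. This is only a sketch: the function name `polled_delay_clocks` is made up for illustration, and the 7-clock loop period is the figure quoted in the question, not a measured constant.

```c
#include <assert.h>

/* Model of a polled wait: the loop can only exit after a whole iteration,
   so the achieved delay is the request rounded up to the loop period.
   loop_clocks = 7 is the figure quoted in the question (an assumption). */
unsigned polled_delay_clocks(unsigned requested, unsigned loop_clocks)
{
    assert(loop_clocks > 0);
    return ((requested + loop_clocks - 1u) / loop_clocks) * loop_clocks;
}
```

With a 7-clock loop, every request from 15 to 21 clocks collapses to the same 21-clock delay, which is the flat step observed.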

How should I approach this type of problem? Any ideas will be more than welcome.

By the way, I run it on C8051F38x microcontroller from Silicon Labs, using C51 and Keil to code and compile if that matters.

The code that emerged as a partial solution seems to follow the same timing as a while loop in C, and the "djnz" instruction takes 5-6 CPU cycles instead of the 2/4 stated in the datasheet.

ACC_save=   ACC;
ACC     =   counter102;
P0b3    =   1;  //             Start the Pulse
#pragma ASM     //             Precise DELAY using assembler
clr C           //  ; 1       Clear Carry
rrc A           //  ; 1       C = 1 if odd
jnc even        //  ; 2 or 4  extra 2 cycles if branch taken (spoils cache)
nop             //  ; 1
nop             //  ; 1
clr   C         //  ; 1
even:
subb  A,#4      //  ; 1
mov   R7,A      //  ; 1
loop:
djnz  R7, loop  //  ; supposed to be 2, but practically takes 5 to 6 cycles!
#pragma ENDASM
P0b3    =   0;  //             Stop the Pulse

EDIT

Thank you very much, everyone, for the great input. I couldn't imagine that the flow of ideas could be so positive and, most importantly, productive. My deep appreciation to all who contributed and will contribute in the future.
After your valuable input and great ideas, I came up with something that works for me, to some extent. The code is below:

void    delay(unsigned char delay_time) {
switch (delay_time)
{case     8:    goto     Q08;
case      9:    goto     Q09;
case     10:    goto     Q10;
case     11:    goto     Q11;
case     12:    goto     Q12;
case     13:    goto     Q13;
case     14:    goto     Q14;
case     15:    goto     Q15;
case     16:    goto     Q16;
case     17:    goto     Q17;
case     18:    goto     Q18;
case     19:    goto     Q19;
case     20:    goto     Q20;
default      :  goto     Q00;   }

Q19:    PORT_ACTIVE(1); //  2clk
Q17:    PORT_ACTIVE(1); //  2clk
Q15:    PORT_ACTIVE(1); //  2clk
Q13:    PORT_ACTIVE(1); //  2clk
Q11:    PORT_ACTIVE(1); //  2clk
Q09:    PORT_ACTIVE(1); //  2clk
    _nop_();            //  1clk
goto EXIT1;             //  Skip the Even delay part

Q20:    PORT_ACTIVE(1); //  2clk
Q18:    PORT_ACTIVE(1); //  2clk
Q16:    PORT_ACTIVE(1); //  2clk
Q14:    PORT_ACTIVE(1); //  2clk
Q12:    PORT_ACTIVE(1); //  2clk
Q10:    PORT_ACTIVE(1); //  2clk
Q08:    PORT_ACTIVE(1); //  2clk
Q00:                    //  0clk
EXIT1:
return;                 //  Exit from the function takes 7 clocks
}   //  END of function delay

// Continued execution after the delay function
PORT_ACTIVE(0);     //  2clk

PORT_ACTIVE(x) is a #define macro that drives the pulsing port. Since I have all the time I need before the pulse starts, I was able to move most of the decision overhead ahead of the actual activation of the port. The return then always takes the same amount of time, so I can now generate a pulse from a minimum of 8 clock cycles wide up to 20 cycles. I am now extending it up to 100 clocks, at the expense of the available code memory, of course. This solution comes from JimmyB's idea of moving the pulse activation into the function rather than before it, and from TCROSLEY's great ideas on how to handle the odd and even delays. It's just that switching to assembly is not friendly to the debugging experience, and the code does much more than simple delays, so I prefer to stay in C.

One more note, is that as soon as I finished celebrating a working solution, I hit the next problem.

SECOND PROBLEM

I need to execute a second pulse back to back with the first one, with an independent width. That means no overhead for the second pulse, otherwise it will end up with a varying width. This pretty much puts me back where I was before, since the second pulse is again limited by the 6-cycle bottleneck of the while loop, unless there is a way to move the branching overhead for the second pulse ahead of the first pulse …
Any ideas on that?

Best Answer

As others have mentioned, this is best done in assembly. Here is my original attempt at coding this, when I thought the jump instructions took either 2 or 4 cycles (see Edit below for the revised version).

void delay_sub(unsigned char i)
{
// convert 20, 21, 22 etc to count in R7 of 1, 2, 3 (extra cycle added if i is odd)
                    ; cycles
    rrc A           ; 1            c = 1 if odd (carry must be clear on entry, since rrc rotates it into bit 7)
    jnc even        ; 2 or 4       extra 2 cycles if branch taken (spoils cache)
    nop             ; 1            delete if using lcall's instead of acall's
    nop             ; 1            same
    clr C           ; 1            in either case carry is clear prior to subb
even:
    subb A,#9       ; 1

    mov R7,A        ; 1            R7 now = (i / 2) - 9
    //while (i--);
loop:
    djnz R7, loop   ; 2     loop address should be in cache, so no extra cycles needed
    ret             ; 6
}

timing calculation (assuming acall's)
if i even:
    5+7+R7*2+6 = minimum of 20 22 24 ... => R7 = 1, 2, 3 ...
if i odd:
    5+8+R7*2+6 = minimum of 21 23 25 ... => R7 = 1, 2, 3 ...
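As a cross-check of the timing calculation above, here is a minimal host-side C model. It is a sketch that only adds up the cycle counts annotated in the listing (1 + 4 for the mov and acall, 7 or 8 for the prologue, 2 per DJNZ, 6 for the ret); the function name `original_delay_cycles` is hypothetical, and the model is only as good as those annotated cycle counts.

```c
#include <assert.h>

/* Cycle model of the original routine, using the annotated costs.
   Assumes DJNZ takes 2 cycles and the call is MOV A,#n + ACALL. */
unsigned original_delay_cycles(unsigned char i)
{
    assert(i >= 20);                    /* smaller values give R7 < 1 */
    unsigned odd  = i & 1u;             /* carry after RRC A */
    unsigned r7   = (i >> 1) - 9u;      /* SUBB A,#9 with carry clear */
    unsigned call = 5u;                 /* MOV A,#n (1) + ACALL (4) */
    unsigned pre  = odd ? 8u : 7u;      /* prologue: odd path costs one more */
    unsigned ret  = 6u;
    return call + pre + 2u * r7 + ret;
}
```

Under these assumptions the model returns exactly i for every i from 20 to 255, so the odd/even adjustment gives single-clock resolution.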

It assumes the routine is called with ACALL, with the parameter nn being either a constant or a byte variable, so that it can be passed in A using a one-cycle instruction such as MOV A,#n. The minimum timing you can do is 20 clocks, as you asked for.

mov  A,#n        ; 1   
acall delay_sub  ; 4

There is no check that the parameter is greater than or equal to 20; any value less than 20 will give incorrect timing.

The mov instruction and acall will take 5 cycles. First off, the count (i) is divided by two to account for the DJNZ instruction taking two cycles. Then the count is adjusted to add a cycle if i is odd. Finally a fixed value is subtracted so the value in the register to be decremented (R7) is in the range 1, 2, 3 ... R7 is then decremented in a tight loop (two cycles per count). There is a fixed cycle count of 6 for the return.

If you have to use an LCALL instead of an ACALL, the minimum timing you can do will be 21 clocks instead of 20, and you will need to delete the two nop's after the jnc instruction. You have to use either all ACALL's or all LCALL's; you can't mix them.

I would avoid using C to call the function unless you can guarantee the compiler doesn't add extra overhead. Also, I'm using R7 as a scratch register; your compiler manual will tell you what registers can be used inside an assembler function without having to save them (if any).

This also doesn't account for disabling and re-enabling interrupts, if necessary, to guarantee the timing routine will not be interrupted.

The behavior of the jump instructions is based on the datasheet for the C8051F38x as I understand it (in terms of when the instruction cache is spoiled or not). This may be different for other versions of the 8051.

Finally, I haven't shown the syntax for jumping into in-line assembly and back out again. The subroutine could also be put into a separate file and assembled.

Edit

Since I wrote the original code, the OP has informed me that the number of clock cycles for a jump in his 8051 is 5 or 6, not the 2 or 4 stated in the datasheet I read. So I have rewritten the routine to take this into account. Unfortunately, this bumps the minimum cycle count that can be timed to 32 instead of 20. So if counts between 20 and 31 absolutely must be handled, some special-purpose code will need to be written for that case (see below).

void delay_sub(unsigned char i)
{
// minimum value of i is 32 
                    ; cycles
    clr C           ; 1
    subb A,#32      ; 1   adjust for overhead of call and this routine
                    ; a branch could be added here in case the result is negative
    mov B,#6        ; 1
    div AB          ; 4            quotient in A, remainder in B
    mov DPTR,#adjustcycles   ; 1
    mov R7,B        ; 1
    mov B,A         ; 1   save quotient in B as temp
    mov A,#6        ; 1
    clr C           ; 1
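    subb A,R7       ; 1   A now has 6 - B (remainder)
    mov R7,#0       ; 1
    jmp @A+DPTR     ; 6   jump into table to add clocks based on remainder

adjustcycles:       ; execute additional cycles based on remainder (entry offset = 6 - remainder)
    inc R7          ; 1   offset 0, not reached (remainder is at most 5)
    inc R7          ; 1   entry for remainder of 5
    inc R7          ; 1   entry for remainder of 4
    inc R7          ; 1   entry for remainder of 3
    inc R7          ; 1   entry for remainder of 2
    nop             ; 1   entry for remainder of 1; remainder of 0 lands on the next instruction

    mov A,B         ; 1   now has (i - 32) / 6, have already adjusted for remainder
loop:
    djnz ACC,loop   ; 6   djnz needs a register or direct operand; ACC is the accumulator's direct address
    ret             ; 6
}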

timing in clock cycles is: 5 (call) + 21 (fixed overhead) + 6*(n/6) + (n%6) + 6 (ret), where n = i - 32

if n = 0 (i = 32), 5 + 21 + 6 = 32 therefore that is the minimum count

Instead of dividing the parameter i by 2 as in the previous example, I now have to divide by 6 because I am assuming the DJNZ instruction takes 6 cycles. So we need to loop n / 6 times, and also add 0 to 5 cycles for the remainder (n % 6).
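The timing formula for the revised routine can be checked with a small host-side C model. This is a sketch, not a measurement: the name `revised_delay_cycles` is hypothetical, and the 5, 21 and 6 cycle figures are the ones stated above, taken at face value.

```c
#include <assert.h>

/* Cycle model of the revised routine, using the stated costs:
   5 (call) + 21 (fixed overhead) + 6 per loop pass + remainder + 6 (ret).
   Assumption: the loop body runs exactly q times. A real DJNZ entered
   with a count of 0 decrements to 255 first, so q == 0 would need its
   own path on hardware; the model does not cover that. */
unsigned revised_delay_cycles(unsigned char i)
{
    assert(i >= 32);
    unsigned adj = i - 32u;   /* SUBB A,#32 */
    unsigned q   = adj / 6u;  /* DIV AB: quotient, number of loop passes */
    unsigned r   = adj % 6u;  /* DIV AB: remainder, extra single cycles */
    return 5u + 21u + 6u * q + r + 6u;
}
```

Under these assumptions the total equals the requested count for every value from 32 to 255.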

The remainder of my comments above pretty well apply to this example. I am leaving the original code, in case anyone actually has a 8051 with a two-cycle DJNZ instruction.

For counts of 20-31, you could create a subroutine with just one nop, that takes 12 cycles including the call and return:

void delay12(void)
{
    nop
}

For 20-23 counts, you would call it once and add 8 to 11 nops after the call, or a dummy jump to the next instruction, which eats up 6 cycles, plus 2 to 5 nops. Delaying 20 cycles would then cost just four instructions plus the subroutine, which is assumed to be used more than once. For 24-31 counts, you would call delay12 twice, and add 0 to 5 nops and/or a jump instruction as needed.

So to delay 20 cycles:

    acall delay12   ; 12 including the call and return
    sjmp next       ; 6  dummy jump to the next instruction
next:
    nop             ; 1
    nop             ; 1
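The way a count of 20 to 31 is split into delay12 calls, a dummy jump, and nops can be sketched as a small host-side C helper. The struct and function names are made up for illustration, and the costs (12 per call, 6 per jump, 1 per nop) are the figures assumed above.

```c
#include <assert.h>

/* How many of each building block to emit for a short delay. */
struct short_delay {
    unsigned calls;  /* calls to delay12, 12 cycles each */
    unsigned jumps;  /* dummy jumps to the next instruction, 6 cycles each */
    unsigned nops;   /* single-cycle nops */
};

/* Split a 20..31 cycle delay into calls, at most one jump, and nops. */
struct short_delay plan_short_delay(unsigned cycles)
{
    struct short_delay p;
    unsigned rest;

    assert(cycles >= 20u && cycles <= 31u);
    p.calls = cycles / 12u;              /* 20-23: one call, 24-31: two */
    rest    = cycles - 12u * p.calls;    /* cycles still to be burned */
    p.jumps = (rest >= 6u) ? 1u : 0u;    /* a jump replaces six nops */
    p.nops  = rest - 6u * p.jumps;
    return p;
}
```

For 20 cycles this gives one call, one jump and two nops, matching the listing above.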

Related Solutions

Electronic – Using BLE, how fast can I get ~50 bytes of data from the master to a slave? How does that change when increasing the number of slaves present

You are going to be limited by what is known as the Connection Interval. This defines how often the Central (host) communicates with the Peripheral.

BLE uses a frequency-hopping scheme; two devices each send and receive data from one another on a specific channel, and then meet on a new channel sometime later. The time between hops is defined as the Connection Interval.

There can be a maximum of six packets (four for iOS) sent per connection interval, and each packet can have up to 20 bytes of payload. So in your case you would use up three packets, which would fit in one connection interval. You would waste the remaining packets.

According to the BLE specification, the allowable range for the Connection Interval parameter is from 7.5 ms to four seconds. With a minimum Connection Interval of 7.5 ms, you could theoretically have 133 connection events per second. So your throughput would be 133*50*8 = 53.2 kb/s. This doesn't count the time to establish an initial connection while scanning for Peripheral BLE devices (which will be sending out Advertising packets on a periodic basis). Also, if an ACK has to be sent back to acknowledge receipt of the packets of data, this will require another Connection Interval, and thus halve the data rate.

Note: the Apple iOS guidelines limit the minimum connection interval to 20 ms, not 7.5. So you would only have a maximum of 50 connection events per second. In that case the throughput would be 50*50*8 = 20.0 kb/s.
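The throughput arithmetic above can be reproduced with a few lines of C. This is only a sketch of the calculation: the function name is hypothetical, it counts whole connection events per second, and it assumes the full payload is delivered in every event.

```c
#include <assert.h>

/* Payload throughput in bits per second for one transfer per connection
   event. The interval is given in tenths of a millisecond so that 7.5 ms
   can be passed as the integer 75. */
unsigned long ble_throughput_bps(unsigned interval_tenth_ms, unsigned payload_bytes)
{
    unsigned long events_per_s;

    assert(interval_tenth_ms > 0);
    events_per_s = 10000ul / interval_tenth_ms;  /* whole events only */
    return events_per_s * payload_bytes * 8ul;
}
```

For 50 bytes, an interval of 7.5 ms gives 53200 bit/s and an interval of 20 ms gives 20000 bit/s.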

Here's some more information about BLE data rates.

Electrical – Introduce delays of alternating value for synthesis on hardware

From the other question, you already have a way of delaying the signal by between 1 and 4 clock cycles:

reg[3:0] bits;

always @ (posedge clk_27)
begin
    vsync_o <= bits[3]; //this was in the original code but actually makes it 5 cycles! But whatever.
    bits <= {bits[2:0], vsync};
end

So you can easily modify that to have two signals coming out, one with 4 cycles of delay and the other with 2 cycles of delay (hint: if bit 3 is 4 cycles of delay, which bit is 2 cycles?).

Now you have two signals: one with 4 cycles of delay, and one with two cycles of delay. All that is needed then is a way to select which one is which. The word "select" implies you need a multiplexer, a 2:1 one in fact (two signals in, one out).

That means you also need something to drive the select line. Given it is 2:1, this is just a 1-bit control signal which you need to toggle every time there is a pulse. What do you know that can toggle? A flip-flop. So you need a flip-flop that toggles every time there is a pulse on Vsync.

The next question is, how do you detect a pulse on Vsync? Well, you see if there is a rising edge (or falling edge, or both). For that you need a rising edge detector. The structure is essentially:

reg delay;
always @ (posedge clock) begin
    delay <= in;
end
wire risingedge = !delay && in;   // low on the previous clock, high now
wire fallingedge = delay && !in;  // high on the previous clock, low now
wire bothedges = delay != in;

So you need the signal, plus the signal delayed by one clock cycle. You have vsync, and your bits register has the signal delayed by one clock cycle, so you already have all the signals you need to build an edge detector.

reg whichDelay;
always @ (posedge clock or posedge reset) begin
    if (reset) begin
        whichDelay <= 1'b1; //Start with a delay of 2 because on the first rising edge we will toggle.
    end else if (vsync && !bits[0]) begin
        whichDelay <= !whichDelay; //Toggle on each vsync input
    end
end

assign vsync_o = whichDelay ? bits[1] : bits[3];

And there you have it.
