As the title says, I need exact, fairly short delays, ideally anywhere in the range 0 – 350 CPU clocks. If a solution only works over a narrower range, the absolute minimum I need is 20 – 127 CPU clocks.
With a 50 MHz CPU clock, these delays are below or just above one microsecond, so they are relatively short: from several clocks to several tens of clocks.
The problem with polling a timer is that the resolution is a step of up to 7 clocks, depending on the implementation. For example:
- while(!TF0) {}
The while, the negation, and the bit test together take 7 clocks. So if I ask for anything between 15 and 21 clocks, I get a flat delay of 21 clocks …
- Using a timer interrupt with CPU stop mode: this gives good results above 50 clocks. It probably depends on the current CPU state, so the delay sometimes goes well beyond 50 clocks, into the 100-clock range, because of interrupt and wake-up latency. Anything shorter again comes out as a flat 50 (or 100) CPU clocks.
- Using a switch-case with, for example, 30 entries for 30 delays in 1-clock increments, each with a different number of NOPs: compiler optimization makes the timing unpredictable and mostly too long again, over 100 clocks. This renders the approach unusable.
- I am planning to try a table of pointers to functions with different numbers of NOPs. But before trying it I already see two problems with that approach: a. it will require a lot of memory, and I have 1k left; b. the call-and-return latency of a void function(void) is around 18 clocks, so it is very tight to meet the absolute minimum of 20 clocks I need …
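The quantization described for the polling approach can be modeled with a few lines of C. This is only a sketch of the arithmetic, assuming the 7-clock cost per poll pass quoted above and ignoring any fixed setup overhead; the names polled_delay_clocks and POLL_LOOP_CLOCKS are made up for illustration.

```c
/* Model only: the 7-clock figure is the one quoted in the question,
   it is not measured by this code. */
#define POLL_LOOP_CLOCKS 7u  /* cost of one while(!TF0) {} pass */

/* Delay actually obtained when polling: the requested delay rounded up
   to the next multiple of the poll-loop cost. */
unsigned polled_delay_clocks(unsigned requested_clocks)
{
    unsigned passes =
        (requested_clocks + POLL_LOOP_CLOCKS - 1u) / POLL_LOOP_CLOCKS;
    return passes * POLL_LOOP_CLOCKS;
}
```

Under this model every request from 15 to 21 clocks returns 21, which is the flat step described in the first bullet.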
How should I approach this type of problem? Any ideas are more than welcome.
By the way, I run this on a C8051F38x microcontroller from Silicon Labs, using Keil C51 to code and compile, if that matters.
The code that emerged as a partial solution seems to follow the same timing as a while loop in C does, and the "djnz" instruction takes 5-6 CPU cycles instead of the 2/4 stated in the datasheet.
ACC_save= ACC;
ACC = counter102;
P0b3 = 1; // Start the Pulse
#pragma ASM // Precise DELAY using assembler
clr C // ; 1 Clear Carry
rrc A // ; 1 C = 1 if odd
jnc even // ; 2 or 4 extra 2 cycles if branch taken (spoils cache)
nop // ; 1
nop // ; 1
clr C // ; 1
even:
subb A,#4 // ; 1
mov R7,A // ; 1
loop:
djnz R7, loop // ; supposed to be 2, but practically takes 5 to 6 cycles!
#pragma ENDASM
P0b3 = 0; // Stop the Pulse
EDIT
Thank you very much, everyone, for the great input. I couldn't have imagined that the flow of ideas would be so positive and, most importantly, productive. My deep appreciation to all who contributed and who will contribute in the future.
After your valuable input and great ideas, I came up with something that works for me, to some extent. The code is below:
void delay(unsigned char delay_time) {
switch (delay_time)
{case 8: goto Q08;
case 9: goto Q09;
case 10: goto Q10;
case 11: goto Q11;
case 12: goto Q12;
case 13: goto Q13;
case 14: goto Q14;
case 15: goto Q15;
case 16: goto Q16;
case 17: goto Q17;
case 18: goto Q18;
case 19: goto Q19;
case 20: goto Q20;
default : goto Q00; }
Q19: PORT_ACTIVE(1); // 2clk
Q17: PORT_ACTIVE(1); // 2clk
Q15: PORT_ACTIVE(1); // 2clk
Q13: PORT_ACTIVE(1); // 2clk
Q11: PORT_ACTIVE(1); // 2clk
Q09: PORT_ACTIVE(1); // 2clk
_nop_(); // 1clk
goto EXIT1; // Skip the Even delay part
Q20: PORT_ACTIVE(1); // 2clk
Q18: PORT_ACTIVE(1); // 2clk
Q16: PORT_ACTIVE(1); // 2clk
Q14: PORT_ACTIVE(1); // 2clk
Q12: PORT_ACTIVE(1); // 2clk
Q10: PORT_ACTIVE(1); // 2clk
Q08: PORT_ACTIVE(1); // 2clk
Q00: // 0clk
EXIT1:
return; // Exit from the function takes 7 clocks
} // END of function delay
// Continued execution after the delay function
PORT_ACTIVE(0); // 2clk
PORT_ACTIVE(x) is a #define macro that drives the pulsing port. Since I have all the time I need before the pulse starts, I was able to squeeze most of the decision overhead in before the port is actually activated. The return instruction then takes pretty much the same amount of time on every call, so I can now generate a pulse with a minimum width of 8 clock cycles and up to 20 cycles. I am now extending it up to 100 clocks, at the expense of the available storage memory, of course. This solution is thanks to JimmyB's idea of moving the pulse activation into the function instead of before it, and of course to TCROSLEY's great ideas on how to manage the odd and even delays. It's just that switching to assembly is not really friendly to the debugging experience, and the code does much more than simple delays, so I prefer to stay in C.
One more note: as soon as I finished celebrating a working solution, I hit the next problem.
SECOND PROBLEM
I need to generate a second pulse back to back with the first one, with an independent width. So there can be no overhead for the second pulse, otherwise it will end up with a varying width. This pretty much puts me back where I was before, since the second pulse is again limited by the 6-cycle bottleneck of the while loop, unless there is a way to move the branching overhead for the second pulse to before the first pulse …
Any ideas on that?
Best Answer
As others have mentioned, this is best done in assembly. Here is my original attempt at coding this, when I thought the jump instructions took either 2 or 4 cycles (see Edit below for the revised version).
It assumes a call is made like ACALL(nn), where nn is a constant or a byte variable, so that the parameter can be passed with a one-cycle instruction such as MOV A,#n. The minimum delay it can time is 20 clocks, as you asked for.
There is no check that the parameter is greater than or equal to 20; any value less than 20 will give incorrect timing.
The mov instruction and acall take 5 cycles. First, the count (i) is divided by two, because the DJNZ instruction takes two cycles. Then one cycle is added if i is odd. Finally, a fixed value is subtracted so that the value in the register to be decremented (R7) is in the range 1, 2, 3 ... R7 is then decremented in a tight loop (two cycles per count). The return adds a fixed 6 cycles.
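The arithmetic in the paragraph above can be checked with a small C model. This is a sketch only, not the assembly routine itself. The fixed overhead of 18 clocks is an assumption, inferred from the stated minimum of 20 clocks with R7 = 1 and two clocks per loop pass; the names plan_delay_2clk, planned_clocks_2clk and delay_plan are hypothetical.

```c
#include <stdint.h>

#define DJNZ_CLOCKS   2u  /* clocks per loop pass, the datasheet-based figure */
#define FIXED_CLOCKS 18u  /* ASSUMPTION: 20 (stated minimum) - 1 * DJNZ_CLOCKS */

typedef struct {
    uint8_t loop_count;  /* value that would be loaded into R7 */
    uint8_t odd_pad;     /* 1 extra clock when i is odd, otherwise 0 */
} delay_plan;

/* Split a requested delay i (valid only for i >= 20) into loop passes
   plus the one-clock adjustment for odd values. */
delay_plan plan_delay_2clk(uint8_t i)
{
    delay_plan p;
    p.odd_pad = (uint8_t)(i & 1u);
    /* halve the count, then subtract the fixed part (also halved) */
    p.loop_count = (uint8_t)((i / 2u) - (FIXED_CLOCKS / DJNZ_CLOCKS));
    return p;
}

/* Total clocks the plan would take under the assumptions above. */
unsigned planned_clocks_2clk(delay_plan p)
{
    return FIXED_CLOCKS + p.odd_pad + DJNZ_CLOCKS * p.loop_count;
}
```

For i = 20 the model gives one loop pass and no padding, and for i = 21 one pass plus the extra clock, so the planned total equals i in both cases.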
If you have to use an LCALL instead of an ACALL, the minimum delay will be 21 clocks instead of 20, and you will need to delete the two nops after the jnc instruction. You have to use either all ACALLs or all LCALLs; you can't mix them.
I would avoid using C to call the function unless you can guarantee the compiler doesn't add extra overhead. Also, I'm using R7 as a scratch register; your compiler manual will tell you what registers can be used inside an assembler function without having to save them (if any).
This also doesn't account for disabling and re-enabling interrupts, if necessary, to guarantee the timing routine will not be interrupted.
The behavior of the jump instructions is based on the datasheet for the C8051F38x as I understand it (in terms of when the instruction cache is spoiled or not). This may be different for other versions of the 8051.
Finally, I haven't shown the syntax for jumping into in-line assembly and back out again. The subroutine could also be put into a separate file and assembled.
Edit
Since I wrote the original code, the OP has informed me that the number of clock cycles for a jump in his 8051 is 5 or 6, not the 2 or 4 stated in the datasheet I read. So I have rewritten the routine to take this into account. Unfortunately, this bumps the minimum cycle count that can be timed to 32 instead of 20. So if counts between 20 and 31 absolutely must be handled, some special-purpose code will need to be written for that case (see below).
Instead of dividing the parameter i by 2 as in the previous example, I now have to divide it by 6 because I am assuming the DJNZ instruction takes 6 cycles. So we need to loop i / 6 times, and also add 0 to 5 cycles for the remainder (i % 6).
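The divide-by-6 split can be sketched in C as follows. This models the arithmetic only. The fixed overhead of 26 clocks is an assumption, inferred from the stated minimum of 32 clocks with a single 6-clock loop pass; the routine's real constant may differ, and the names plan_delay_6clk, planned_clocks_6clk and delay_plan6 are hypothetical.

```c
#include <stdint.h>

#define DJNZ6_CLOCKS   6u  /* clocks per loop pass, as reported by the OP */
#define FIXED6_CLOCKS 26u  /* ASSUMPTION: 32 (stated minimum) - 1 * DJNZ6_CLOCKS */

typedef struct {
    uint8_t loop_count;  /* number of DJNZ passes */
    uint8_t pad_clocks;  /* 0..5 single clocks covering the remainder */
} delay_plan6;

/* Split a requested delay i (valid only for i >= 32) into 6-clock loop
   passes plus 0..5 padding clocks. */
delay_plan6 plan_delay_6clk(uint8_t i)
{
    delay_plan6 p;
    unsigned variable_part = (unsigned)i - FIXED6_CLOCKS;
    p.loop_count = (uint8_t)(variable_part / DJNZ6_CLOCKS);
    p.pad_clocks = (uint8_t)(variable_part % DJNZ6_CLOCKS);
    return p;
}

/* Total clocks the plan would take under the assumptions above. */
unsigned planned_clocks_6clk(delay_plan6 p)
{
    return FIXED6_CLOCKS + DJNZ6_CLOCKS * p.loop_count + p.pad_clocks;
}
```

In this model the quotient and remainder are taken after removing the assumed fixed part, so a request of 37 clocks becomes one loop pass plus five padding clocks, and 38 becomes two passes with no padding.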
The remainder of my comments above pretty well apply to this example. I am leaving the original code, in case anyone actually has an 8051 with a two-cycle DJNZ instruction.
For counts of 20-31, you could create a subroutine with just one nop, that takes 12 cycles including the call and return:
For counts of 20-23, you would call it once and add 8 to 11 nops after the call (or a dummy jump to the next instruction, which eats up 6 cycles, plus 2 to 5 nops; so delaying 20 cycles costs just four instructions plus the subroutine, which is assumed to be used more than once). For counts of 24-31, you would call delay12 twice and add 0 to 5 nops and/or a jump instruction as needed.
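The instruction mix for each count from 20 to 31 follows mechanically from those figures. Here is a hedged C sketch that computes it, assuming 12 clocks per delay12 call, 6 clocks per dummy jump and 1 clock per nop as stated above; the names plan_short_delay and short_delay_plan are made up for illustration.

```c
#define DELAY12_CLOCKS 12u  /* call + one nop + return, per the answer */
#define JUMP_CLOCKS     6u  /* dummy jump to the next instruction */

typedef struct {
    unsigned calls;  /* number of calls to delay12 */
    unsigned jumps;  /* number of dummy jumps */
    unsigned nops;   /* number of single-clock nops */
} short_delay_plan;

/* Plan a delay of 20..31 clocks using the fewest instructions:
   one or two delay12 calls, then a dummy jump if at least 6 clocks
   are still missing, then nops for the rest. */
short_delay_plan plan_short_delay(unsigned clocks)
{
    short_delay_plan p;
    unsigned rest;
    p.calls = (clocks >= 2u * DELAY12_CLOCKS) ? 2u : 1u;
    rest = clocks - p.calls * DELAY12_CLOCKS;
    p.jumps = rest / JUMP_CLOCKS;
    p.nops = rest % JUMP_CLOCKS;
    return p;
}
```

For 20 clocks this gives one call, one jump and two nops, i.e. the four instructions mentioned above.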
So to delay 20 cycles: