Let's say that we want to test this thoroughly, but without going through the entire 2^32 space of possible operands. (Such an adder cannot have a bug that affects only a single combination of operands, which is the only kind of bug that would require an exhaustive search of the 2^32 space, so testing it that way is inefficient.)
If the individual adders are working correctly, and the ripple propagation between them works correctly, then the whole adder is correct.
I would give priority to test cases that focus on stressing the carry rippling, since the adders have been individually tested.
My first test case would be adding 1 to 1111..1111 which causes a carry out of every bit. The result should be zero, with a carry out of the highest bit.
(Every test case should be tried over both commutations: A + B and B + A, by the way.)
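As a minimal sketch of that first test case, here is a bit-level software model of a 16-bit ripple-carry adder, checked in both operand orders. The names `full_adder`, `ripple_add` and `check_both_orders` are made up for illustration, and the model stands in for whatever device or simulation is actually under test.

```python
WIDTH = 16
MASK = (1 << WIDTH) - 1  # 1111...1111


def full_adder(a, b, cin):
    """One-bit full adder: returns (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def ripple_add(a, b, width=WIDTH):
    """Chain `width` full adders, carry rippling from bit 0 upward.

    Returns (sum, carry_out of the highest bit).
    """
    carry = 0
    result = 0
    for i in range(width):
        s, carry = full_adder((a >> i) & 1, (b >> i) & 1, carry)
        result |= s << i
    return result, carry


def check_both_orders(a, b, expected_sum, expected_carry):
    """Run one test case over both commutations, A + B and B + A."""
    assert ripple_add(a, b) == (expected_sum, expected_carry)
    assert ripple_add(b, a) == (expected_sum, expected_carry)


# Adding 1 to all ones: a carry out of every bit, result zero, final carry 1.
check_both_orders(MASK, 1, 0, 1)
```

To test real hardware, `ripple_add` would be replaced by a call that drives the actual adder and reads back its sum and carry.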
The next set of test cases would be adding 1 to various "lone zero" patterns like 0111...111, 1011...111, 1101...111, ..., 1111...110. The zero should "eat" the carry propagation at that bit position, so that all bits in the result below that position are zero, and that bit and all higher bits are 1 (and, of course, there is no final carry out of the register).
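The lone-zero vectors and their expected results can be generated rather than typed by hand. The sketch below assumes a 16-bit width; `lone_zero_cases` and `add_with_carry` are hypothetical names, and `add_with_carry` is only a stand-in for the adder under test.

```python
WIDTH = 16


def lone_zero_cases(width=WIDTH):
    """Yield (operand, expected_sum) for adding 1 to each lone-zero pattern."""
    mask = (1 << width) - 1
    for k in range(width):
        operand = mask ^ (1 << k)         # all ones except a zero at bit k
        expected = mask ^ ((1 << k) - 1)  # bits below k cleared, bit k and up set
        yield operand, expected


def add_with_carry(a, b, width=WIDTH):
    """Stand-in for the adder under test: returns (sum, carry_out)."""
    total = a + b
    return total & ((1 << width) - 1), total >> width


for operand, expected in lone_zero_cases():
    # The zero must absorb the carry, so there is never a final carry out.
    assert add_with_carry(operand, 1) == (expected, 0)
    assert add_with_carry(1, operand) == (expected, 0)
```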
Another set of test cases would add these "lone 1" power-of-two bit patterns to various other patterns: 000...001, 000...010, 000...100, ..., 100...000. For instance, if one of these is added to the operand 1111...1111, then all bits from that bit position to the left should clear, and all the bits below it should be unaffected.
Next, a useful test case might be to add all of the 16 powers of two (the "lone 1" vectors), as well as zero, to each of the 65536 possible values of the opposite operand (and of course, commute and repeat).
Finally, I would repeat the above two "lone 1" tests with "lone 11": all bit patterns which have 11 embedded in 0's, in all possible positions. This way we hit the situation in which an adder combines two 1 bits and a carry in, requiring it to produce a sum of 1 and a carry out of 1.
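The "lone 1" and "lone 11" sweeps could be sketched like this. The function names are invented for illustration, `reference_add` is a stand-in for the adder under test, and ordinary integer addition is used as the expected result.

```python
WIDTH = 16


def lone_ones(width=WIDTH):
    """All patterns with a single 1 bit: 000...001 up to 100...000."""
    return [1 << k for k in range(width)]


def lone_elevens(width=WIDTH):
    """All patterns with 11 embedded in zeros, in every possible position."""
    return [0b11 << k for k in range(width - 1)]


def reference_add(a, b, width=WIDTH):
    """Stand-in for the adder under test: returns (sum, carry_out)."""
    total = a + b
    return total & ((1 << width) - 1), total >> width


def sweep(patterns, adder, width=WIDTH):
    """Add each pattern to every value of the opposite operand, in both orders.

    Returns the number of additions checked.
    """
    mask = (1 << width) - 1
    checked = 0
    for p in patterns:
        for other in range(1 << width):
            total = p + other
            expected = (total & mask, total >> width)
            assert adder(p, other, width) == expected
            assert adder(other, p, width) == expected
            checked += 2
    return checked


# Size of the full plan: (16 lone-1 vectors + zero + 15 lone-11 vectors)
# times 65536 opposite operands, times 2 for both commutations.
total_additions = (len(lone_ones()) + 1 + len(lone_elevens())) * (1 << WIDTH) * 2
```

Calling `sweep(lone_ones() + [0], reference_add)` and `sweep(lone_elevens(), reference_add)` runs the whole plan. That is 2^22 additions in total, roughly a thousandth of the 2^32 exhaustive space.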
Delay loops are nasty, and you shouldn't be using them very often.
Alternatives include timers, periodic interrupts and the like.
That said, they're occasionally useful. Be sure to service the watchdog timer (WDT) somewhere during the looping or you could have some nasty problems.
There's an assembly language code generator here that generates cycle-perfect assembly delay routines (delay time fixed at assembly, not variable).
There's no reason why you couldn't make a cycle-perfect programmable delay generator in asm but I think it would be a bit complex. Interrupts would cause inaccuracy, but often you only need to guarantee a minimum time and if it runs (say) 11ms rather than 10ms it's not a big deal.
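If only a minimum time has to be guaranteed, the iteration count for a simple delay loop can be worked out by rounding up. The sketch below only shows that arithmetic. The clock frequency, the cycles per iteration and the fixed overhead are hypothetical inputs; real values come from the instruction timings of your particular part, and interrupts are ignored because they can only lengthen the delay.

```python
def ceil_div(a, b):
    """Integer division rounded up, for non-negative a and positive b."""
    return -(-a // b)


def min_loop_count(delay_us, f_cpu_hz, cycles_per_iteration, overhead_cycles=0):
    """Smallest loop count whose cycle cost covers at least delay_us.

    cycles_per_iteration and overhead_cycles are assumptions that must be
    taken from the target's instruction timings.
    """
    required_cycles = ceil_div(delay_us * f_cpu_hz, 1_000_000)
    remaining = max(0, required_cycles - overhead_cycles)
    return ceil_div(remaining, cycles_per_iteration)


# Hypothetical example: 10 ms at 4 MHz, 3 cycles per iteration,
# 2 cycles of call/return overhead.
count = min_loop_count(10_000, 4_000_000, 3, overhead_cycles=2)
actual_cycles = 2 + count * 3
```

Rounding up makes the loop slightly longer than requested rather than shorter, which matches the "guarantee a minimum time" case.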
Best Answer
What you have drawn here is what we could call a logical schematic. It is a schematic representing concepts, not an actual physical component. You cannot estimate propagation delays with this.
Let's take an example. If you have an FPGA, you usually describe a concept using a language such as VHDL or Verilog. Then you can write test benches to test the logical behavior of that conceptual circuit. But this is just a conceptual simulation with ideal components. You can feed a 1THz signal through your gates and it will pass just fine. There is no rise time, no propagation delay, etc.
Now, once you are confident about your circuit, you perform a synthesis step. The tool will map your logical concept to actual hardware. FPGAs are usually built from LUTs (multiplexers, more or less) that have, say, 4 inputs. Any combinational 4-input circuit with a single output can be described using one 4-input LUT. The tool will map your design to the physical hardware available in the FPGA.
Finally, you can run a timing simulation of the synthesized design, which means running your test bench against a model of a selected chip with a selected speed grade. At that point, your simulation won't work with your 1THz signal, because your FPGA is very unlikely to be able to handle that...
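To illustrate the LUT point: a 4-input, single-output function is nothing more than a 16-entry truth table, so it can be stored as a 16-bit value and "evaluated" by indexing. This is a conceptual sketch only; `build_lut4` and `lut4_eval` are made-up names and do not reflect how any particular FPGA vendor encodes its LUTs.

```python
def build_lut4(func):
    """Tabulate a 4-input, 1-output Boolean function into a 16-bit table."""
    table = 0
    for index in range(16):
        a, b, c, d = [(index >> i) & 1 for i in range(4)]  # a is bit 0
        if func(a, b, c, d):
            table |= 1 << index
    return table


def lut4_eval(table, a, b, c, d):
    """Look up the output: the four inputs simply select one table bit."""
    index = a | (b << 1) | (c << 2) | (d << 3)
    return (table >> index) & 1


# A full adder fits in two such LUTs (the fourth input is left unused).
sum_lut = build_lut4(lambda a, b, cin, unused: a ^ b ^ cin)
carry_lut = build_lut4(lambda a, b, cin, unused: (a & b) | (cin & (a ^ b)))
```

The table carries no timing information at all, which is why the delay only appears once the design is mapped to a specific device.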
Beyond using unrealistic signals, your process or hardware will have limited resources. Again, on an FPGA, your tool will usually try to map adders to the physical hardwired adders on your FPGA. If you try to implement more adders than those available, the tool should implement the remaining adders using the digital fabric (LUTs or whatever resources it sees fit to use). Those remaining adders will probably be less efficient and slower than the hardwired ones. So the performance, maximum delay, clock skew, etc. are bound to the actual hardware you implement on.
You say that you use Cadence. Fine, but your simulator must use a model to simulate your gates. It will be either a logical (ideal) model or a physical one. If it is a logical one, there are probably no propagation delays, or only a default one which might not fit the physical implementation you would get in real life. If it uses a physical model, then your simulation will reflect the results you would get on that physical part, nothing else.
Consider standard discrete logic chips such as quad 2-input NANDs and the like. You have multiple possible families (RTL, DTL, TTL, ECL, PECL, etc.). Each of those logic families has different performance. It could be the same conceptual circuit, but maximum speed, propagation delay, power consumption, etc. will be very different.
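A toy calculation shows how much the answer depends on the model. It assumes, purely as an example, an n-bit ripple-carry chain whose worst case is the carry passing through every stage. The per-stage delays are invented placeholders, not datasheet values for any real family or device.

```python
def ripple_carry_delay_ps(n_bits, carry_stage_delay_ps):
    """Worst-case carry path: the carry ripples through every stage in turn."""
    return n_bits * carry_stage_delay_ps


# Placeholder per-stage carry delays in picoseconds (invented, illustration only).
hypothetical_models = {
    "ideal_logic_model": 0,  # logical simulation: no delay at all
    "model_a": 150,
    "model_b": 2000,
}

for name, stage_delay_ps in hypothetical_models.items():
    print(name, ripple_carry_delay_ps(16, stage_delay_ps), "ps")
```

The circuit is identical in all three cases; only the model changes, and so does the result.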
In summary, a physical propagation delay/speed simulation will be at best meaningless if you do not have models of the actual hardware you plan to implement your circuit on, whether it is discrete logic, FPGA, ASIC or whatever. If you are only toying, then such an analysis is bound to whatever model you choose to use.
Just in case you are wondering why I'm not directly answering your question from a theoretical point of view, it is because some hardware, such as an FPGA (which might use logic gates or a hardwired adder), won't implement actual discrete logic gates, so a theoretical analysis won't give you the whole picture.