This is a carry select adder. The speed benefit comes from a 2:1 multiplexer selecting M bits being faster than a ripple carry adder of M bits. (Each bit from the multiplexer depends only on its two candidate sum bits and the select bit (i.e., the carry-in), whereas the more significant bits of a ripple carry adder depend on the carry-out from the previous bits. The multiplexer delay would be the delay of a one-bit multiplexer plus the 1:M fan-out delay for the carry-in from the previous block into M bits.)
The more significant chunks of M bits are added in parallel with the first M bits, so they contribute no additional delay (the additions complete at the same time as the first M-bit addition, since the inputs are available at the same time and the adders are identical).
Since the latency of a simple carry select adder is the sum of the multiplexer latencies and the latency of the M-bit ripple carry adder, making M larger reduces the number of multiplexers but increases the ripple carry adder delay. For a 64-bit adder, this latency would be Lrca + ((64/M) - 1) * Lmux.
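This tradeoff can be sketched numerically. A minimal Python sketch under the idealized assumptions above (the helper name is hypothetical; Lmux is taken as the unit of delay, and N is the ratio of one-bit full adder latency to multiplexer latency):

```python
import math

def csa_latency(width, M, N):
    """Latency of a simple carry select adder, in multiplexer delays:
    the first block's ripple carry delay plus one mux hop per later block."""
    stages = math.ceil(width / M)
    return N * M + (stages - 1)

# Larger M means fewer multiplexer hops but a slower first ripple block.
for M in (4, 5, 6, 8, 16):
    print(M, csa_latency(64, M, N=2))
```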
In addition, since the multiplexer adds some delay (and the fan-out delay for the multiplexer does not increase linearly with M — e.g., a 7-bit multiplexer would not be 16.7% slower than a 6-bit multiplexer), ripple carry adders for more significant bits may be able to use more bits without adding delay.
Ignoring the complexities of real implementations such as size-dependent multiplexer delay, if a one-bit full adder has N times the latency of a multiplexer, then a minimum-latency 64-bit carry select adder with all stages the same size would balance the ripple carry delay (Lrca = N * M * Lmux) against the multiplexer delays ((64/M - 1) * Lmux). Setting N * M * Lmux = (64/M - 1) * Lmux amounts to solving the quadratic equation N * M² + M - 64 = 0. If N = 2 (a one-bit full adder has twice the latency of a multiplexer), M = 5.41. In this case M = 5 and M = 6 both give a total latency for a 64-bit adder of 22 multiplexer delays (13 stages of 5 bits: 2 * 5 + 12; 11 stages of 6 bits: 2 * 6 + 10), or 11 one-bit adder delays.
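The root of that quadratic, and the resulting latencies, can be checked with a short sketch (hypothetical helper name, same idealized assumptions, with Lmux as the unit of delay):

```python
import math

def balanced_width(width, N):
    """Positive root of N*M^2 + M - width = 0: the stage width at which
    the first-block ripple delay equals the multiplexer chain delay."""
    return (-1 + math.sqrt(1 + 4 * N * width)) / (2 * N)

print(round(balanced_width(64, 2), 2))   # 5.41

# Both neighboring integer widths give the same 64-bit latency:
for M in (5, 6):
    stages = math.ceil(64 / M)
    print(M, 2 * M + (stages - 1))       # 22 multiplexer delays for both
```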
Under the same assumptions but allowing adders of different widths, the widths would be 5, 5, 5, 6, 6, 7, 7, 8, 8, 7. (The second stage is the same width as the first stage; the third stage could be 5.5 bits, but that is not an integer number of bits. The last stage only needs the 7 remaining bits.) This would give a total delay of 19 multiplexer delays (10 stages with an initial 5-bit ripple carry adder delay: 2 * 5 + 9), or 9.5 one-bit adder delays.
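That width schedule can be reconstructed greedily: each stage's ripple carry adder may be as wide as can still finish by the time its select signal arrives. A sketch with a hypothetical helper (N = 2 as above, delays in multiplexer units):

```python
def variable_widths(total_bits, first_width, N):
    """Greedily size each stage so its ripple carry addition finishes
    no later than its select (carry-in) signal arrives."""
    widths = [first_width]
    remaining = total_bits - first_width
    select_time = N * first_width        # stage 2's select arrives when stage 1 finishes
    while remaining > 0:
        max_width = select_time // N     # widest adder finishing by select_time
        widths.append(min(max_width, remaining))
        remaining -= widths[-1]
        select_time += 1                 # each later select arrives one mux delay later
    return widths

widths = variable_widths(64, 5, N=2)
print(widths)                            # [5, 5, 5, 6, 6, 7, 7, 8, 8, 7]
print(2 * 5 + len(widths) - 1)           # 19 multiplexer delays
```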
If the latency of a one-bit full adder equaled the multiplexer latency (or, more correctly, the latency of an M-bit carry ripple adder equaled M times the latency of an M-bit 2:1 multiplexer including the fan-out delay), then the quadratic equation to solve would be M² + M - 64 = 0; M = 7.52, so 8-bit stages would give minimal latency (8 + 7 = 15 multiplexer delays). As the number of bits in the entire adder approaches infinity, the ideal M for a simple carry select adder approaches the square root of the number of bits (the linear term becomes insignificant).
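The N = 1 case follows the same arithmetic, and the square-root limit can be seen directly (a sketch, not tied to any particular implementation):

```python
import math

# Positive root of M^2 + M - 64 = 0 (adder and mux delays equal, N = 1).
M = (-1 + math.sqrt(1 + 4 * 64)) / 2
print(round(M, 2))                       # 7.52

# 8 stages of 8 bits: one 8-bit ripple plus 7 mux hops.
print(8 + (64 // 8 - 1))                 # 15 multiplexer delays

# For very wide adders the ideal M approaches sqrt(width).
print(math.isclose((-1 + math.sqrt(1 + 4 * 2**20)) / 2,
                   math.sqrt(2**20), rel_tol=0.001))
```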
(The multiplexer delay might also be exploited to allow more significant bits to arrive slightly later. Bits for the third stage could arrive one multiplexer delay later without increasing the latency of the 64-bit adder. A similar technique, width pipelining, divides the adder into multiple pipeline stages but with single cycle latency for dependent operations if they can start with just the least significant bits of the operands. Intel's Pentium 4 implemented such a width-pipelined (also called staggered) ALU.)
Note that such carry-select adders are not the lowest latency adders. For example, using a carry-lookahead adder instead of a ripple carry adder would reduce the delay in generating the carry bit, providing the select signal for the multiplexers earlier.
Your question is rather broad.
To start with: the good news is that you don't need to buy an FPGA board to find out how big your design is. The development tool will tell you. It will also tell you if you exceed the available resources (memories, LUTs, registers, DSPs, or I/O pins). If the design does not fit, you select a bigger FPGA in the tool settings, until you get to the really BIG ones you probably can't afford because they cost e.g. $15,000 each.
The second piece of good news is that most FPGA development tools are free, at least for the smaller FPGAs. And 'small' is still rather big.
The not-so-good news is that HLS is still in development. We ran some tests, and the results still markedly under-perform compared to hand-written Verilog or VHDL.
But for just comparing algorithms they are probably good enough.
Now as to "flow, parallelism", you get into difficult areas. The more logic runs in parallel, or the more pipeline stages you add, the faster the algorithm will run. But resource utilization (area) will also go up. It is one of the many tasks of an HDL designer to find a balance between speed and area.
Getting to "array width/length": that is the fastest way I have found to fill an FPGA. I recently designed code for convolution matrices: a module which had the matrix width/height as parameters. With little trouble I managed to fill 60% of the FPGA with that module alone (it was supposed to use 15%).
Best Answer
Your implementation is not correct:
This could be rewritten more or less nicely in a more generic form that depends only on the lengths of a, b, and c, so you could reuse the code for 16 bits, 64 bits, or any other width.