You are correct, and the diagram is wrong. You definitely do need wider adders as you move down the tree as shown. Each result from the first tier of adders is 34 bits wide, second-tier results are 36 bits wide, third-tier sums are 40 bits, and fourth-tier sums are 48 bits.
By the time you get to the last (fifth) tier, you're adding together a 48-bit number (the sum of the high-order partial products) and a 32-bit number (the shifted sum of the low-order partial products).
Note that you get 1 LSB of the final product before the first tier of adders, another one from their output, two more LSBs from the second tier, and so forth. By the time you get to the input of the fifth tier, you already have 16 LSBs of the result, and the final adder gives you the remaining 48 bits.
While it is tempting at first glance to pull off MSBs in a similar manner, it is always possible to construct an example in which a carry in the final adder needs to propagate all the way to the MSB of the result.
Side note: I have a book Computer Architecture: A Quantitative Approach, 2nd edition, published in 1996 by the same two authors. I'm wondering whether my book is a predecessor of yours, or a completely different effort. It does not have the figure you show in your question in it.
Best Answer
That looks like a classic shift-and-add type multiplier. A proper explanation would require a lot of diagrams, which sadly I can't provide at the moment. In a nutshell, if you are multiplying A times B, you start with an 'accumulator' register set to zero. Then you put A in a shift register and iteratively do the following: shift A left one place and look at the bit that falls out; if it's a '1', add B to the accumulator, otherwise don't. If there are any more bits remaining in A, you shift the accumulator left one place and repeat. If A and B are 32-bit values, there will be 32 iterations of this shifting and accumulating. Note also that since 32 shifts are involved, the accumulator has to be 64 bits to accommodate the result.
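As a rough Python sketch of the procedure just described (a software model of the iteration, not the hardware itself; the function name and `width` parameter are mine):

```python
def shift_and_add_multiply(a, b, width=32):
    """Multiply two unsigned `width`-bit integers MSB-first:
    each iteration shifts the accumulator left one place, peels
    the next bit off A, and conditionally adds B."""
    acc = 0
    for i in range(width):
        acc <<= 1                          # shift accumulator left one place
        bit = (a >> (width - 1 - i)) & 1   # the bit that "falls out" of A
        if bit:
            acc += b                       # add B when the bit is a '1'
    return acc                             # result is up to 2*width bits wide
```

The left shift of the accumulator before each conditional add is what gives each bit of A its correct binary weight in the product.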
An optimization that you sometimes see is to use the high 32 bits of the accumulator to hold the value of A; by the time any given bit in the accumulated product is needed, the remaining piece of the value of A will have been shifted out of the way. This just cuts down on the storage needed in implementation, but conceptually the same math is carried out.
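That register-sharing trick can be modeled the same way. In this sketch (again my own naming; a model under the assumption of unsigned operands), A starts in the high half of a single 64-bit register, and the accumulated product grows into the space A vacates as it shifts out the top:

```python
def fused_shift_and_add(a, b, width=32):
    """Shift-and-add multiply using one combined register: the high
    `width` bits initially hold A, the low bits accumulate the product.
    Each shift pushes A's next bit out the top of the register."""
    top = 1 << (2 * width)     # position just above the 2*width-bit register
    r = a << width             # A occupies the high half; accumulator is zero
    for _ in range(width):
        r <<= 1                # shifts both A and the accumulator together
        if r & top:            # the bit of A that just fell out is a '1'
            r += b
        r &= top - 1           # discard the fallen bit; keep 2*width bits
    return r                   # A is fully shifted out; r is the product
```

The accumulated partial product never collides with the remaining bits of A, because after i iterations the accumulator needs at most width+i bits while A's leftover bits have moved i places further up, which is why the single shared register suffices.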
I may have flubbed a detail in that hasty explanation, but the basic gist of it should be there.