You are assuming the capacitor will be a true short, which it won't be. The voltage will never rise infinitely fast: remember that in real life there is inductance and resistance to limit things. If we look at the formula for current through a capacitor:
\$ I = C \cdot \dfrac{dV}{dt}\$
We can see that I depends on the cap value and how fast the voltage source rises. The formula does not include the ESR though, so we have to allow for this separately.
This means that either the cap value and rise time or the ESR can limit the peak current. Roughly speaking, if the rise time is fast enough, the peak current will be limited by the ESR. If the result of the formula above is much lower than V/ESR, though, the current will be limited by the capacitance value and the voltage rise time.
You can see both effects at once: initially, at turn-on with a fast rise time, there will be a voltage divider effect between the wiring resistance and the ESR, then the capacitor charges as it would normally.
Let's look at a couple of examples, using the same rise time of 1ns to 1V but different ESR, capacitor value, and wiring resistance.
With a 100uF Capacitor, 1mΩ ESR, 1mΩ Rwiring:

With no ESR, we would expect I = 100uF * (1V/1ns) = 100kA. However, the resistance of the wiring and ESR of the capacitor divide to limit things to 500A initially, then the capacitor charges to 1V.

Now if we reduce the capacitor value to 10pF, but keep everything else the same, the current is limited by the capacitance value: I = 10pF * (1V/1ns) = 10mA:

The ESR has no effect here.
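As a rough cross-check of the two examples above, here is a minimal Python sketch that takes the smaller of the two limits (the slew-rate limit C·dV/dt and the resistive limit V/(ESR + Rwiring)). Taking the minimum is my simplification, not what the simulator does, and the function name `peak_inrush_current` is hypothetical; it ignores inductance entirely.

```python
def peak_inrush_current(c_farads, dv_volts, rise_time_s, esr_ohms, wiring_ohms):
    """Crude estimate of the initial charging current of a capacitor.

    Assumption: the peak is whichever is smaller, the slew-limited
    current C * dV/dt or the resistance-limited current V / (ESR + Rwiring).
    """
    slew_limited = c_farads * dv_volts / rise_time_s
    resistance_limited = dv_volts / (esr_ohms + wiring_ohms)
    return min(slew_limited, resistance_limited)


# Example 1: 100 uF, 1 mOhm ESR, 1 mOhm wiring, 1 V in 1 ns
big_cap = peak_inrush_current(100e-6, 1.0, 1e-9, 1e-3, 1e-3)

# Example 2: 10 pF, same resistances and rise time
small_cap = peak_inrush_current(10e-12, 1.0, 1e-9, 1e-3, 1e-3)

print(f"100 uF: {big_cap:.0f} A (resistance limited)")
print(f"10 pF:  {small_cap * 1e3:.0f} mA (slew limited)")
```

The first case comes out resistance limited (about 500 A) and the second slew limited (about 10 mA), matching the figures quoted above.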
Now if we simulate a more realistic situation with the 100uF capacitor, a wiring inductance of 100nH, and resistances increased to 10mΩ of wiring and 50mΩ of ESR, we get something like this, where everything works together to limit peak current:

These are very simplistic simulations. You could go on and add the capacitor's ESL, leakage current, wiring parasitic capacitance, etc.
About the capacitors on the input side of the regulator: without limiting, they will be subject to large currents at power-up regardless of the slew rate limiting on the output side.
Basically, the circuit works by using the C2-R3 combination as a model (or analog) for the charging of C1. Instead of sensing the current in C1 directly, R3 senses the current in C2, and it is assumed that this value is proportional to the current in C1, which is true as long as the voltage across R3 is a small fraction of the total.
Whenever there is a drop across R3 because of current through C2, the drive to the pass transistor is reduced.
The Thevenin equivalent of the base drive to Q1 is 1.5 V in series with 120 kΩ, so if the voltage across R3 ever rises as high as 1.5V - 0.6V = 0.9V, Q1 is cut off altogether, removing the drive from M1 as well. This would occur at a C2 current of \$\frac{0.9 V}{10 k\Omega} = 90 \mu A\$, which would correspond to a current in C1 of \$90 \mu A \cdot\frac{40000\mu F}{12\mu F} = 300 mA\$.
By adjusting the resistor and capacitor values, you can change that limiting current value. Note that the total current through M1 (and the power supply) is the C1 charging current plus the rising load current through R7, so pick the limit value accordingly.
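To experiment with other component values, the arithmetic above can be wrapped in a small Python sketch. The component values are the ones quoted in this answer; the 0.6 V base-emitter drop is the same rough figure used above, and the function name `c1_current_limit` is just an illustrative label.

```python
V_THEVENIN = 1.5   # V, open-circuit base-drive voltage from the divider
V_BE = 0.6         # V, assumed turn-on voltage of Q1 (rough figure)
R3 = 10e3          # ohms, sense resistor in series with C2
C1 = 40000e-6      # F, the big capacitor being charged
C2 = 12e-6         # F, the small "model" capacitor


def c1_current_limit(v_th, v_be, r3, c1, c2):
    """C1 charging current at which Q1 is cut off completely.

    Assumes the voltage across R3 is small enough that both capacitors
    see essentially the same dV/dt, so their currents scale as C1/C2.
    """
    v_r3_cutoff = v_th - v_be      # R3 voltage that removes all base drive
    i_c2 = v_r3_cutoff / r3        # corresponding current in C2
    return i_c2 * c1 / c2          # scaled up to C1


limit = c1_current_limit(V_THEVENIN, V_BE, R3, C1, C2)
print(f"C1 current limit: {limit * 1e3:.0f} mA")
```

Halving R3 or doubling C2, for example, changes the limit in ways that are easy to check with this function before going back to the simulator.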
C3 just serves to filter noise at the base of Q1. I would be inclined to make sure that the time constant associated with this (C3 times the Thevenin resistance calculated above) is less than 10% of the C2-R3 time constant. In other words, I would reduce C3 to 0.1 µF or less.
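The 0.1 µF figure follows from the 10% rule of thumb; a quick Python check, using R3 = 10 kΩ, C2 = 12 µF and the 120 kΩ Thevenin resistance from earlier in this answer (the 10% ratio itself is only my rule of thumb, not a hard requirement):

```python
R3 = 10e3            # ohms
C2 = 12e-6           # F
R_THEVENIN = 120e3   # ohms, Thevenin resistance of the base-drive divider
RATIO = 0.10         # C3 time constant should be at most 10% of the C2-R3 one

tau_model = R3 * C2                       # C2-R3 time constant, seconds
c3_max = RATIO * tau_model / R_THEVENIN   # largest C3 that satisfies the rule

print(f"C2-R3 time constant: {tau_model * 1e3:.0f} ms")
print(f"Maximum C3: {c3_max * 1e6:.2f} uF")
```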
I am not sure why can we set C2 and R3 into relation to get the current through C1?
The basic equation of a capacitor says that the current through a capacitor is proportional to the rate of change of the voltage across it, and also to its capacitance:
$$i(t) = C \frac{dV(t)}{dt}$$
As I said, the assumption is that the voltage across R3 is "small", which means that V(t) is essentially the same for both capacitors. This means that the current through each is directly proportional to its capacitance.
In this specific case, the voltage across R3 can be as high as 0.9 V, which is 10% of the supply voltage, so the proportionality isn't as precise as it could be, but it's good enough for this purpose.
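Written out, with \$v_{R3}\$ as my label for the drop across R3 and assuming the C2-R3 branch sits across the same voltage \$V(t)\$ as C1, the argument is:

$$i_{C1} = C_1 \frac{dV}{dt}, \qquad i_{C2} = C_2 \frac{d(V - v_{R3})}{dt} \approx C_2 \frac{dV}{dt} \quad\Rightarrow\quad \frac{i_{C1}}{i_{C2}} \approx \frac{C_1}{C_2}$$

The approximation is only as good as the assumption that \$v_{R3}\$ changes slowly compared with \$V\$.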
Regarding C3, it seems to be important to prevent an initial inrush peak due to Q1 switching "fully" on immediately. I have to choose a value above 1uF to prevent that.
Yes. I neglected to take into account the very high resistances connected to the gate of M1. These are needed in order to keep the maximum current through Q1 (which also flows through R3) to a small fraction of the C2 current, but they severely limit M1's bandwidth (its ability to switch quickly). In particular, it turns off very slowly because the gate charge must be dissipated through R1 (220 kΩ) alone.
Therefore, it makes sense that having the voltage reference (R4, R5, R6 and C3) start up slowly would help avoid that initial spike in current.
Regarding the choice of M1: as long as it can handle the voltage and the current, and you control the maximum VGS it sees by appropriate selection of R1 and R2, there's really nothing special required.
Pay attention to the SOA (safe operating area) diagram in the datasheet. You can plot some sample voltage and current values from the simulation in order to make sure it stays in the safe area.
Also, be sure to give it an adequate way to dissipate the pulse of heat associated with the charging surge of C1 without having its temperature rise too high.
Decades ago I was tasked with analyzing the power dissipation of switch-cap circuits.
As part of that, I wrote the integrals of energy in the switch and in the capacitor.
Turns out the SWITCH(RESISTOR) energy dissipation is exactly the energy stored in the capacitor.
Thus if you have a large audio amplifier with 100,000 uF capacitors on +50 V and -50 V rails, the energy stored in each will be
$$E = \frac{1}{2} C V^2 = 0.5 \cdot 0.1\,\text{F} \cdot (50\,\text{V})^2 = 125\,\text{J}$$
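The claim that the resistor dissipates exactly the stored energy can be checked numerically. The sketch below is not the original analysis; it assumes an ideal step source charging an initially empty capacitor through a fixed linear resistor, and integrates i²R over time with a simple midpoint rule. The function name `resistor_energy` is hypothetical.

```python
import math


def resistor_energy(c_farads, v_volts, r_ohms, steps=100_000, n_tau=20.0):
    """Energy dissipated in R while charging C from 0 V to v_volts.

    Assumes an ideal voltage step, so i(t) = (V/R) * exp(-t / (R*C)).
    Integrates i^2 * R over n_tau time constants with the midpoint rule.
    """
    tau = r_ohms * c_farads
    dt = n_tau * tau / steps
    energy = 0.0
    for k in range(steps):
        t = (k + 0.5) * dt                              # midpoint of the slice
        i = (v_volts / r_ohms) * math.exp(-t / tau)     # charging current
        energy += i * i * r_ohms * dt                   # power * time slice
    return energy


C = 0.1    # F (100,000 uF)
V = 50.0   # V
stored = 0.5 * C * V ** 2
print(f"Stored in capacitor: {stored:.1f} J")
for r in (0.1, 1.0, 10.0):
    print(f"R = {r:>4} ohm: dissipated {resistor_energy(C, V, r):.3f} J")
```

Under these assumptions the dissipated energy comes out the same for every resistor value tried; a larger R only spreads the same energy over a longer time.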
Assume your surge-limiting resistor dissipates that energy in 1/120 second.