Essentially, yes. The clock divider (a.k.a. System Clock Prescaler) is the first device in the clock chain after the input. All other clocks are derived from the output of this prescaler which means everything with the exception of the clock input and prescaler will run at 1MHz (and so draw the same power as if driven by a 1MHz clock input).
If timing isn't critical, you can also use the internal 8MHz oscillator with the 8:1 prescaler. This eliminates the power consumed by an external crystal or resonator and by the external oscillator circuitry that drives it.
However, if you do want accurate timing (the internal oscillator is +/-10% unless calibrated), then you should use a crystal and configure the clock source to be the "Low Power Crystal Oscillator" (as opposed to the "Full Swing" option) in the ATmega fuse settings.
1MHz crystals are not that common, but there are options such as 1.8432MHz, which is a standard 'baud rate' crystal (so named because you can accurately divide it down into standard serial baud rates, e.g. 9600). These are very common and quite cheap. This will reduce the power consumed in the oscillator circuitry and clock input. Granted, that is not 1MHz, but if you write to the CLKPR register at start-up you can enable a 2:1 prescaler to get 921.6kHz, which is close. If you are doing anything with serial comms this is well worth looking into.
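To see why the 'baud rate' crystal is worth it for serial work, here is a rough sketch that compares the baud error at 921.6kHz and at 1MHz. It assumes the usual AVR USART formula for asynchronous normal mode (U2X = 0), UBRR = f_clk / (16 × baud) - 1; that formula and the function names `ubrr_for`, `actual_baud` and `baud_error_pct` are my own additions, not something from the text above.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed AVR USART formula (asynchronous normal mode, U2X = 0):
 * UBRR = f_clk / (16 * baud) - 1, rounded to the nearest integer. */
uint16_t ubrr_for(uint32_t f_clk_hz, uint32_t baud)
{
    assert(baud > 0 && f_clk_hz >= 16u * baud);
    uint32_t step = 16u * baud;
    return (uint16_t)((f_clk_hz + step / 2u) / step - 1u);
}

/* Baud rate actually produced by a given UBRR value, in Hz. */
double actual_baud(uint32_t f_clk_hz, uint16_t ubrr)
{
    return (double)f_clk_hz / (16.0 * ((double)ubrr + 1.0));
}

/* Signed relative error of the achieved baud rate, in percent. */
double baud_error_pct(uint32_t f_clk_hz, uint32_t baud)
{
    uint16_t ubrr = ubrr_for(f_clk_hz, baud);
    return 100.0 * (actual_baud(f_clk_hz, ubrr) - (double)baud) / (double)baud;
}
```

With these assumptions, `ubrr_for(921600, 9600)` gives 5 with no baud error, while `ubrr_for(1000000, 9600)` gives 6 and an error of roughly -7%, which is usually too much for reliable comms.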
If you want an accurate 1MHz, then 2MHz crystals are also quite common. Similarly if you are using timers and want to be able to time a millisecond accurately you can also go for 2.048MHz (again common) which gives you a power of two division down to 1kHz.
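A quick way to check whether a crystal divides down to a target frequency by a power of two is sketched below. The helper name `pow2_shift` is hypothetical, and the sketch only does the arithmetic; it says nothing about which prescaler or timer stages you would chain together to get the division.

```c
#include <assert.h>
#include <stdint.h>

/* Returns n such that f_in / 2^n == f_target exactly,
 * or -1 if the ratio is not an exact power of two. */
int pow2_shift(uint32_t f_in_hz, uint32_t f_target_hz)
{
    assert(f_target_hz > 0);
    if (f_in_hz % f_target_hz != 0)
        return -1;                      /* not an integer ratio */
    uint32_t ratio = f_in_hz / f_target_hz;
    if (ratio == 0 || (ratio & (ratio - 1u)) != 0)
        return -1;                      /* integer, but not a power of two */
    int n = 0;
    while (ratio > 1u) {                /* count the halvings */
        ratio >>= 1;
        n++;
    }
    return n;
}
```

For example, `pow2_shift(2048000, 1000)` returns 11 (2.048MHz / 2^11 = 1kHz) and `pow2_shift(2000000, 1000000)` returns 1, whereas `pow2_shift(2000000, 1000)` returns -1 because 2000 is not a power of two.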
The prescaler register description states that the input clock is divided by the register value + 1. So if your input frequency is 84 MHz and you want the timer to count at 1 MHz, you have to program 84 - 1 = 83 into the PSC register to get a divider of 84 and thus a counter clock of 1 MHz.
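The "register value + 1" rule can be sketched as a small helper. This is only an illustration of the arithmetic; `psc_value` is a made-up name, and it assumes a 16-bit PSC register and that you want an exact integer divider.

```c
#include <assert.h>
#include <stdint.h>

/* Returns the value to write to a 16-bit PSC register so that
 * f_counter = f_in / (PSC + 1), or -1 if no exact divider fits. */
int32_t psc_value(uint32_t f_in_hz, uint32_t f_counter_hz)
{
    assert(f_counter_hz > 0);
    if (f_in_hz % f_counter_hz != 0)
        return -1;                              /* divider would not be an integer */
    uint32_t divider = f_in_hz / f_counter_hz;  /* e.g. 84 MHz / 1 MHz = 84 */
    if (divider == 0 || divider > 65536u)
        return -1;                              /* outside the 16-bit range */
    return (int32_t)(divider - 1u);             /* register holds divider - 1 */
}
```

`psc_value(84000000, 1000000)` returns 83, matching the example above. A request such as `psc_value(84000000, 1000)` returns -1 because a divider of 84000 does not fit in 16 bits.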
The internal PSC counter is not accessible, so there is no workaround for the 16-bit limitation.
Program the ARR register with 39999; the overflow will occur on the next (the 40000th) edge.
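The same off-by-one applies to ARR, so the two registers can be sketched together as follows. The helper names are hypothetical, a 16-bit ARR is assumed, and combining the 84 MHz / PSC = 83 setting with ARR = 39999 is my own pairing for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* ARR value for an overflow every `counts` counter clock edges.
 * The counter runs 0..ARR, so the register holds counts - 1. */
uint16_t arr_for_counts(uint32_t counts)
{
    assert(counts >= 1 && counts <= 65536u);   /* 16-bit ARR assumed */
    return (uint16_t)(counts - 1u);
}

/* Update (overflow) period in microseconds for given register values:
 * t = (PSC + 1) * (ARR + 1) / f_in. */
double update_period_us(uint32_t f_in_hz, uint16_t psc, uint16_t arr)
{
    assert(f_in_hz > 0);
    double ticks = ((double)psc + 1.0) * ((double)arr + 1.0);
    return ticks * 1e6 / (double)f_in_hz;
}
```

Under those assumptions, `arr_for_counts(40000)` gives 39999, and `update_period_us(84000000, 83, 39999)` works out to 40000 µs, i.e. one overflow every 40 ms.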
To answer your first question.
Basically it is all the same with all microcontrollers and your calculation was correct. In your example, with a 16-bit timer and $$f_{\text{SystemClock}} = f_{\text{timer}} = 8\,\text{MHz}$$
As you said, we have a tick every $$T_{\text{timer}} = \frac{1}{f_{\text{SystemClock}}} = \frac{1}{f_{\text{timer}}} = \frac{1}{8\,\text{MHz}} = 0.125\,\mu\text{s}$$
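With a 16-bit timer, the largest count value is
$$ticks_{\text{max}} = (2^{16} - 1) = 65535$$
The counter overflows on the tick after it reaches this value, so one full cycle takes 65536 ticks. The timer will therefore overflow every
$$t_{\text{overflow}} = (ticks_{\text{max}} + 1) \times T_{\text{timer}} = 65536 \times 0.125\,\mu\text{s} = 8.192\,\text{ms}$$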
You can count overflows to get a specific delay. Now if you want to change the value of $t_{\text{overflow}}$
This way a wide range of delay values can be achieved.
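As a rough sketch of the overflow-counting idea, the helpers below work out how long one overflow takes and how many whole overflows (plus leftover ticks) make up a requested delay. The names `overflow_period_us`, `delay_plan` and `plan_delay` are made up for this example, and it assumes a free-running up-counter that wraps after counting 0 through 2^bits - 1, i.e. 2^bits ticks per overflow.

```c
#include <assert.h>
#include <stdint.h>

/* Time between overflows of an n-bit up-counter, in microseconds.
 * The counter takes 2^bits ticks to wrap (0 .. 2^bits - 1, then back to 0). */
double overflow_period_us(uint32_t f_timer_hz, unsigned bits)
{
    assert(f_timer_hz > 0 && bits >= 1 && bits <= 32);
    double ticks = (double)(1ULL << bits);
    return ticks * 1e6 / (double)f_timer_hz;
}

/* Whole overflows that fit into the requested delay, plus leftover ticks. */
typedef struct {
    uint32_t overflows;
    uint32_t remainder_ticks;
} delay_plan;

delay_plan plan_delay(uint32_t f_timer_hz, unsigned bits, uint32_t delay_us)
{
    assert(f_timer_hz > 0 && bits >= 1 && bits <= 32);
    /* Total timer ticks in the delay, using 64-bit math to avoid overflow. */
    uint64_t total_ticks = (uint64_t)delay_us * f_timer_hz / 1000000u;
    uint64_t per_overflow = 1ULL << bits;
    delay_plan p = { (uint32_t)(total_ticks / per_overflow),
                     (uint32_t)(total_ticks % per_overflow) };
    return p;
}
```

For the 8 MHz, 16-bit case, `overflow_period_us(8000000, 16)` is 8192 µs, and `plan_delay(8000000, 16, 100000)` splits a 100 ms delay into 12 full overflows plus 13568 remaining ticks.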