I vote for DMA. It's really flexible on Cortex-M3 and up, and you can do all kinds of crazy things, like automatically fetching data from one place and outputting it to another at a specified rate or on certain events, without spending ANY CPU cycles. DMA is also much more reliable for tight timing.
But it can be quite hard to understand in detail.
Another option is a soft core on an FPGA, with the timing-critical parts implemented in hardware.
You are counting many periods of a fairly short timer, each expiration of which requires you to wake up. To get maximum power savings, you need to put the timing entirely in hardware, so that you only generate an interrupt at the end of the desired period, or at least fairly infrequently. You should probably look at using a larger value in the timer register, and possibly using a prescaler divisor. (Even if you are waking up only a few times a second and so not wasting much power, this will make it a pain to measure power consumption unless you use a scope to measure across a series resistor.)
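To make the "larger timer value plus prescaler" idea concrete, here is a minimal sketch of the arithmetic. It assumes a hypothetical timer with a 16-bit prescaler (divide by 1..65536) and a 16-bit counter (1..65536 counts per period); `timer_cfg_t` and `timer_cfg_for_period` are names made up for illustration, and your part's actual registers and ranges will differ, so check the reference manual.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical timer limits: 16-bit prescaler and 16-bit counter. */
#define TIMER_MAX_DIV    65536u
#define TIMER_MAX_COUNT  65536u

typedef struct {
    uint32_t prescaler; /* divide-by value, 1..65536 */
    uint32_t reload;    /* counts per period, 1..65536 */
} timer_cfg_t;

/* Pick the smallest prescaler that lets the whole period fit into a
 * single counter run, so the CPU is woken once per period instead of
 * once per short tick. Returns false if the period cannot be reached
 * in one run (or is too short to represent). */
bool timer_cfg_for_period(uint32_t clk_hz, uint32_t period_ms,
                          timer_cfg_t *out)
{
    /* Total input clock ticks in the requested period. */
    uint64_t ticks = (uint64_t)clk_hz * period_ms / 1000u;
    if (ticks == 0) {
        return false;
    }

    /* Ceiling division: smallest divider with ticks/div <= max count. */
    uint64_t div = (ticks + TIMER_MAX_COUNT - 1u) / TIMER_MAX_COUNT;
    if (div > TIMER_MAX_DIV) {
        return false;
    }

    out->prescaler = (uint32_t)div;
    out->reload = (uint32_t)(ticks / div); /* rounds down slightly */
    return true;
}
```

With a 32768 Hz clock, a one second period fits with no prescaling at all (divider 1, 32768 counts). With a 72 MHz clock the same period needs a divider of about 1100, which is one more reason to move to a slow clock for sleep timing.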
However, even then you would likely still be running on a fast PLL clock, which consumes a lot of power. So for more power savings you will want to switch to a low-power clock source such as an internal or external kHz-range backup clock, or at least disable the PLL and possibly crank up any system-wide clock prescale divider.
You also likely have many parts of the chip powered up and clocked that are not needed in sleep mode. Most likely you want to power down and de-clock everything except the GPIO, interrupt controller, counter module and RAM.
Additionally, you need to audit your circuit design for any cases where you are driving a signal against a pullup or pulldown resistor, or even letting a digital input float in the vicinity of a logic transition.
Getting a system down to microamp standby modes can be an involved project, as you eliminate one power leech after another. Also watch out for debugger, serial, USB, etc. connections, not only as potential loads, but also as potential stealth power sources that let the system draw power while bypassing whatever you are measuring with (yes, you can get energy from data pins).
Best Answer
Once again, it seems I did not use Google enough.
Section 10.6.7 of the given datasheet, which is also described on this ARM Information Center page, covers the exception entry and return process.
So the answer to my question is that the faulting instruction address resides on the stack at an offset of +0x18 after entering the exception handler.
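For reference, a small sketch of where that +0x18 comes from. On exception entry the core pushes eight words (R0-R3, R12, LR, PC, xPSR), and the stacked PC is the seventh of them. The struct and helper names below (`exception_frame_t`, `stacked_pc`) are my own, not from the datasheet, and the sketch assumes you already have a pointer to the start of the stacked frame. In a real handler that means picking MSP or PSP based on bit 2 of EXC_RETURN, and doing so before the handler's own prologue pushes anything else.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Basic frame pushed by hardware on exception entry, lowest address
 * first. Illustrative type name, not part of any vendor header. */
typedef struct {
    uint32_t r0;
    uint32_t r1;
    uint32_t r2;
    uint32_t r3;
    uint32_t r12;
    uint32_t lr;
    uint32_t pc;   /* stacked return address, offset 0x18 */
    uint32_t xpsr;
} exception_frame_t;

/* Compile-time check that the layout matches the documented offset. */
static_assert(offsetof(exception_frame_t, pc) == 0x18,
              "stacked PC must sit at offset 0x18");

/* Read the stacked PC given a pointer to the start of the frame
 * (the value of MSP or PSP at exception entry). */
static inline uint32_t stacked_pc(const uint32_t *frame_sp)
{
    return frame_sp[offsetof(exception_frame_t, pc) / sizeof(uint32_t)];
}
```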