1 and 2: there's no hardware floating point unit on the M0, so floating point is done entirely in software and the speed depends on your compiler's runtime library. Expect on the order of tens to possibly low hundreds of cycles per operation for single precision, with full IEEE compatibility. As for double precision, you're probably looking at high hundreds, maybe even breaking the thousand-cycle barrier, again assuming full IEEE compatibility.
3: single cycle.
I think @markt is certainly in the right place: Toolchain, peripherals, packages, devkits.
I'll add a few, and maybe take off a few. Toolchain is certainly important, but FREE may or may not be. Sometimes, working without real support can be more expensive than you think it is, and using a reasonable commercial package may well be worth it for a given situation. Sometimes, being able to pass a thorough license audit is important as well, and using a free tool with a restrictive license can bite you later.
A good CMSIS library to support the microcontroller is a must for me. CMSIS -- Cortex Microcontroller Software Interface Standard -- arm.com/products/processors/cortex-m/… -- is a hardware abstraction layer for Cortex-M series microcontrollers. In theory, if a library is CMSIS compliant, it's vendor-independent, so it's easier to swap between families, and you don't have to relearn an environment from the ground up to be able to use the library. One of the attractive aspects of the ARM Cortex environment is the ability to change platforms without a whole bunch of sweat. If you pick a platform that doesn't buy into the CMSIS structure, you may not be able to move around as conveniently.
For me, cheap and convenient dev boards are a must, but this may or may not be as important as some other things (I think the STM32 series has amazing dev boards). If the family has very convenient and cheap dev boards, then you're more likely to find help from a larger userbase if you need it. Also, these chips tend to be in SMT packages. When you inevitably blow up a chip, or a port on a chip, or a bit on a port on a chip, replacing the chip is a PITA involving SMD rework. If you can purchase two or three boards at $10-$15 each, and replace them as you bust them, you won't even THINK about doing that SMD rework!
Think "Extras". You may need something above and beyond what is considered a "peripheral". For example, maybe you have heavy Bluetooth needs, and you might choose to go with Nordic Semiconductor for that kind of support. You might also consider other things, like how easy bootloading is.
Think Documentation. I've been a bit less than impressed with how hard it can be to wade through some of the STM documentation.
Best Answer
Here are a couple of pointers that I can provide. The specifications that NXP provides are for their entire chip (core, memory, peripherals). The specification that ARM provides is based on just the core. As the numbers are derived differently, it's really hard to do the comparison.
So, I propose we step back and look at two devices: an NXP M0-based MCU and an NXP M3-based MCU.
For the M0-based MCU let's look at the LPC1111. When this MCU is executing a busy idle loop it will consume 3mA of current at a 12MHz clock rate. This yields 250uA/MHz, which at 3.3V is 825uW/MHz.
For the M3-based MCU let's look at the LPC1311. When this MCU is executing the same busy idle loop it will consume 4mA of current at 12MHz. This yields 333.3uA/MHz, which at 3.3V is 1.1mW/MHz.
If we look at a MSP430C1101 MCU (16-bit) we'll see it's going to use 240uA at 1MHz when the voltage is 3V. This yields 720uW/MHz.
Next, let's turn to the ATMega328 (used in Arduino Uno). We see 200uA used at 1MHz with a voltage of 2V. This yields 400uW/MHz.
It should also be noted that the MSP430 and AVR are spec'ed differently. Their power consumption is given at 1MHz and at a lower supply voltage, whereas the M0 and M3 are given at 12MHz and 3.3V. This means the M0 and M3 have any inefficiencies of scaling up to 12MHz baked into their numbers.
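If you want to redo the arithmetic above for your own parts, here is a minimal Python sketch. The current, clock, and voltage figures are the ones quoted in this answer; the dictionary layout and the `per_mhz` helper are just names I made up for illustration.

```python
# Figures quoted in this answer: (active current in mA, clock in MHz, supply in V).
parts = {
    "LPC1111 (M0)": (3.0, 12.0, 3.3),
    "LPC1311 (M3)": (4.0, 12.0, 3.3),
    "MSP430C1101": (0.240, 1.0, 3.0),
    "ATmega328": (0.200, 1.0, 2.0),
}


def per_mhz(current_ma, clock_mhz, supply_v):
    """Return (uA/MHz, uW/MHz) for one active-mode operating point."""
    ua_per_mhz = current_ma * 1000.0 / clock_mhz  # mA -> uA, then per MHz
    uw_per_mhz = ua_per_mhz * supply_v            # P = I * V
    return ua_per_mhz, uw_per_mhz


results = {name: per_mhz(*spec) for name, spec in parts.items()}

for name, (ua, uw) in results.items():
    print(f"{name:14s} {ua:7.1f} uA/MHz {uw:7.1f} uW/MHz")
```

Keep in mind that this only normalizes the numbers; it does not correct for the different test conditions (clock rate and supply voltage) behind each figure.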
These values are all active current consumption numbers. If you look at the current consumption when the device is in a sleep state you see orders of magnitude less power being used. The advantage that the 32-bit M0 provides is that it can get a lot more work done in less time than an 8- or 16-bit MCU. This means for a given workload it will spend a lot more time in a sleep state. The M0 in the hands of a good engineer will often get far better power efficiency than an 8-bit MCU in the hands of a less skilled engineer, despite the differences in active power consumption.
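To make the sleep-time argument concrete, here is a rough duty-cycle sketch in Python. Only the active currents (3mA at 12MHz, 200uA at 1MHz) come from the figures above. The 1uA sleep current, the one-second wake period, and the cycle counts (including the guess that the 8-bit part needs four times as many cycles for the same job) are made-up placeholders, so treat the output as an illustration and not a measurement. It also compares current only and ignores the different supply voltages.

```python
def average_current_ua(active_ua, sleep_ua, cycles, clock_hz, period_s):
    """Average current over one wake/sleep period, in microamps."""
    active_s = cycles / clock_hz  # time spent awake doing the work
    if active_s > period_s:
        raise ValueError("workload does not fit in the period")
    duty = active_s / period_s    # fraction of the period spent awake
    return duty * active_ua + (1.0 - duty) * sleep_ua


PERIOD_S = 1.0  # assumed: wake up once per second
SLEEP_UA = 1.0  # assumed: same sleep current for both parts

# Assumed workload: 12,000 cycles on the M0, 4x that on the 8-bit part.
m0_avg = average_current_ua(3000.0, SLEEP_UA, 12_000, 12e6, PERIOD_S)
avr_avg = average_current_ua(200.0, SLEEP_UA, 48_000, 1e6, PERIOD_S)

print(f"M0 average:    {m0_avg:.3f} uA")
print(f"8-bit average: {avr_avg:.3f} uA")
```

With these placeholder numbers the part with the higher active current ends up with the lower average, simply because it goes back to sleep sooner. Change the cycle ratio or the sleep currents and the result can easily flip, which is why it really depends on the application.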
From my experience the M0 is so close to 16 and 8 bit active power consumption that you can make up for a lot of the differences in application. Also, many times the power consumption of everything you have hanging off of the MCU dwarfs the MCU. So, for a lot of applications tackling the efficiency of the MCU isn't the most important thing.
I hope that helps. It is a long way of saying that power consumption is a bit worse, but you get a lot more done with those clock cycles than other chips would. So, it really depends on your application.