I only looked at one of your proposed parts, the Murata LLL153C80G105ME21. I compared it with a same-value part in a larger package (GRM21BR71E105KA99#, 0805 size); the key improvement is in the available voltage rating. The 0204 part is rated for 4 V, while the 0805 part is rated for 25 V.
Even if your application only applies 4 V to the cap, take note of the charts of capacitance change versus applied DC voltage. With 4 V applied, the value of the 0204 part drops to a bit above 30% of nominal (e.g. about 0.3 uF instead of 1 uF). The 0805 part still holds about 95% of its nominal value at 4 V, and only loses about 45% of its value with the full 25 V applied.
So the smaller part can be used if you can accept its reduced temperature range, but its effective value will drop to roughly 0.3 uF, only a few times the 0.1 uF value that has typically been recommended for the near-chip bypass capacitor over the past decade or so. If you really want 1.0 uF of bypassing, you'll still have to add some larger parts in parallel with the suggested 0204 part.
On the other hand, if you can live with the low WV (working voltage) rating and you use this part in place of the "traditional" 0.1 uF 0402 part (in parallel with additional larger-value caps), you will gain a 3-4x increase in effective capacitance, which is a substantial improvement.
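The 3-4x figure is just multiplication of nominal value by the fraction retained under DC bias. Here is a minimal Python sketch of that arithmetic; the retention fractions for the two 1 uF parts are the chart readings quoted above, while the 75% to 100% retention assumed for a 0.1 uF 0402 part is a hypothetical range, not a datasheet value.

```python
# Rough effective-capacitance comparison under DC bias.
# Retention fractions for the 1 uF parts are the ~30% / ~95% chart readings
# at 4 V quoted above. The 0402 0.1 uF retention range is a HYPOTHETICAL
# assumption; read the real curve from the manufacturer's data.

def effective_uf(nominal_uf: float, retention: float) -> float:
    """Capacitance left at the operating DC bias, in uF."""
    return nominal_uf * retention

small_1uF = effective_uf(1.0, 0.30)   # 0204 part at 4 V
large_1uF = effective_uf(1.0, 0.95)   # 0805 part at 4 V
print(f"0204: {small_1uF:.2f} uF, 0805: {large_1uF:.2f} uF at 4 V")

# Hypothetical retention range for a "traditional" 0.1 uF 0402 part.
for retention_0402 in (0.75, 1.0):
    trad = effective_uf(0.1, retention_0402)
    print(f"0402 at {retention_0402:.0%}: 0204 part gives {small_1uF / trad:.1f}x")
```

Swapping in the real retention of whichever 0402 part you would otherwise use gives you the actual ratio for your design.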
Also, in a high-reliability application, you may want to use a package at least one size up from the minimum needed for the capacitor value and WV you are using. The smallest available sizes push the limits of what manufacturers can do and can have reliability issues.
So, you could fill a book with the answer to this question; in fact, I think I have some on my shelf.
Let’s run through your questions.
Should you use a 4 layer board instead of a 2 layer? I say absolutely yes; the cost argument for going 2 layer is a weak one at best compared to the advantages. Obviously it can be done, and is done, and in this device's case I see they placed VCC and GND right next to each other to make this easier to accomplish. So while I would go 4 layer, you can probably get away with 2 if you want.
Now, without going too deep, consider the goal of decoupling your processor. You are trying to supply a stable voltage to it despite the fact that it has dynamic current demands. For instance, when your processor is active and its transistors switch, they request more current. This is a change, an increase over the steady-state current draw. So you have a changing current, but where are you going to get that current from?
Well, first there’s a little decoupling on the die, but beyond that the chip tries to pull the current through the package power and GND pins. It wants to reach the capacitor you placed outside your device, but before it gets there it has to travel through the bond wires and/or package substrate, out the pins, and down your traces. All of this contributes to the inductance, and ultimately the impedance, of the path from the die inside the chip to the capacitor.
Why does this matter? Because an inductor “resists” changes in current; consequently, its impedance increases as frequency increases. That’s a simplification, but when you try to drag that change in current through your package and routing, the inductance limits how much current you can get.
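To put rough numbers on this, the sketch below computes the reactance of a small path inductance, |Z| = 2πfL, and the voltage it drops for a fast current step, V = L·di/dt. The 2 nH path inductance, 100 mA step and 1 ns edge are hypothetical values picked only to show the arithmetic, not figures for this processor.

```python
import math

# Why inductance in the die-to-capacitor path matters.
# HYPOTHETICAL values: 2 nH path inductance, 100 mA step, 1 ns edge.

def inductive_reactance(l_henry: float, f_hz: float) -> float:
    """|Z| of an ideal inductor: 2*pi*f*L, in ohms."""
    return 2 * math.pi * f_hz * l_henry

def inductive_drop(l_henry: float, di_amps: float, dt_s: float) -> float:
    """Voltage across an inductor for a linear current ramp: L * di/dt."""
    return l_henry * di_amps / dt_s

L_PATH = 2e-9  # hypothetical bond wire + pin + trace inductance

# Impedance rises linearly with frequency.
for f in (1e6, 10e6, 100e6):
    print(f"{f / 1e6:>5.0f} MHz: {inductive_reactance(L_PATH, f) * 1e3:.1f} mOhm")

# Voltage lost across the path while the current ramps.
print(f"Droop for 100 mA in 1 ns: {inductive_drop(L_PATH, 0.1, 1e-9):.2f} V")
```

Even a couple of nanohenries costs a noticeable fraction of a volt for a fast edge, which is why the next step is to shrink that path.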
So your goal when placing your decoupling capacitors should always be to minimize the impedance, and thus the inductance, from the pin to your cap. With a QFP package like this, you may find the shortest possible connection is right at the pins; with a 4 layer board and a BGA it might be directly underneath, but in practice you can sometimes achieve even lower impedance with the caps on the top layer as well.
Don’t ignore GND either. Current flows in a loop; it does you no good to have a super short path to VCC and a long, winding path to GND. So if you’re going 2 layer, I would place the caps parallel to the pins, as close as possible to GND and VCC, route directly to the pins, and then bring power and GND into the caps. Your goal is to minimize the loop area.
More 4 layer arguments and selection
The goal of what we call power distribution network (PDN) design is to minimize the impedance across the range of frequencies that your chip will demand. To that end, a nice fat GND and VCC plane leading from your caps/part to your regulator will be a much lower impedance path for the lower frequencies, down to DC. Short of that, fat, wide traces are recommended if you can fit them.
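One common way to put a number on "minimize the impedance" is a target impedance: the allowed ripple voltage divided by the transient current step. This is a general PDN rule of thumb rather than anything specific to this processor, and the 3.3 V rail, 5% ripple and 0.5 A step below are hypothetical.

```python
# General PDN rule of thumb: Z_target = allowed ripple / transient current.
# HYPOTHETICAL values: 3.3 V rail, 5% allowed ripple, 0.5 A current step.

def target_impedance(vdd: float, ripple_fraction: float, i_step: float) -> float:
    """Z_target = (Vdd * allowed ripple fraction) / transient current, in ohms."""
    return vdd * ripple_fraction / i_step

z = target_impedance(3.3, 0.05, 0.5)
print(f"Keep the PDN below about {z * 1e3:.0f} mOhm across the frequencies of interest")
```

The planes or wide traces handle the low end of that range, and the decoupling caps have to hold the impedance under the same limit at higher frequencies.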
For this processor and your board I think 0.1 uF 0402s and 10 uF 0805s are a good choice. The smaller package size helps you keep the loop small. I can do 0201 by hand, though it is easier with a microscope; I've never bought a 1005. For more complex designs we select a range of decoupling capacitors to cover the range of frequencies that the part might demand from us. Doing this blindly, for example just using 0.1 uF, 0.01 uF, and 0.001 uF as is often suggested, can lead to nasty anti-resonance peaks that give you high impedance at certain frequencies. Again, this is a simplification, but I don’t think digging into it here will help you. It is interesting to note that placing the 10 uF capacitors further away is OK, since their role in this design is to cover the lower frequencies, where the impedance caused by trace inductance is lower. Also, the frequency range you can effectively decouple is limited by the package impedance we discussed earlier.
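A minimal sketch of the anti-resonance effect, assuming each capacitor behaves as a series ESR-ESL-C circuit. The 0.1 uF and 0.001 uF values echo the list above, but the ESL and ESR numbers are hypothetical round figures; real values come from the manufacturer's impedance curves. Between the two self-resonant frequencies the larger cap is already inductive while the smaller one is still capacitive, so together they form a parallel resonance with a high impedance peak.

```python
import math

# Each capacitor modelled as a series R-L-C (ESR, ESL, C).
# All ESL/ESR values are HYPOTHETICAL, not datasheet figures.

def cap_impedance(f: float, c: float, esl: float, esr: float) -> complex:
    """Complex impedance of a series ESR + ESL + C at frequency f."""
    w = 2 * math.pi * f
    return complex(esr, w * esl - 1 / (w * c))

def parallel(*zs: complex) -> complex:
    """Combine impedances in parallel."""
    return 1 / sum(1 / z for z in zs)

def srf(c: float, esl: float) -> float:
    """Self-resonant frequency 1 / (2*pi*sqrt(L*C))."""
    return 1 / (2 * math.pi * math.sqrt(esl * c))

CAPS = [  # (C, ESL, ESR)
    (0.1e-6, 0.5e-9, 0.02),
    (0.001e-6, 0.5e-9, 0.05),
]

# Log-spaced sweep from 1 MHz to 1 GHz.
freqs = [10 ** (6 + 3 * i / 599) for i in range(600)]
mags = [abs(parallel(*(cap_impedance(f, *p) for p in CAPS))) for f in freqs]

for c, esl, _ in CAPS:
    print(f"{c * 1e6:g} uF self-resonance: {srf(c, esl) / 1e6:.0f} MHz")

# The largest |Z| between the two self-resonances is the anti-resonant peak.
lo, hi = srf(*CAPS[0][:2]), srf(*CAPS[1][:2])
peak_f, peak_z = max(((f, z) for f, z in zip(freqs, mags) if lo < f < hi),
                     key=lambda t: t[1])
print(f"Anti-resonance near {peak_f / 1e6:.0f} MHz, |Z| about {peak_z:.2f} ohm")
```

Changing the assumed ESL, or adding the 0.01 uF part, moves or splits the peak, which is why picking values by rote can leave a high-impedance gap right where the chip needs current.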
Actual part selection
There are tons of capacitors out there, and usually we don’t make specific part recommendations. But I would look for a 0402 0.1 uF ceramic capacitor with maybe an X7R temperature coefficient, and a voltage rating double your VCC. Here’s an example of one I have on a BOM
OK, long-winded response, I guess, but sometimes if you understand why something is done, it's easier to decide how to do it.
So you say:
2 layer board: Seems OK for this, though I always prefer 4, as explained above. There are other benefits, such as controlled-impedance traces, less noise, and an easier time passing EMI testing. I don’t know what your board will do, but without reference planes your traces' return currents will be forced to follow whatever GND paths they can find. It gets a little messy.
GND pours: Meh, they will help balance the copper on the top and bottom layers for etching and reflow, but you’ll carve them up so much with traces that they won’t do you much good. Better to concentrate on getting power to that chip with as low an impedance as possible. Maybe you can figure out how to run VCC and GND as two copper pours?
Components on top: OK, doesn’t really matter; in this case it’s better to have decoupling on top than to go through vias to the bottom. If you are hand assembling it doesn’t really matter, but single-sided assembly would be cheaper to manufacture.
Traces on top and bottom: Definitely; you probably won’t get away without this.
Decoupling: I talked about this at length.
Ah, what else? Oh, the ferrite. I didn’t see that in the app note. I’m assuming it’s used to isolate one of the more sensitive VCC pins, maybe for a PLL or an ADC. Does it actually go VCC supply -> ferrite -> VCC pin, with the cap from the VCC pin to GND? If so, that makes sense; it’s probably just a little filter.
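If the ferrite and cap do form a little filter, the sketch below estimates where it starts to attenuate, treating the bead as a plain inductor (roughly valid below the frequency where the bead becomes mostly resistive). The 1 uH bead inductance and 1 uF cap are hypothetical values, not taken from the app note.

```python
import math

# Ferrite bead treated as a plain inductor feeding a cap to GND,
# i.e. a second-order LC low-pass filter.
# HYPOTHETICAL values: 1 uH bead inductance, 1 uF cap at the VCC pin.

def lc_corner_hz(l_henry: float, c_farad: float) -> float:
    """Corner (resonant) frequency of an LC low-pass: 1 / (2*pi*sqrt(L*C))."""
    return 1 / (2 * math.pi * math.sqrt(l_henry * c_farad))

L_BEAD = 1e-6  # hypothetical bead inductance
C_PIN = 1e-6   # hypothetical cap from VCC pin to GND

f_c = lc_corner_hz(L_BEAD, C_PIN)
# For an ideal LC, attenuation rises at about 40 dB/decade above f_c.
print(f"Filter corner about {f_c / 1e3:.0f} kHz")
```

A lightly damped LC can also peak near that corner, so it is worth checking the bead's impedance curve before relying on an estimate like this.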
Got any questions? Just ask. It's hard to put everything you need to know about decoupling in one answer, but hopefully this helps.
While not exactly what you're looking for, I have used power-management ICs to accomplish this. For instance, the TPS2113APW. I prefer this specific chip because it allows me to make dual-powered devices that can operate with either a wall-wart or off the USB, automatically preferring wall-power if it is available.
If you don't need dual power, you could use something like the MIC2545A.
Ultimately, any capacitance "behind" the power-management IC (i.e. hooked up to the IC outputs) isn't "seen" by the USB; the bus only sees the capacitance "in front of" the IC (i.e. hooked up to IC inputs).
You still have to worry about inrush current (the "plus any capacitive effects visible through the regulator" part of the spec), but those ICs also have adjustable current limiting. Figure out the parallel resistances you need for a 100 mA limit and a 500 mA limit (and optionally an n mA limit if you want to limit wall power), and then use FETs to short out the resistors as needed to enable the various limits.
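One reading of the switched-resistor scheme, sketched below: a base resistor is always connected and a FET switches a second resistor in parallel with it to lower the effective value. It assumes that a larger resistor gives a lower current limit (check the part's datasheet), and the 4.7 kOhm and 1 kOhm targets are hypothetical placeholders for whatever the datasheet's current-limit equation gives you.

```python
# Switched parallel resistor for selecting a current limit.
# R_100MA and R_500MA are HYPOTHETICAL placeholders; derive the real values
# from the datasheet's current-limit equation for your IC.

def parallel_r(*rs: float) -> float:
    """Equivalent resistance of resistors in parallel."""
    return 1 / sum(1 / r for r in rs)

def switched_partner(r_base: float, r_target: float) -> float:
    """Resistor to place in parallel with r_base to reach r_target."""
    if not r_target < r_base:
        raise ValueError("adding a parallel resistor can only lower the resistance")
    return 1 / (1 / r_target - 1 / r_base)

R_100MA = 4700.0  # hypothetical: always connected, gives the 100 mA limit
R_500MA = 1000.0  # hypothetical: effective value needed for the 500 mA limit

r_switched = switched_partner(R_100MA, R_500MA)
print(f"Always-on: {R_100MA:.0f} ohm, FET-switched: {r_switched:.0f} ohm")
print(f"With the FET on: {parallel_r(R_100MA, r_switched):.0f} ohm")
```

Starting in the highest-resistance (lowest-limit) state means the board attaches at 100 mA by default and only steps up after the FET is turned on.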
Using these chips, I have attached PCBs with several hundred uF to USB, and a DMM set to fast max-current capture verified that the inrush during attachment did not exceed 100 mA.
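A back-of-envelope check on that result, assuming the IC charges the downstream capacitance at a constant current equal to its limit, so t = C·V/I. The 470 uF figure is a hypothetical stand-in for "several hundred uF", with VBUS taken as a nominal 5 V.

```python
# Constant-current charging: t = C * V / I.
# HYPOTHETICAL values: 470 uF downstream, 5 V nominal VBUS, 100 mA limit.

def charge_time_s(c_farad: float, v_volts: float, i_amps: float) -> float:
    """Time to charge C to V at constant current I."""
    return c_farad * v_volts / i_amps

t = charge_time_s(470e-6, 5.0, 0.100)
print(f"Charging 470 uF to 5 V at 100 mA takes about {t * 1e3:.1f} ms")
```

Tens of milliseconds is long enough for a fast max-current meter to catch any spike above the limit, so a clean reading is a reasonable sanity check.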