I'd say you're dreaming. The main problem will be the limited RAM.
In 2004, Eric Biederman managed to get a kernel booting with 2.5MB of RAM, with a lot of functionality removed.
However, that was on x86, and you're talking about ARM. So I tried to build the smallest possible ARM kernel, for the 'versatile' platform (one of the simplest). I turned off all configurable options, including the ones that you're looking for (USB, WiFi, SPI, I2C), to see how small it would get. Now, I'm just referring to the kernel here, and this does not include any userspace components.
The good news: it will fit in your flash. The resulting zImage is 383204 bytes.
The bad news: with 256kB of RAM, it won't be able to boot:
    $ size obj/vmlinux
       text    data     bss     dec     hex filename
     734580   51360   14944  800884   c3874 obj/vmlinux
The .text segment is bigger than your available RAM, so the kernel can't decompress, let alone allocate memory to boot, let alone run anything useful.
One workaround would be to use the execute-in-place support (CONFIG_XIP), if your system supports that (i.e., it can fetch instructions directly from flash). However, that means your kernel needs to fit uncompressed in flash, and 734kB > 700kB. Also, the .data and .bss sections total 66kB, leaving about 190kB for everything else (i.e., all dynamically-allocated data structures in the kernel).
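To make the arithmetic above easy to re-check for other boards, here is a small sketch. The section sizes are copied from the `size` output; the 256kB RAM and 700kB flash figures come from the question, and treating "kB" as 1024 bytes is my assumption. The function names are just illustrative.

```python
# Section sizes in bytes, copied from the `size obj/vmlinux` output.
TEXT, DATA, BSS = 734580, 51360, 14944

# Targets from the question. Assumption: "kB" is treated as 1024 bytes.
RAM_BYTES = 256 * 1024
FLASH_BYTES = 700 * 1024


def fits_in_ram(text, data, bss, ram_bytes):
    """Non-XIP case: .text, .data and .bss all have to live in RAM."""
    return text + data + bss <= ram_bytes


def xip_budget(text, data, bss, ram_bytes, flash_bytes):
    """XIP case: .text runs from flash, .data and .bss live in RAM.

    Returns (text_fits_in_flash, ram_left_in_bytes). Only .text is checked
    against flash here; the initial values of .data would add to that.
    """
    text_fits = text <= flash_bytes
    ram_left = ram_bytes - (data + bss)
    return text_fits, ram_left


if __name__ == "__main__":
    print("fits in RAM without XIP:", fits_in_ram(TEXT, DATA, BSS, RAM_BYTES))
    text_fits, ram_left = xip_budget(TEXT, DATA, BSS, RAM_BYTES, FLASH_BYTES)
    print("XIP .text fits in flash:", text_fits)
    print("RAM left for dynamic allocations:", ram_left, "bytes")
```

With decimal kilobytes instead (256000 bytes of RAM), the leftover works out to the "about 190kB" quoted above; either way .text does not fit in the flash.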
That's just the kernel. Without the drivers you need, or any userspace.
So, yes, you're going to need a bit more RAM.
This may not be a complete, independent and ready-to-use answer, but I think you can get some neat ideas from it and implement (or port) them on a Cortex-M0. Here I assume that, from a computation power and resource standpoint, a Cortex-M0 has more to offer than the popular Atmel 8-bit AVR (ATmega328P) running on an Arduino.
Here are 2 projects that manage to use a PWM pin of the Arduino and an RC-filter circuit to play out synthesized speech. Of course, we are not looking at high-fidelity audio, but something that is recognizable. Also note that, apart from the need for a PWM-capable pin, your micro-controller might be very busy during the synthesis, so much so that it might spend most of its cycles doing it. Software PWM would put further strain on it.
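To get a rough feel for the numbers involved in PWM audio, here is a minimal sketch. It is not taken from either project; the clock speed, sample rate and component values in it are my own assumptions, so substitute the ones for your part. It shows the RC low-pass cutoff, the mapping from an 8-bit sample to a PWM compare value, and the CPU cycles available per output sample.

```python
import math


def rc_cutoff_hz(r_ohms, c_farads):
    """-3 dB cutoff of a first-order RC low-pass filter: 1 / (2*pi*R*C)."""
    return 1.0 / (2.0 * math.pi * r_ohms * c_farads)


def sample_to_compare(sample, top=255):
    """Scale an unsigned 8-bit sample to a compare value for a counter 0..top."""
    if not 0 <= sample <= 255:
        raise ValueError("sample must be in 0..255")
    return (sample * top + 127) // 255  # integer scaling with rounding


def cycles_per_sample(cpu_hz, sample_rate_hz):
    """CPU cycles available to synthesize each output sample."""
    return cpu_hz // sample_rate_hz


if __name__ == "__main__":
    # Assumed values: 1 kOhm and 100 nF filter, 48 MHz core, 8 kHz speech.
    print("filter cutoff (Hz):", round(rc_cutoff_hz(1_000, 100e-9), 1))
    print("compare value for mid-scale:", sample_to_compare(128))
    print("cycles per sample:", cycles_per_sample(48_000_000, 8_000))
```

The cycles-per-sample figure is the budget the synthesis code has to fit into, which is why the CPU can end up almost fully occupied.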
Now for the 2 projects:
PS: Personally, I've not implemented them, but I looked at them for a project.
Best Answer
The behavior of the core does depend on the implementation. The Flash is not integral to the ARM core, and as such, every vendor implements it differently. Typically, during the erase/write process one would execute from RAM, and execution should not be affected.
On the STM32, I believe reads from flash stall while erase/write cycles are ongoing. This would cause the core's execution to stall until the operation completes. With some of the flash configurations, I believe you can continue to execute from and read flash, and it will only stall when you access the part of the flash that you are erasing/programming.
I've used other Cortex-M parts where you must execute from RAM while modifying flash contents, otherwise you will encounter a bus fault (and likely a system crash if your bus fault/hard fault handlers are in flash). Some micros with large amounts of flash implement it as two independent flash arrays, and these typically allow full access to one bank while operating on the other.
You would need to refer to the documentation for your specific part to see the limitations of execution while modifying flash contents.