I vote for DMA. It's very flexible on Cortex-M3 and up: you can do all kinds of things, like automatically moving data from one place to another at a specified rate, or on certain events, without spending any CPU cycles. DMA is also much more reliable than a software copy loop.
But it might be quite hard to understand in detail.
Another option is a soft core on an FPGA, with the timing-critical parts implemented in hardware.
I'd say you're dreaming. The main problem will be the limited RAM.
In 2004, Eric Biederman managed to get a kernel booting with 2.5MB of RAM, with a lot of functionality removed.
However, that was on x86, and you're talking about ARM. So I tried to build the smallest possible ARM kernel, for the 'versatile' platform (one of the simplest). I turned off all configurable options, including the ones that you're looking for (USB, WiFi, SPI, I2C), to see how small it would get. Now, I'm just referring to the kernel here, and this does not include any userspace components.
The good news: it will fit in your flash. The resulting zImage is 383204 bytes.
The bad news: with 256kB of RAM, it won't be able to boot:
$ size obj/vmlinux
   text    data     bss     dec     hex filename
 734580   51360   14944  800884   c3874 obj/vmlinux
The .text segment is bigger than your available RAM, so the kernel can't decompress, let alone allocate memory to boot, let alone run anything useful.
One workaround would be to use the execute-in-place support (CONFIG_XIP_KERNEL), if your system supports that (i.e., it can fetch instructions directly from flash). However, that means your kernel needs to fit uncompressed in flash, and 734kB > 700kB. Also, the .data and .bss sections total 66kB, leaving about 190kB for everything else (i.e., all dynamically-allocated data structures in the kernel).
And that's just the kernel, without the drivers you need or any userspace.
So, yes, you're going to need a bit more RAM.
Overhead is not related to preemption. Preemption stops your process to run another one; if you disable it for, e.g., one CPU core, that core is dedicated to your process.
Still, there is overhead if you do I/O through the Linux I/O functions instead of, e.g., controlling the I/O lines directly from your process. But for any reasonably complicated I/O, you would have to implement a function with similar overhead yourself, and you can assume the Linux folks are better at that than you: more experience and more test cases.
So the only way you could win the overhead race is with very basic I/O functions (e.g. bitbanging an exotic, fast protocol) that you implement yourself, whether in "your kernel" or in "your userspace process", instead of running it through the various abstraction layers of the Linux GPIO subsystem.