I'd say you're dreaming. The main problem will be the limited RAM.
In 2004, Eric Biederman managed to get a kernel booting with 2.5MB of RAM, with a lot of functionality removed.
However, that was on x86, and you're talking about ARM. So I tried to build the smallest possible ARM kernel, for the 'versatile' platform (one of the simplest). I turned off all configurable options, including the ones that you're looking for (USB, WiFi, SPI, I2C), to see how small it would get. Now, I'm just referring to the kernel here, and this does not include any userspace components.
The good news: it will fit in your flash. The resulting zImage is 383204 bytes.
The bad news: with 256kB of RAM, it won't be able to boot:
$ size obj/vmlinux
   text    data     bss     dec     hex filename
 734580   51360   14944  800884   c3874 obj/vmlinux
The .text segment is bigger than your available RAM, so the kernel can't decompress, let alone allocate memory to boot, let alone run anything useful.
One workaround would be to use the execute-in-place support (CONFIG_XIP_KERNEL), if your system supports that (i.e., it can fetch instructions directly from Flash). However, that means your kernel needs to fit uncompressed in flash, and 734kB > 700kB. Also, the .data and .bss sections total 66kB, leaving about 190kB for everything else (i.e., all dynamically-allocated data structures in the kernel).
That's just the kernel. Without the drivers you need, or any userspace.
So, yes, you're going to need a bit more RAM.
This depends on the device.
RAM can be built faster than Flash; this starts to become important in about the 100MHz range.
Simple microcontrollers
Small slow microcontrollers execute directly out of Flash. These systems usually have more Flash than SRAM too.
Midrange systems
Once your device gets faster, the situation is a little different. Midrange ARM systems may also execute directly out of Flash, or they may have a mask-ROM bootloader that does something smarter: perhaps downloading code from USB or external EEPROMs into internal SRAM.
Large systems
Larger, faster systems will have external DRAM and external Flash. This is typical of a mobile phone architecture. At this point, there is plenty of RAM available and it's faster than the Flash, so the bootloader will copy the code from Flash into RAM and execute it there. This may involve shovelling it through the CPU registers, or it may involve a DMA transfer if a DMA unit is available.
Harvard architectures are typically small, so they don't bother with the copying phase. I've seen an ARM with a "hybrid Harvard" arrangement: a single address space containing various memories, but two different fetch units. Code and data can be fetched in parallel, as long as they are not from the same memory. So you could fetch code from Flash and data from SRAM, or code from SRAM and data from DRAM, etc.
First of all, 1 MS/s at 16 bits is just 2MB/s – that's really not much for USB2 to carry. In my opinion, there's no need for dual-port RAM if we're talking about devices that lend themselves to visualization or have PCIe, as your Arch 2 suggests.
The fact that you're doing visualization implies you don't care about latency – what's half a millisecond to the human eye? So, you're pretty free with respect to choice of sample transport.
So:
Arch 1
Lots of components, including an FPGA that does nothing but write a lowly 1 million samples per second to a RAM interface. I'd say, if you go that way, use a feasibly fast bus (simple SPI or QSPI would do), and a bit of RAM with the FPGA to implement a ring buffer. No need for dual-port RAM: you'd need to communicate information like "OK, there are new samples available for you" or "no, nothing to fetch right now" anyway.
Arch 2
PCIe sounds like a huge overhead here. Again, the rate we're talking about is 2MB/s.
Arch 3
If your ADC and your SoC allow you to do that, start with that! It certainly sounds like the easiest, lowest-component-count solution. Often, this doesn't work for electrical reasons. SPI is absolutely a normal interface for an embedded system to have, so I'd assume that it'd be rather easy to find a controller that has it.
The problem remains that you'd still need something to, e.g., generate your sample clock.
Arch 4
Well, yeah: as you say, a less great version of Arch 3.
Arch 5
1MS/s isn't really high-throughput. In fact, I remember writing firmware for a now-defunct ARM Cortex-M0 project that ran the internal ADC at 500kS/s and pushed the data through USB2 to a PC. With a slightly more capable MCU, you should be able to do the same. That way, you'd have a cheap-as-hell device dedicated to handling ADC data and stuffing it into USB packets, and you'd just have to write a couple of lines of Python or C to run on your embedded device to ask the microcontroller for USB bulk packets full of data. Bonus: you can clock down your main CPU whenever you want to, and it will have no effect on the sampling.
Arch 6
Kinda easy. You can do it all: minimal visualization, sampling at several megasamples per second (complex samples), and a bit of analysis, on an ARM Cortex-M4 with the help of a small glue FPGA (without its own RAM, IIRC). The open design of the HackRF One proves this. I think it might be worth your while to look into it. From my perspective, it sounds like you'd basically just want to throw out all the RF stuff and use the rest as is. You'd even get drivers and firmware for free!
HackRF hardware components diagrams from the project wiki
The diagram above is simplified: as mentioned, there's a small "glue" FPGA between the ADC/DAC hybrid and the LPC Cortex-M4, as the schematic will tell you.