I learned on a 68HC11 in college. They are very simple to work with, but honestly most low-power microcontrollers will be similar (AVR, 8051, PIC, MSP430). The biggest thing that adds complexity to ASM programming for microcontrollers is the number and type of supported memory addressing modes. You should avoid more complicated devices, such as higher-end ARM processors, at first.
I'd probably recommend the MSP430 as a good starting point. Maybe write a program in C and learn by replacing various functions with inline assembly. Start simple, x + y = z, etc.
After you've replaced a function or algorithm with assembly, compare how you coded it with what the C compiler generated. This is probably one of the better ways to learn assembly, in my opinion, and at the same time you learn how a compiler works, which is incredibly valuable as an embedded programmer. Just make sure you turn off optimizations in the C compiler at first, or you'll likely be very confused by the compiler's generated code. Then gradually turn on optimizations and note what the compiler does.
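As a starting point for that exercise, a minimal sketch might look like the following. The function name `add` and the 16-bit types are only illustrative assumptions; any small function will do.

```c
#include <stdint.h>

/* Candidate for replacement with hand-written assembly: z = x + y.
 * Keep this C version around so its compiler output can be compared
 * with your own assembly. */
int16_t add(int16_t x, int16_t y)
{
    return (int16_t)(x + y);
}
```

With a GCC-based toolchain, the `-S` flag writes out the generated assembly instead of an object file, and `-O0` disables optimization. Compiling the same file again with `-O1` or `-O2` shows what the optimizer changes.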
RISC vs CISC
RISC means 'Reduced Instruction Set Computing'. It doesn't refer to a particular instruction set, just a design strategy in which the CPU has a minimal instruction set: few instructions that each do something basic. There is no strict technical definition of what it takes 'to be RISC'. CISC architectures, on the other hand, have lots of instructions, but each 'does more'.
The purported advantages of RISC are that the CPU design needs fewer transistors, which means less power usage (big for microcontrollers), cheaper manufacturing, and higher clock rates leading to greater performance. Lower power usage and cheaper manufacturing are generally true; greater performance hasn't really lived up to the goal, as a result of design improvements in CISC architectures.
Almost all CPU cores today are RISC or 'middle ground' designs. That holds even for the most famous (or infamous) CISC architecture, x86: modern x86 CPUs are internally RISC-like cores with a decoder bolted on the front end that breaks x86 instructions down into multiple RISC-like instructions. I think Intel calls these 'micro-ops'.
As to which (RISC vs CISC) is easier to learn in assembly, I think it's a toss-up. Doing something with a RISC instruction set generally requires more lines of assembly than doing the same thing with a CISC instruction set. On the other hand, CISC instruction sets are more complicated to learn due to the greater number of available instructions.
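To make the "more lines of assembly" point concrete, here is a toy model in C. It is not any real instruction set: `toy_cpu` and all the function names are made up for illustration. It performs "add a register to a memory location" once as a single CISC-style instruction and once as a RISC-style load/add/store sequence, counting the instructions executed.

```c
#include <stdint.h>

/* Toy model only: a tiny memory, four registers, and an instruction counter. */
typedef struct {
    uint8_t  mem[16];
    uint8_t  reg[4];
    unsigned executed;   /* number of "instructions" executed so far */
} toy_cpu;

/* CISC style: one instruction operates directly on a memory operand. */
void cisc_add_mem(toy_cpu *c, uint8_t addr, uint8_t rs)
{
    c->mem[addr] = (uint8_t)(c->mem[addr] + c->reg[rs]);
    c->executed++;
}

/* RISC (load-store) style: arithmetic only works on registers. */
void risc_load(toy_cpu *c, uint8_t rd, uint8_t addr)
{
    c->reg[rd] = c->mem[addr];
    c->executed++;
}

void risc_add(toy_cpu *c, uint8_t rd, uint8_t rs)
{
    c->reg[rd] = (uint8_t)(c->reg[rd] + c->reg[rs]);
    c->executed++;
}

void risc_store(toy_cpu *c, uint8_t addr, uint8_t rs)
{
    c->mem[addr] = c->reg[rs];
    c->executed++;
}

/* The same operation as above, spelled out as three simple instructions.
 * `tmp` is a scratch register that gets overwritten. */
void risc_add_mem(toy_cpu *c, uint8_t addr, uint8_t rs, uint8_t tmp)
{
    risc_load(c, tmp, addr);
    risc_add(c, tmp, rs);
    risc_store(c, addr, tmp);
}
```

Both versions leave the same value in memory; the RISC-style one takes three instructions instead of one, but each of them is simpler to understand.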
Most of the reason CISC gets a bad name is that x86 is by far the most common example and is a bit of a mess to work with. I think that's mostly a result of the x86 instruction set being very old and having been expanded half a dozen or more times while maintaining backward compatibility. Even your 4.5 GHz Core i7 can run in 16-bit real mode, as on the original 8086 (and does at boot).
As for ARM being a RISC architecture, I'd consider that moderately debatable. It's certainly a load-store architecture. The base instruction set is RISC-like, but in recent revisions the instruction set has grown quite a bit, to the point where I'd personally consider it more of a middle ground between RISC and CISC. The Thumb instruction set is really the most 'RISCish' of the ARM instruction sets.
Most small microcontrollers will be capable of doing what you need. You could even ditch the Arduino "wrapper" and use a USB-capable micro in its place.
Microchip, Atmel, TI, ST, etc. all have 8-, 16- and 32-bit uCs of varying RAM/Flash/EEPROM sizes to pick from. All the modern uCs come with at least UART, SPI and I2C peripherals that can be used for your communications.
There is not a lot between them really; I'd just pick one and see how you like it.
I (currently) use ST's 32-bit ARMs and Microchip's 8-, 16- and 32-bit PICs.
I'd probably use a few PIC12F or 16Fs for the slave uCs and a PIC18F or PIC24F for the master.
You mention needing ~10kbits of memory, though from your description it's not quite clear to me what type of memory or which uC needs it.
It's easy to determine what is suitable, though: just check the RAM/ROM/EEPROM specs of each uC you look at.
For example the PIC16F1938 has:
Parameter Name           Value
Program Memory Type      Flash
Program Memory (KB)      28
CPU Speed (MIPS)         8
RAM Bytes                1,024
Data EEPROM (bytes)      256
So 28KB of program memory is more than enough to store non-volatile data if your program is small enough (on the newer PICs you can also read/write to program memory at run time). 10kbits will not quite fit into the RAM though, at 1024 * 8 = 8192 bits.
The 16F1527 has 1536 bytes of RAM though, so you could use this if necessary.
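The same check can be written as a small helper. This is only a sketch: the function names are made up, and it assumes "10kbits" means 10,000 bits (10,240 gives the same answers for both parts). It also ignores the RAM your firmware itself needs for its stack and variables.

```c
#include <stdbool.h>
#include <stdint.h>

/* Convert a RAM size in bytes to bits. */
uint32_t ram_bits(uint32_t ram_bytes)
{
    return ram_bytes * 8u;
}

/* True if a buffer of `needed_bits` fits into `ram_bytes` of RAM.
 * Does not account for RAM used by the rest of the firmware. */
bool fits_in_ram(uint32_t needed_bits, uint32_t ram_bytes)
{
    return needed_bits <= ram_bits(ram_bytes);
}
```

For example, `fits_in_ram(10000, 1024)` is false for the PIC16F1938 (8192 bits), while `fits_in_ram(10000, 1536)` is true for the 16F1527 (12288 bits).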
For the master (alternatives to Arduino) there is something like the 18F25J50 or similar, which has a USB 2.0 peripheral. Microchip provide a USB stack and plenty of example firmware to get you started with USB.
If you need something more powerful for the master, have a look at the PIC24 series with up to 256K of Flash and 96K of RAM, or even the PIC32, which is 32-bit and runs at up to 80 MIPS.
The PICkit 3 is a low-priced programmer that will program all the above-mentioned PICs, and MPLAB (or MPLAB X) is a free IDE for firmware development.
Communication can be done with I2C, which handles the master/slave configuration and addressing easily; all you have to worry about is sending the data. 7 meters should be no problem in a reasonably quiet environment with the right setup (low-value pull-ups, say 2.2k, and low-capacitance cable).
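To give an idea of how little the addressing involves: with standard 7-bit I2C addressing, the first byte the master sends after the START condition carries the slave address in the upper seven bits and the read/write flag in bit 0. A minimal sketch of building that byte follows; the function and macro names are made up here, and on a real PIC the hardware I2C peripheral or Microchip's library code would do the actual transfer.

```c
#include <stdint.h>

#define I2C_WRITE 0u   /* R/W bit = 0: master writes to the slave  */
#define I2C_READ  1u   /* R/W bit = 1: master reads from the slave */

/* Build the address byte for a 7-bit slave address:
 * bits 7..1 = address, bit 0 = read/write flag. */
uint8_t i2c_address_byte(uint8_t addr7, uint8_t rw)
{
    return (uint8_t)(((addr7 & 0x7Fu) << 1) | (rw & 1u));
}
```

So a slave at the (arbitrarily chosen) address 0x10 is addressed with 0x20 for a write and 0x21 for a read, and each slave uC only needs its own unique 7-bit address.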
You ask for "context saving", but you don't seem to know what that term means in your context?
The meaning I am most familiar with is in the context of interrupts and task switching, where everything that the main program or a task relies on is saved in RAM, to be restored later. In most cases this amounts to pushing all registers on the stack, so they can be popped later.
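The push/pop idea can be simulated in plain C. This is only a model with a made-up register set (`w`, `status`, `fsr`) and a small software stack; on a real chip the saving is done in assembly or by the compiler's interrupt prologue and epilogue.

```c
#include <stdint.h>

#define CTX_DEPTH 4u

/* Hypothetical register set that the interrupted code relies on. */
typedef struct {
    uint8_t w;
    uint8_t status;
    uint8_t fsr;
} cpu_regs;

/* A small stack in RAM holding saved contexts. */
typedef struct {
    cpu_regs slots[CTX_DEPTH];
    unsigned depth;
} ctx_stack;

/* Save a copy of the registers; returns -1 if the stack is full. */
int ctx_push(ctx_stack *s, const cpu_regs *r)
{
    if (s->depth >= CTX_DEPTH) {
        return -1;
    }
    s->slots[s->depth++] = *r;
    return 0;
}

/* Restore the most recently saved registers; returns -1 if empty. */
int ctx_pop(ctx_stack *s, cpu_regs *r)
{
    if (s->depth == 0u) {
        return -1;
    }
    *r = s->slots[--s->depth];
    return 0;
}

/* Simulated interrupt: save the context, clobber the registers while
 * doing the handler's work, then restore the context before returning. */
void simulated_isr(cpu_regs *regs, ctx_stack *s)
{
    if (ctx_push(s, regs) != 0) {
        return;                 /* no room to save: do nothing */
    }
    regs->w = 0xAA;             /* handler work overwrites registers */
    regs->status = 0x55;
    regs->fsr = 0x20;
    (void)ctx_pop(s, regs);     /* interrupted code sees its old values */
}
```

After `simulated_isr` returns, the interrupted code finds its registers exactly as it left them, which is the whole point of context saving.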
Things can get difficult when there is context outside the CPU registers that can be used by the interrupt (or by other tasks), so it must be saved too. Think for instance of floating point coprocessor registers.
On the old 12- and 14-bit core PICs, context saving for an interrupt is a bit tricky, but it is explained in the datasheet, so it is better to read it there. Note that on these chips various memory-mapped registers can be context too, like the indirection register. If your interrupt routine uses such registers, they probably must be saved (and restored) too.
Real context swapping for task switching is not possible on these PICs, because the stack cannot be changed. There are some dirty tricks that achieve the same effect (like not using the hardware stack at all), but at a cost.
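One common stack-free approach (not necessarily the trick meant above) is to write each task as a state machine that runs one short step per call and keeps everything it needs between calls in a small structure rather than on the stack. A minimal sketch with made-up names:

```c
#include <stdint.h>

/* Everything a task must remember between steps lives here,
 * not on the call stack. */
typedef struct {
    uint8_t  state;
    uint16_t counter;
} task_ctx;

/* Run one step of a task and return to the scheduler. */
void task_step(task_ctx *t)
{
    switch (t->state) {
    case 0:                     /* initialise */
        t->counter = 0;
        t->state = 1;
        break;
    case 1:                     /* count to 3, one increment per step */
        t->counter++;
        if (t->counter >= 3u) {
            t->state = 2;
        }
        break;
    default:                    /* done: start over */
        t->state = 0;
        break;
    }
}

/* Cooperative round-robin: give every task one step in turn. */
void scheduler_round(task_ctx *tasks, unsigned n)
{
    for (unsigned i = 0; i < n; i++) {
        task_step(&tasks[i]);
    }
}
```

One cost of this style is that a task can never block in the middle of a function; every wait has to be expressed as a state, which makes longer tasks awkward to write.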
The 18F PICs and, IIRC, the enhanced midrange chips too have a stack that can be read and written, so real context switching is possible, but it is tedious. If you want multitasking, better look for a CPU that has a memory-mapped stack. (Nowadays a Cortex would be an obvious choice.)