I agree that ARM is the way to go for 32-bit microcontrollers. ARM is ubiquitous, and its assembly language can be used across a broad range of microcontroller families. ARM also has good support from the GCC toolchain. The ARM7TDMI architecture has dominated the 32-bit MCU space for the last five years, and the ARM Cortex-M3 is the emerging replacement. The Cortex-M3 does have a Harvard architecture (separate instruction and data buses), but I don't feel that's a limitation.
Micromint has a solid reputation, and they offer a Cortex-M3 board with configurable options for a decent price. However, if you really need a DIP form factor, I've had success with the mbed.
Now, the next thing is languages. You mentioned FORTH. I also recommend Python-on-a-Chip and eLua as powerful, easy-to-learn languages that work on targets of this size. eLua is more fully developed, but has larger resource requirements than Python-on-a-Chip. Full disclosure: I'm the author of the PyMite VM used in Python-on-a-Chip. And if your goal is to make your own language, I fully understand the joy of that exercise.
Think about it. What exactly do you envision a "256 bit" processor being? What makes the bit-ness of a processor in the first place?
I think if no further qualifications are made, the bit-ness of a processor refers to its ALU width. This is the width of the binary number that it can handle natively in a single operation. A "32 bit" processor can therefore operate directly on values up to 32 bits wide in single instructions. Your 256 bit processor would therefore contain a very large ALU capable of adding, subtracting, ORing, ANDing, etc, 256 bit numbers in single operations. Why do you want that? What problem makes the large and expensive ALU worth having and paying for, even for those cases where the processor is only counting 100 iterations of a loop and the like?
The point is, you have to pay for the wide ALU whether you then use it a lot or only a small fraction of its capabilities. To justify a 256-bit ALU, you'd have to find an important enough problem that can really benefit from manipulating 256-bit words in single instructions. While you can probably contrive a few examples, there aren't enough such problems to make the manufacturers feel they will ever get a return on the significant investment required to produce such a chip. If there are niche but important (well-funded) problems that can really benefit from a wide ALU, then we would see very expensive, highly targeted processors for those applications. Their price, however, would prevent wide usage outside the narrow application they were designed for. For example, if 256 bits made certain cryptography applications possible for the military, specialized 256-bit processors costing hundreds to thousands of dollars each would probably emerge. You wouldn't put one of these in a toaster, a power supply, or even a car, though.
I should also be clear that the wide ALU doesn't just make the ALU more expensive, but other parts of the chip too. A 256 bit wide ALU also means there have to be 256 bit wide data paths. That alone would take a lot of silicon area. That data has to come from somewhere and go somewhere, so there would need to be registers, cache, other memory, etc, for the wide ALU to be used effectively.
Another point is that you can do any width arithmetic on any width processor. You can add a 32 bit memory word into another 32 bit memory word on a PIC 18 in 8 instructions, whereas you could do it on the same architecture scaled to 32 bits in only 2 instructions. The point is that a narrow ALU doesn't keep you from performing wide computations, only that the wide computations will take longer. It is therefore a question of speed, not capability. If you look at the spectrum of applications that need to use particular width numbers, you will see very very few require 256 bit words. The expense of accelerating just those few applications with hardware that won't help the others just isn't worth it and doesn't make a good investment for product development.
This is one of those subjects that can become highly debated. There are so many different points of view, and different things are important to different people. I will try to give a comprehensive answer, but understand that there will always be someone who disagrees. Just understand that those who disagree with me are wrong. (Just Kidding.)
This answer is going to be a long one, so let me summarize it up front. For the vast majority of people, the latest crop of ARM Cortex-M0/M3/M4 chips offers the best solution: the best features for the cost. This is true even when comparing these 32-bit MCUs to their 8- and 16-bit ancestors like the PIC and the MSP430. M0s can be bought for less than US$1 each and M4s for less than US$2 each, so except for very price-sensitive applications the ARM solutions are very attractive. M0s are very low power and should be good enough for most people. For those who are very power sensitive, the MSP430s might still be a better choice, but the M0s are worth considering even for those applications.
If you are interested in a more in-depth analysis then read on, otherwise you can stop reading now.
I will now look at each area and compare the different MCUs:
Speed of Execution
Of course the 32-bit MCUs are going to be faster. They tend to have a faster clock speed, but they also do more work for each of those clocks. MCUs like the ARM Cortex-M4 include DSP instructions and can even have floating-point support in hardware. 8- and 16-bit CPUs can operate on 32-bit numbers, but they are not efficient at it: doing so quickly consumes CPU registers, CPU clock cycles, and flash memory for program storage.
Ease of Development
In my opinion, this is the most valuable, but also the most under-appreciated, reason for using modern 32-bit MCUs. Let me first compare this to the 8-bit PICs. This is the worst-case comparison, but also the best one to illustrate my points.
The smaller PICs basically require that the programming be done in assembly language. True, there are C compilers available even for the 8-bit PICs, but those compilers are either free or good. You cannot get a compiler that is both good and free. The free version of the compiler is crippled in that its optimization is not as good as the "Pro" version's. The Pro version is approximately US$1,000 and only supports one family of PIC chips (8-, 16-, or 32-bit). If you want to use more than one family, then you have to buy another copy for another US$1,000. The "Standard" version of the compiler does a medium level of optimization and costs about US$500 per chip family. The 8-bit PICs are slow by modern standards and require good optimization. You can either fork over the money for a good compiler or you can write in assembly language; I prefer assembly in this case.
By comparison, there are many good C compilers for ARM MCUs that are free. Where there are limitations, they are usually on the maximum size of flash memory supported; in the Freescale CodeWarrior tools this limit is 128 kbytes, which is plenty for most people on this forum.
The advantage of using a C compiler is that you don't have to bother (as much) with the low-level details of the CPU's memory map. Paging on the PIC is particularly painful and is best avoided if at all possible. Another advantage is that you don't have to bother with the mess of handling 16- and 32-bit numbers on an 8-bit MCU (or 32-bit numbers on a 16-bit MCU). While it is not super difficult to do this in assembly language, it is a pain in the rear and is error prone.
There are other non-ARM C compilers that work well; the MSP430 compiler seems to do a reasonable job. The Cypress PSoC tools (especially for the PSoC1), however, are buggy.
Flat Memory Model
A MCU that has paged RAM/registers/Flash is just stupid. Yes, I am talking about the 8-bit PICs. Dumb, dumb, dumb. That turned me off of the PICs so much that I haven't even bothered to look at their newer stuff. (Disclaimer: this means that the new PICs might be improved and I just don't know it.)
With an 8-bit MCU it is difficult (but not impossible) to access data structures larger than 256 bytes. With a 16-bit MCU that gets increased to 64 kbytes or kwords. With 32-bit MCUs that goes up to 4 gigabytes.
A good C compiler can hide a lot of this from the programmer (a.k.a. you), but even then it affects program size and execution speed.
This will not be a problem for many MCU applications, but of course many others will run into it. It is mostly a question of how much data (arrays and structures) you need in RAM or flash. And of course, as CPU speed increases, so do the odds of using larger data structures!
Package Size

Some of the small PICs and other 8-bit MCUs are available in really small packages: 6 and 8 pins! Currently the smallest ARM Cortex-M0 that I know of comes in a QFN-28. While a QFN-28 is plenty small enough for most, it isn't small enough for all.
Cost

The cheapest PIC is about one third the price of the cheapest ARM Cortex-M0. But that is really US$0.32 vs. US$0.85. Yes, that price difference matters to some. But I posit that most people on this web site don't care about that small of a cost difference.
Likewise, when comparing more capable MCUs with the ARM Cortex-M0/M3/M4, the ARM Cortex usually comes out "roughly even" or on top. When factoring in the other things (ease of development, compiler costs, etc.), the ARMs are very attractive.
I guess the real question is: why would you NOT use an ARM Cortex-M0/M3/M4? When absolute cost is super important; when super low power consumption is critical; when the smallest package size is required; when speed is not important. But for the vast majority of applications none of these applies, and the ARM is currently the best solution.
Given the low cost, unless there is a good reason to not use an ARM Cortex, then it makes sense to use one. It will allow faster and easier development time with less headaches and larger design margins than most other MCUs.
There are other non-ARM Cortex 32-bit MCUs available, but I do not see any advantage to them either. There are many advantages to going with a standard CPU architecture, including better development tools, and faster innovation of the technology.
Of course, things can and do change. What I say is valid today, but might not be valid in a year or even a month from now. Do your own homework.