Now, I don't know much about Clojure and only a little about Scala, but I'll give it a shot.
First off, we need to differentiate between tail CALLS and tail RECURSION. Tail recursion (a function tail-calling itself) is indeed rather easy to transform into a loop. General tail calls are much harder, and impossible in the general case at compile time: you need to know what is being called, but with polymorphism and/or first-class functions you rarely know that statically, so the compiler cannot know how to replace the call. Only at runtime do you know the target code, and only then can you jump there without allocating another stack frame. For instance, the following fragment (in pseudocode) has a tail call and needs no stack space when properly optimized (including TCO), yet the call cannot be eliminated when compiling for the JVM:
function forward(obj: Callable<int, int>, arg: int) =
  let arg1 <- arg + 1 in obj.call(arg1)
While it's just a tad inefficient here, there are whole programming styles (such as Continuation-Passing Style, or CPS) that consist almost entirely of tail calls and hardly ever return. Doing that without full TCO means you can only run tiny bits of code before running out of stack space.
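To make the CPS point concrete, here is a sketch in Python (all names are my own): in naive CPS every call is a tail call that never returns, so deep inputs exhaust the stack; a trampoline recycles a single stack frame by hand, which is roughly what a tail-call instruction or full TCO would give you for free.

```python
def even_cps(n, k):
    # Continuation-passing style: instead of returning a result,
    # pass it to the continuation k. Every call is a tail call.
    if n == 0:
        return k(True)
    return odd_cps(n - 1, k)

def odd_cps(n, k):
    if n == 0:
        return k(False)
    return even_cps(n - 1, k)

# Without TCO each of these tail calls still pushes a Python stack
# frame, so even_cps(100000, ...) would raise RecursionError.

def even_tramp(n, k):
    # Trampolined variant: return a thunk instead of calling directly,
    # so the driver loop below reuses one stack frame per step.
    if n == 0:
        return lambda: k(True)
    return lambda: odd_tramp(n - 1, k)

def odd_tramp(n, k):
    if n == 0:
        return lambda: k(False)
    return lambda: even_tramp(n - 1, k)

def trampoline(thunk):
    while callable(thunk):
        thunk = thunk()
    return thunk

assert trampoline(lambda: even_tramp(100000, lambda r: r)) is True
```

The trampoline trades a bit of allocation per step for bounded stack usage; a VM-level tail-call instruction avoids even that overhead.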
What facility of the underlying virtual machine would allow the compiler to handle TCO more easily?
A tail call instruction, such as in the Lua 5.1 VM. Your example does not get much simpler. Mine becomes something like this:
push arg
push 1
add
load obj
tailcall Callable.call
// implicit return; stack frame was recycled
As a side note, I would not expect actual machines to be much smarter than the JVM.
You're right, they aren't. In fact, they are less smart and thus don't even know (much) about things like stack frames. That's precisely why one can pull tricks like re-using stack space and jumping to code without pushing a return address.
Depending on the language, it may not be necessary to use a call stack. Call stacks are only necessary in languages that allow recursion or mutual recursion. If the language does not allow recursion, then only one invocation of any procedure may be active at any moment, and the local variables for that procedure may be statically allocated. Such languages do have to make provision for context changes, such as interrupt handling, but this still does not require a stack.
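A sketch of this static-allocation scheme in Python (procedure names are mine; a FORTRAN IV compiler did this at compile time, here it is merely simulated): each procedure owns exactly one fixed activation record for the whole run, so a call pushes and pops nothing.

```python
# One statically allocated record of locals per procedure. No frames
# are pushed or popped, because no procedure can ever be active twice.
square_locals = {"x": 0, "result": 0}
sum_sq_locals = {"a": 0, "b": 0, "result": 0}

def square(x):
    square_locals["x"] = x                      # "store" the argument
    square_locals["result"] = square_locals["x"] ** 2
    return square_locals["result"]

def sum_of_squares(a, b):
    sum_sq_locals["a"], sum_sq_locals["b"] = a, b
    sum_sq_locals["result"] = (square(sum_sq_locals["a"])
                               + square(sum_sq_locals["b"]))
    return sum_sq_locals["result"]

assert sum_of_squares(3, 4) == 25
# A *recursive* square would silently clobber its own locals here;
# that is exactly why such languages forbid recursion.
```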
Refer to FORTRAN IV (and earlier) and early versions of COBOL for examples of languages that do not require call stacks.
Refer to the Control Data 6600 (and earlier Control Data machines) for an example of a highly-successful early supercomputer that did not provide direct hardware support for call stacks. Refer to the PDP-8 for an example of a very successful early minicomputer that did not support call stacks.
As far as I know, the Burroughs B5000 stack machines were the first machines with hardware call stacks. The B5000 machines were designed from the ground up to run ALGOL, which required recursion. They also had one of the first descriptor-based architectures, which laid groundwork for capability architectures.
As far as I know, it was the PDP-6 (which grew into the DEC-10) that popularized call stack hardware, when the hacker community at MIT took delivery of one and discovered that the PUSHJ (Push Return Address and Jump) operation allowed the decimal print routine to be reduced from 50 instructions to 10.
The most basic function-call semantics in a language that allows recursion require capabilities that match nicely with a stack. If that's all you need, then a plain stack is a good, simple match. If you need more than that, then your data structure has to do more.
The best example of needing more that I have encountered is the "continuation", the ability to suspend a computation in the middle, save it as a frozen bubble of state, and fire it off again later, possibly many times. Continuations became popular in the Scheme dialect of LISP, as a way to implement, among other things, error exits. Continuations require the ability to snapshot the current execution environment, and reproduce it later, and a stack is somewhat inconvenient for that.
Abelson & Sussman's "Structure and Interpretation of Computer Programs" goes into some detail on continuations.
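A small Python sketch of the error-exit use of continuations (Scheme's call/cc is more general; here it is simulated in continuation-passing style with names of my own choosing): the escape continuation lets a list product bail out the moment it sees a zero, abandoning all pending multiplications instead of unwinding them one by one.

```python
def product(items):
    # Multiply a list of numbers, using an escape continuation to
    # jump straight out on a 0 (the error-exit pattern from Scheme).
    def go(items, k):
        if not items:
            return k(1)
        head, tail = items[0], items[1:]
        if head == 0:
            return escape(0)    # discard k: skip all pending work
        return go(tail, lambda rest: k(head * rest))

    escape = lambda v: v        # top-level continuation: just deliver v
    return go(items, escape)

assert product([1, 2, 3, 4]) == 24
assert product([1, 2, 0, 4]) == 0
```

Note how the built-up continuation `k` is simply dropped on the zero case; a stack-based implementation would instead have to pop every pending frame, which is why continuations and plain stacks fit together awkwardly.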
Best Answer
No, and with good reason, IMHO.
The Y combinator is a theoretical construct, needed only to make the lambda calculus Turing-complete (remember, there are no loops in the lambda calculus, nor do lambdas have names we could use for recursion).
As such, the Y combinator is truly fascinating.
But: nobody actually uses the Y combinator for real-world recursion! (Except maybe for fun, to show that it really works.)
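For the "fun" case, here is the combinator in Python (strictly the applicative-order variant, often called the Z combinator, since Python evaluates eagerly and a plain Y would loop forever):

```python
# Z combinator: Y adapted for eager evaluation by eta-expanding x(x).
Z = lambda f: (lambda x: f(lambda v: x(x)(v)))(
              lambda x: f(lambda v: x(x)(v)))

# Recursion without any function referring to itself by name:
fact = Z(lambda rec: lambda n: 1 if n == 0 else n * rec(n - 1))
assert fact(5) == 120
```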
Tail-call optimization, OTOH, is, as the name says, an optimization. It adds nothing to the expressiveness of a language; we only care about it for practical reasons, like stack space and the performance of recursive code.
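To illustrate what the optimization buys, here is a tail-recursive function and its hand-converted loop form in Python (names are mine; Python itself performs no TCO, so the rewrite is done manually, exactly as a TCO pass would):

```python
def sum_to_rec(n, acc=0):
    """Tail-recursive: the recursive call is the very last action."""
    if n == 0:
        return acc
    return sum_to_rec(n - 1, acc + n)

def sum_to_loop(n, acc=0):
    """The same function after TCO: the parameters become loop
    variables and the tail call becomes a jump back to the top."""
    while n != 0:
        n, acc = n - 1, acc + n
    return acc

assert sum_to_rec(100) == sum_to_loop(100) == 5050
```

Both compute the same result; the loop version just runs in constant stack space, which is the entire point of the optimization.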
So your question is like asking: is there hardware support for beta reduction? (Beta reduction is how lambda expressions are evaluated, you know.) But no functional language I am aware of compiles its source code to a representation of lambda expressions that is beta-reduced at runtime.
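For illustration, a single beta-reduction step can be sketched in Python (a toy term representation of my own, ignoring variable capture for brevity):

```python
from dataclasses import dataclass

@dataclass
class Var:
    name: str

@dataclass
class Lam:
    param: str
    body: object

@dataclass
class App:
    fn: object
    arg: object

def subst(term, name, value):
    # Replace free occurrences of `name` in `term` by `value`
    # (no capture avoidance; fine for this closed example).
    if isinstance(term, Var):
        return value if term.name == name else term
    if isinstance(term, Lam):
        if term.param == name:          # `name` is shadowed here
            return term
        return Lam(term.param, subst(term.body, name, value))
    return App(subst(term.fn, name, value), subst(term.arg, name, value))

def beta_step(term):
    # One outermost beta step: (λx. body) arg  →  body[x := arg]
    if isinstance(term, App) and isinstance(term.fn, Lam):
        return subst(term.fn.body, term.fn.param, term.arg)
    return term

# (λx. x x) y  →  y y
redex = App(Lam("x", App(Var("x"), Var("x"))), Var("y"))
assert beta_step(redex) == App(Var("y"), Var("y"))
```

Nothing in a real compiler looks like this at runtime, which is the point: beta reduction, like the Y combinator, lives in the model, not in the hardware.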