Erlang isn't really performant in the way I think you're using the word. It has okay speed, but you don't want to write your game's physics engine in it. On the other hand, its concurrent performance is great--you can load up a VM with tons and tons of processes, and as long as none of them individually are pegging the CPU, the VM will efficiently get them all evaluated.
Basically, the key to this is a combination of Erlang's use of tail-calls as the only means of looping, and the VM's use of what I might call (I'm not actually sure of the name for it) "hybrid budget-based cooperative/pre-emptive scheduling."
Since all loops are implemented as tail-calls, every Erlang function is effectively a straight O(N)-time-for-N-instructions shot that ends in either another call (which either adds a stack frame or reuses the current one) or a return. This gives the VM the opportunity to act like a pre-emptive machine, while gaining the advantages of cooperative multitasking.
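That "every function ends in a call or a return" shape can be sketched with a trampoline. This is a toy model in Python (which has no tail-call elimination itself); `countdown` and `run` are made-up names, and the tuple protocol stands in for BEAM's call/return instructions:

```python
# Toy model, purely illustrative: if every loop is a tail call, a function
# body is a straight run of instructions ending in either a call or a
# return -- so the driver loop regains control at every call boundary.

def countdown(n, acc):
    # One straight shot: do a little work, then end in a "tail call" or a return.
    if n == 0:
        return ("return", acc)
    return ("call", countdown, (n - 1, acc + n))

def run(fn, args):
    """Trampoline driver: every bounce is a point where a VM could reschedule."""
    while True:
        kind, *rest = fn(*args)
        if kind == "return":
            return rest[0]
        fn, args = rest  # reuse the "stack frame": no stack growth

print(run(countdown, (5, 0)))  # sums 5+4+3+2+1 -> 15
```

Each bounce through `run` is a well-defined point where a real VM could swap in another process, which is exactly the property the scheduler exploits.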
In cooperatively multitasked VMs, where coroutines have to explicitly "yield" to pass the baton, you get a huge advantage: since the instruction-set architecture can be designed to ensure that memory is in a well-defined state whenever you yield, you don't have to do the expensive context-switch thing that pre-emptive architectures do: stashing and unstashing registers, switching out memory descriptors, etc. It can literally just be a jmp instruction.
BEAM is basically a cooperatively-multitasked VM, except that every "call" instruction is also an implicit "yield". (More specifically, it's a "yield if this process has executed >= 2000 call instructions since it received control.") So you get everything nice about cooperative multitasking (you never decode only 0.8 of an audio frame before being interrupted), and everything nice about pre-emptive multitasking (nothing can hog the processor forever[1]), together; thus, soft real-time.
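A minimal sketch of that budget-based scheme, again as a Python toy rather than anything resembling BEAM internals: generators stand in for processes, each `yield` stands in for a call instruction, and `BUDGET` stands in for the ~2000-reduction slice (shrunk here so the round-robin is visible):

```python
# Toy scheduler, illustrative only: processes are generators that yield at
# every "call"; the scheduler charges each yield against a per-slice budget
# and requeues the process when the budget runs out.
from collections import deque

BUDGET = 4  # tiny stand-in for BEAM's ~2000-reduction slice

def proc(name, steps):
    for i in range(steps):
        # Each yield marks a call boundary -- the only pre-emption points.
        yield f"{name}:{i}"

def schedule(procs):
    run_queue = deque(procs)
    trace = []
    while run_queue:
        p = run_queue.popleft()
        for _ in range(BUDGET):      # spend this process's budget
            try:
                trace.append(next(p))
            except StopIteration:
                break                # process finished: drop it
        else:
            run_queue.append(p)      # budget exhausted: requeue at the back
    return trace

print(schedule([proc("a", 6), proc("b", 6)]))
# -> ['a:0', 'a:1', 'a:2', 'a:3', 'b:0', 'b:1', 'b:2', 'b:3', 'a:4', 'a:5', 'b:4', 'b:5']
```

No process can hog the scheduler, yet nothing is ever interrupted mid-"instruction" -- only at a yield point, where its state is well-defined.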
Another place where this design comes into play is hot code loading. The design of the loader itself is pretty simple: modules are kept in a hash-table, keyed by name. A module is referenced by address on the stack and in loaded (threaded) bytecode, and by key in unloaded (abstract) bytecode. When you upgrade a module, you just replace the value under that key in the module dictionary. Code that's already running and making only local calls within its own module stays in the previous version; anything that jumps into the module from outside--or code from the module that makes a "remote" (fully-qualified) call back into it--lands in the new version.
This would be a lot harder and more complex if we didn't have the guarantee that every looping construct in Erlang is implemented in terms of tail-calls. Since we do, we don't have to worry about "interrupting" a process to upgrade it while it still has dirty state; we can just wait around, every process will yield after a few microseconds, and we can upgrade it then, when its state is well-defined.
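The name-keyed dictionary trick can be shown in miniature. Another illustrative Python toy (the names `load` and `remote_call` are invented for the sketch; real BEAM additionally keeps the old version alive for processes still executing inside it):

```python
# Toy model of BEAM-style code swapping: a module dictionary keyed by name.
# "Remote" (fully-qualified) calls look the module up by name on every call,
# so callers pick up the new version as soon as it's swapped in.
modules = {}

def load(name, funs):
    modules[name] = funs          # atomically replace the whole module

def remote_call(module, fun, *args):
    return modules[module][fun](*args)   # resolved by key at call time

load("counter", {"bump": lambda n: n + 1})
old = remote_call("counter", "bump", 1)      # old version: +1
load("counter", {"bump": lambda n: n + 10})  # hot upgrade
new = remote_call("counter", "bump", 1)      # new version: +10
print(old, new)  # 2 11
```

The upgrade is just a dictionary write; nothing already holding a direct address into the old code is disturbed, which is why the local-call/remote-call distinction matters.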
---
[1] --unless you call into C, and that C code does something that takes a million years. This is why people tell you not to do CPU-bound things in Erlang. The semantics of BEAM's instruction-set architecture (and therefore its non-optimal speed) are essential to how it multitasks; you have to insert explicit process-yielding checks into your C code if you want it to be "non-blocking." Or you can write your C code as a "port", which means letting the OS manage it as a separate process--which, if you're writing one global matrix-transformer process or one per-MMO-zone physics-engine process, isn't that bad; but if you're writing a per-user speech-to-text analyzer, you've handed your OS the job of managing 100k real native processes. Better to just write it in plain Erlang, let the VM do its concurrency thing, take the speed hit, and scale horizontally a bit sooner than you would have had to with C. (Erlang is great at scaling horizontally.)
Curiously enough, all of the features you've mentioned are doable on the JVM, even without TCO support (I know this because I've worked on implementing them in https://github.com/puniverse/quasar – well, except for the "budget-based yield" but that's a minor change). The general idea is that instead of providing the implementation baked into the VM, you inject it into the compiled code using bytecode instrumentation (at load or compile time).
What truly separates BEAM from the JVM is the almost total process isolation, particularly, as you've mentioned, in the case of memory allocation and reclamation. This difference, however, entails tradeoffs – sometimes you'd want the one and sometimes the other.
Not so curious; this is the true meaning of Turing-completeness at work (not the pop meaning programmers use in reference to programming languages). You can always write a virtual machine with semantics Y that executes on top of another Turing-complete abstract machine with semantics X and reads X machine-code--and thus get semantics Y on machine X.
What you've effectively done is just skip the naive VM-emulator step and move straight to the optimization of dynamic recompilation, where the state-transitions and additional semantics migrate from the X interpreter into the chunks of X machine-code it would have operated on. You've still implemented a Y VM; it's just distributed throughout the code output by the compiler.
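One way to picture that "distributed VM" idea in miniature: instead of an interpreter checking a budget around every instruction, rewrite each function so it carries its own scheduler hook. A toy Python sketch, where a decorator stands in for bytecode instrumentation (all names here are invented for illustration):

```python
# Illustrative sketch: the Y-machine's yield points are injected into the
# X code itself, rather than living in a central interpreter loop.
yield_points = []

def instrument(fn):
    """Stand-in for load-time bytecode instrumentation."""
    def wrapped(*args):
        yield_points.append(fn.__name__)  # a scheduler hook at every call
        return fn(*args)
    return wrapped

@instrument
def add(a, b):
    return a + b

@instrument
def double(x):
    return add(x, x)

print(double(21), yield_points)  # 42 ['double', 'add']
```

The program computes exactly what it did before, but every call site is now also a place where a scheduler could take control--the VM's semantics, smeared across the compiled output.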
[For the same reason, I'm planning to port BEAM to asm.js. Why? Because pre-emptive concurrency is just an abstract-machine semantic, and you can get it from a non-pre-emptively-concurrent platform using the exact logic above. No more callbacks! (If everything that uses asm.js has Web Worker support, though, the workers could be used as run queues, à la BEAM's SMP support, leaving the UI thread a lot less stressed.)]
Well, yes, but the result still benefits from the JVM's awesome performance with regard to optimizations and memory management, and its terrific monitoring and management tools. So, true, you lose some of BEAM's process isolation, but you gain performance and tooling.
So I don't know if the JVM is universal in the sense that it can provide an efficient implementation for all known or future languages, but it can certainly serve as a very good Erlang VM.
Actually, JavaScript is a bigger challenge for the JVM than Erlang is. In Erlang there's no dynamic dispatch, while in JavaScript you have nothing but. So a different kind of JIT might be better suited to JavaScript.
> I'm planning to port BEAM to asm.js
Now that is a very interesting idea. Wouldn't mind keeping an eye on it if it's going to be open source!