ForeverVM: Run AI-generated code in stateful sandboxes that run forever
forevervm.com

Hey HN!
We started Jamsocket a few years ago as a way to run ephemeral servers that last for as long as a WebSocket connection. We sandboxed those servers, so with the rise of LLMs we started to see people use them for arbitrary code execution.
While this works, it was clunkier than we wanted from a first-principles code execution product. We built ForeverVM from scratch to be that product.
In particular, it felt clunky for app developers to have to think about sandboxes starting and stopping, so the core tenet of ForeverVM is using memory snapshotting to create the abstraction of a Python REPL that lives forever.
When you go to our site, you're given a live Python REPL; try it out!
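To make that concrete, the abstraction is just an ordinary Python session whose state survives arbitrarily long gaps. Here's a conceptual sketch (plain REPL input, not ForeverVM's actual API):

    # Conceptual sketch, not ForeverVM's API: ordinary REPL inputs. The point
    # is that state defined in one exchange is still there much later, even
    # though the VM was snapshotted to disk in between.
    >>> import random
    >>> dataset = [random.random() for _ in range(1_000_000)]  # built once
    >>> running_total = sum(dataset)

    # ...hours or days later, on the same machine, with no re-initialization:
    >>> running_total / len(dataset)  # dataset and running_total still exist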
Is it possible to run Cython code with this as well? Since you can run a setup.py script, could you compile Cython and run it?
Looking at the docs, it seems only suited for interpreted code, but I’d be interested to know if this is feasible, or almost feasible with a little work.
We are working now on support for arbitrary imports of public packages from PyPI, which will include Cython support. Soon after that we'll work on a way to provide proprietary packages (including Cython).
Where did you see mention of a setup.py script? I couldn't find that in their docs. From what I saw, they only support using a long-lived REPL.
Why would you want ever-growing memory usage for your Python environment?
Since LLM context is limited, at some point the LLM will forget what was defined at the beginning, so you will need to reset it or remind the LLM what's in memory.
Fun fact: this is very similar to how Smalltalk works. Instead of storing source code as text on disk, it only stores the compiled representation as a frozen VM. Using introspection, you can still find all of the live classes/methods/variables. Is this the best way to build applications? Almost assuredly not. But it does make for an interesting learning environment, which seems in line with what this project is, too.
You're right that LLM context is the limiting factor here, and we generally don't expect machines to be used across different LLM contexts (though there is nothing stopping you).
The utility here is mostly that you're not paying for compute/memory when you're not actively running a command. The "forever" aspect is a side effect of that architecture, but it also means you can freeze/resume a session later in time just as you can freeze/resume the LLM session that "owns" it.
It's the other way around: it swaps idle sessions to disk so that they don't consume memory. From what I read, "traditional" code interpreters apparently keep sessions in memory, and if a session is idle it expires. This one writes it to disk instead, so that if the user comes back after a month it's still there.
Why/when does someone want to use this?
It's probably nice to have whenever you're using an LLM that doesn't have a code interpreter, like Claude. It can probably use code execution as a reality check.
Yes, I've found that with just the MCP server installed, when I ask a question about Python, Claude becomes eager to check its work before answering (Claude does have a built-in analysis tool, but it only runs JavaScript).
Good question, we’ll add some info to the page for this.
LLMs are generally quite good at writing code, so attaching a Python REPL gives them extra abilities. For example, I was able to use a version with boto3 to answer questions about an AWS cluster that took multiple API calls.
LLMs are also good at using a code execution environment for data analysis.
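To make that concrete, here's a rough sketch of the kind of multi-call boto3 snippet the LLM ends up running in the REPL (it assumes AWS credentials are already available in the environment, and the summary format is just illustrative):

    # Sketch of a multi-step AWS query an LLM might run in the REPL.
    # Assumes AWS credentials are already configured in the environment.
    import boto3

    ec2 = boto3.client("ec2")
    reservations = ec2.describe_instances()["Reservations"]
    instances = [i for r in reservations for i in r["Instances"]]

    # Summarize instance counts by type across all reservations.
    by_type = {}
    for inst in instances:
        by_type[inst["InstanceType"]] = by_type.get(inst["InstanceType"], 0) + 1

    print(f"{len(instances)} instances total")
    for itype, count in sorted(by_type.items()):
        print(f"  {itype}: {count}")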
Is it possible to reuse the same paused VM multiple times from the same snapshot?
It's not exposed in the API yet, but it's very possible with the architecture, and it's something we plan to expose. I am curious if you have a use case for that, because I've been looking for use cases! Being able to fork the chat and try different things in parallel is the motivating use case in my mind, but I'm sure there are others.
The obvious use-case (to me) is to create an agent that relies on an interpreter with a bunch of pre-loaded state that's already been set up exactly a certain way — where that state would require a lot of initial CPU time (resulting in seconds/minutes of additional time-to-first-response latency), if it was something that had to run as an "on boot" step on each agent invocation.
Compare/contrast: the Smalltalk software distribution model, where rather than shipping a VM + a bunch of code that gets bootstrapped into that VM every time you run it, you ship an application (or more like, a virtual appliance) as a VM with a snapshot process-memory image wherein the VM has already preloaded that code [and its runtime!] and is "fully ready" to execute that code with no further work. (Or maybe — in the case of server software — it's already executing that code!)
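As a rough sketch of that pre-loaded-state idea (the library, file path, and lookup structure here are all illustrative; the point is the expensive bootstrap you only want to pay for once):

    # Hypothetical one-time bootstrap you would want to pay for once, snapshot,
    # and then never re-run on each agent invocation. Path and data are
    # illustrative.
    import pandas as pd

    df = pd.read_parquet("/data/events.parquet")            # minutes of load time
    index = {row.user_id: row for row in df.itertuples()}   # precomputed lookup

    # With a reusable snapshot, every forked agent session would start here,
    # with df and index already live in memory.
    def lookup_user(user_id):
        return index.get(user_id)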
Check out why Together.AI acquired CodeSandbox.
Disclosure: I’m an investor in Jamsocket, the company behind this… but I’d be remiss if I didn’t say that every time Paul and Taylor launch something they have been working on, I end up saying “woah.” In particular, using ForeverVM with Claude is so fun.
May I ask how you got the opportunity to invest in this company? If you are a VC, that makes sense; I'm just wondering how normies can get access to invest in companies they believe in. Thanks
If you're an accredited investor (make sure you meet the financial criteria) you can cold email seed/pre-seed stage companies. These companies typically raise on SAFEs and may have low minimum investments (say $5k or $10k).
YC lists all their companies here: https://www.ycombinator.com/companies.
Many companies are likely happy to take your small check if you are a nice person and can be even minimally helpful to them. Note that for YC companies you'll probably have to swallow the pill of a $20M valuation or so.
I do indeed work in VC. But as another reply mentions, any accredited investor can write small checks into startups, and most preseed/seed founders are happy to take angel checks.
It’s trivial to build something that does what this describes. I’m sure there’s more to it, but based on the description, the pieces are already there under permissive open-source licenses.
For a clean implementation I’d look at socket-activated rootless podman with a wasi-sdk build of Python.
Kind of how with Dropbox "you can already build such a system yourself quite trivially by getting an FTP account, mounting it locally with curlftpfs, and then using SVN or CVS on the mounted filesystem"?
(cf. https://news.ycombinator.com/item?id=9224)
It was an afternoon to prototype, followed by a lot of work to make it scale to the point of giving everyone who lands from HN a live CPython process ;)
This is the sort of thing that would touch a lot of my data, so I'd much prefer to have it self-hosted. But you mention Claude rather than DeepSeek or Mistral, so know your audience, I guess.
Fair enough. Our audience is businesses rather than consumers, so our equivalent of self-hosting is that we can run it in a customer's cloud.
We mention Claude a lot because it is a good general coding model, but this works with any LLM trained for tool calling. Lately I've been using it just as much with Gemini 2.0 Flash, via Codename Goose.
What has AI got to do with this? It's in the headline but I don't see why.
The API could be used for non-AI use cases if you wanted to, but it’s built to be integrated with an LLM through tool calling. We provide an MCP (Model Context Protocol, for integration with Claude, Cursor, Windsurf, etc.) server.
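For anyone unfamiliar with tool calling, here's a minimal sketch of what exposing a code-execution tool to an LLM looks like (an OpenAI-style function schema; the tool name and fields are illustrative, not ForeverVM's actual MCP schema):

    # Minimal sketch of a code-execution tool definition for a tool-calling LLM.
    # The name and description are illustrative, not ForeverVM's MCP schema.
    run_python_tool = {
        "type": "function",
        "function": {
            "name": "run_python",
            "description": "Execute Python in a persistent REPL and return its output.",
            "parameters": {
                "type": "object",
                "properties": {
                    "code": {"type": "string", "description": "Python source to execute"},
                },
                "required": ["code"],
            },
        },
    }

    # The model emits a tool call containing `code`, the server runs it in the
    # REPL, and the output goes back into the conversation as the tool result.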
You might have noticed that ChatGPT (and others) will sometimes run Python code to do calculations. My understanding is that this will enable the same thing in other environments, like Cursor, Continue, or aider.
Also, those code interpreters usually can't make external network requests; being able to do so adds a lot of capabilities, like pulling in some data and then analyzing it.
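A tiny sketch of that pull-then-analyze pattern (the endpoint is a real public GitHub API URL, used here only as an example of an outbound request):

    # Sketch of "pull data over the network, then analyze it", which a sandbox
    # without egress can't do. The endpoint is just an example request.
    import json
    from urllib.request import urlopen

    with urlopen("https://api.github.com/repos/python/cpython") as resp:
        repo = json.load(resp)

    print(repo["full_name"])
    print("stars:", repo["stargazers_count"])
    print("open issues:", repo["open_issues_count"])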