noxa 15 hours ago

Neat! As someone working in this space who feels like I've been taking crazy pills over how these "duh, CPUs solved this 30 years ago" things keep slipping by, it's great to see more people bridging the gap! Unfortunately, CUDA/HIP (and the entire stack beneath them) virtual memory management ops are very expensive host APIs: remapping a big block of pages can be O(n^2) in page count, fully synchronize host and device (forced wait idle), take kernel locks, etc., so it hasn't been viable in all cases. If your workloads are submit/wait with the host in the loop, the VM tricks are OK, but if you are trying to never block the GPU (pipeline depth > 0) you really want to avoid anything that modifies page tables (until we get GPUs that can pipeline those). vkQueueBindSparse is one of the few async APIs I've seen, and CUDA has cuMemMapArrayAsync, but I haven't used it yet (because arrays are annoying and, without being able to inspect the driver, I'm sure it's probably doing the wrong thing).

I've had good luck with indirection tables used during lookup inside of the kernels consuming/producing the kvcache data - it's essentially user-mode remapping like they do here: you can publish a buffer offset table and threads are uniform, have coalesced reads to the table, and cache the offsets no problem. You have the same memory locality issues as VM (contiguous virtual but potentially random physical) but are not limited to device page sizes and since you can update while work is in-flight you can be much more aggressive about reuse and offload (enqueue DMA to cold storage to evict from VRAM, enqueue DMA to copy from cold memory into reused VRAM, enqueue offset table update, enqueue work using them, repeat - all without host synchronization). You can also defrag in-flight if you do want to try to restore the physical locality. It's nothing crazy and fairly normal in CPU land (or even classic virtual texturing), but in ML GPU land I could write a big paper on it and call it SuperDuperFancyAttention4 and publish press releases...
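
A minimal host-side sketch of the offset-table idea described above, in plain Python rather than kernel code. Everything here is hypothetical and for illustration only: `BLOCK_TOKENS`, `BlockPool`, `Sequence`, and `move_block` are made-up names, and a Python list stands in for VRAM. It is not how kvcached or any particular engine implements this.

```python
BLOCK_TOKENS = 4  # tokens per physical block (hypothetical size, not tied to device pages)


class BlockPool:
    """Flat backing store standing in for VRAM, carved into fixed-size blocks."""

    def __init__(self, num_blocks):
        self.storage = [None] * (num_blocks * BLOCK_TOKENS)
        # Reversed so that pop() hands out block 0 first, then 1, and so on.
        self.free = list(range(num_blocks - 1, -1, -1))

    def alloc(self):
        return self.free.pop()

    def release(self, block):
        self.free.append(block)


class Sequence:
    """One logically contiguous KV sequence backed by scattered physical blocks."""

    def __init__(self, pool):
        self.pool = pool
        self.table = []  # offset table: logical block index -> physical block index
        self.length = 0

    def _slot(self, pos):
        # The indirection a consuming kernel would perform on every lookup.
        block = self.table[pos // BLOCK_TOKENS]
        return block * BLOCK_TOKENS + pos % BLOCK_TOKENS

    def append(self, value):
        if self.length % BLOCK_TOKENS == 0:
            self.table.append(self.pool.alloc())  # grow by one block on demand
        self.pool.storage[self._slot(self.length)] = value
        self.length += 1

    def read(self, pos):
        if not 0 <= pos < self.length:
            raise IndexError(pos)
        return self.pool.storage[self._slot(pos)]

    def move_block(self, logical, new_block):
        """Relocate one block (defrag or reuse): copy the data, then publish the new entry."""
        old = self.table[logical]
        src, dst = old * BLOCK_TOKENS, new_block * BLOCK_TOKENS
        self.pool.storage[dst:dst + BLOCK_TOKENS] = self.pool.storage[src:src + BLOCK_TOKENS]
        self.table[logical] = new_block
        self.pool.release(old)


pool = BlockPool(num_blocks=8)
seq_a, seq_b = Sequence(pool), Sequence(pool)
for i in range(6):  # interleaved growth scatters each sequence across the pool
    seq_a.append(("a", i))
    seq_b.append(("b", i))

before = [seq_a.read(i) for i in range(seq_a.length)]
seq_a.move_block(1, pool.alloc())  # relocate seq_a's second block
after = [seq_a.read(i) for i in range(seq_a.length)]
```

Readers only ever go through the table, so relocating a block changes one table entry and nothing else. On a GPU the copy and the table update would be enqueued on a stream in order with the work that consumes them, which this host-only sketch does not model.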

  • ivanium 6 hours ago

    (Disclaimer: I am one of the authors of the project) Thank you for the thoughtful and insightful comment. I really love the depth of your first paragraph. You highlighted a concern in this space that is often overlooked, and I am glad you raised it. We spent a significant amount of time dealing with the cost of dynamic GPU memory operations.

    One useful observation is that LLM inference has almost no host API calls during steady state, since the GPU must stay busy with continuous kernel launches or CUDA graph replay. You are absolutely right that CUDA and HIP virtual memory operations are expensive on the host side and involve heavy driver work. However, they introduce only small stalls in the GPU pipeline, because most of the cost is paid on the host. These operations are also relatively infrequent compared to kernel launches in practice, so we offload them to a background thread to keep them off the critical path. The APIs are not cheap in general, but they happen to fit LLM inference surprisingly well.

    On your second point, I think I follow your idea, although please correct me if I misunderstood. Virtual memory does open the door to paging and offloading, which is also important for LLM systems. We are actively working on this direction in kvcached. Your defragmentation point also reminds me of classic techniques such as compaction and garbage collection. They could certainly help, though the trade-off between benefit and complexity would need more careful evaluation.

    Thank you again for the thoughtful analysis. It was a pleasure to read. I would be happy to continue the discussion.

CharlesW 16 hours ago

Actual title: "Solve the GPU Cost Crisis with kvcached: A library to enable virtualized, elastic KV cache for LLM serving on shared GPUs"

jewel 15 hours ago

In my imagination, I thought that the large GPU clusters were dynamically allocating whole machines to different tasks depending on load.

So, hypothetically, if ChatGPT's peak load and their minimum load were a 3× ratio, they'd reallocate 2/3 of their servers to training when it's not peak time.

Doing the same thing inside an individual GPU seems irrelevant to anyone operating at scale when they can approximate the same behavior with entire servers or even entire racks.

  • Jrxing 10 hours ago

    Sharing a big GPU cluster with non-latency-critical load is one solution we also explored.

    For this work, we are focusing more on the problem of smaller models running on SOTA GPUs. Distilled/fine-tuned small models have shown comparable performance in vertical tasks.

BergAndCo 15 hours ago

AI-written paper posted by JiaRong Xing

Username is Jrxing

"GPU OS" turns out to be just more LLM spam

  • Jrxing 10 hours ago

    Hi, thanks for digging out who I am. Yes, I am the author of the blog and the project.

    We polished the blog for several days. I didn't get how you could conclude that this is AI generated. Is it too good to be human written?

    • anonymous908213 8 hours ago

      My impression was that it was likely LLM-written, human-reviewed. Due to a lack of knowledge on the subject/field, I can't comment on the substance of the technical details, which often reveal the shortcomings of LLM blabble, but the writing style certainly comes across as that of an LLM.

      Most evidently, the incoherent usage of bold text littered constantly throughout the article, together with the infamous and poorly used em-dash spam. This snippet stood out to me particularly badly, as this does not seem like a case where even one of those odd humans who love em-dashes would use one:

      "You might have heard that PagedAttention manages the KV cache using memory pages, which significantly improves memory utilization. That’s true—*but only within a single application.*"

      Then you get lines like this one, which combine both random bold text and the em-dash with my most-hated LLMism, "it's not just X, but Y":

      "The history of CPU systems shows that *efficiency is not just a hardware problem—it’s also a system design problem.*"

      The introductory paragraph also has this (yet again, randomly bolded) LLM sensationalization that a human technical writer would be thoroughly embarrassed to have associated with their writing:

      "Behind the $300 billion projected spend on GPU hardware in 2025 lies *a dark truth*: much of this expensive hardware sits *vastly underutilized.*"

      Not to mention it's repeated...

      "Yet behind the headlines of record spending lies a *quieter story*: much of this expensive hardware sits *vastly underutilized.*"

      Your response of "is it too good to be human written" certainly doesn't restore confidence, notwithstanding the lack of humility that would be required to say that about what is allegedly your own writing. LLM writing is visible because it is awful, if you have any comprehension for what good writing looks like. The idea that LLM writing could possibly be "too good" is a truly despairing belief for someone to hold, because it means they themselves have so little understanding of good writing that they think an LLM can output good writing.

      I almost wanted to give you a pass for having an LLM write an English article for you, since your response hints that English is not your native language ("I didn't get how you could conclude" is a very ESL-like mistaken tense). But you apparently have a Ph.D. and are working as a professor. I'm not familiar with academic standards these days, but is it really accepted to be claiming LLM output as your own writing...?

nisten 15 hours ago

[flagged]