Running local models on an M4 with 24GB memory

578 points by shintoist 1 month ago

soganess 1 month ago

Getting so close to good!

I consider Gemma 4 31B (dense / no MoE) the new baseline for local models. It's obviously worse than the frontier models, but it feels less like a science experiment than any previous local model I’ve run, including GPT OSS 120B and Nemotron Super 120B.

On my M5 Max with 128 GB of RAM and the full 256K context window, I see RAM use spike to about 70 GB, with something like 14 GB of system overhead. A 64 GB Panther Lake machine with the full Arc B390, or a 48 GB Snapdragon X2 Elite machine, could probably run it with a 128K to 256K context window. Maybe you can squeeze it into 32GB (27.5GB usable) with a 32K context window?
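
For anyone who wants to sanity-check numbers like these, here is a rough back-of-the-envelope sketch. The layer count, KV head count and head dimension are hypothetical placeholders, not Gemma 4's real architecture, and it ignores runtime overhead and any attention tricks that shrink the cache, so treat the output as illustrative only.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9


def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_value: float) -> float:
    """Approximate KV cache size: one K and one V vector per layer per token."""
    values = 2 * layers * kv_heads * head_dim * context_tokens
    return values * bytes_per_value / 1e9


# Hypothetical architecture numbers, for illustration only.
LAYERS, KV_HEADS, HEAD_DIM = 60, 8, 128

w = weights_gb(31, 4.5)  # 31B params at roughly 4.5 bits per weight (assumed)
kv = kv_cache_gb(LAYERS, KV_HEADS, HEAD_DIM, 32 * 1024, 1)  # 32K context, 8-bit cache
print(f"weights ~{w:.1f} GB, KV cache ~{kv:.1f} GB, total ~{w + kv:.1f} GB")
```

Swap in the real architecture values and your own quantization to see whether a given context window has a chance of fitting.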

Even last year, seeing this kinda performance on a mainstream-ish/plus configuration would have seemed like a pipe dream.

discordance 1 month ago

Could you please share your time to first token and tok/s?
- ls612 1 month ago
  
  I’m on an M2 Max and get 10 tok/s with Gemma 4 8bit MLX
- isomorphic 1 month ago
  
  M4 Pro 64GB (14 CPU / 20 GPU), Gemma 4 31B Q4_K_M GGUF, LM Studio: time to first token 0.92s, 11.56 tokens/s.
  Edit: For comparison with the other poster, same setup as above, but with Gemma 4 31B Instruct 8bit MLX (not sure if exactly the same model): time to first token 4.62s, 7.20 tokens/s; with a different prompt, 1.17s and 7.24 tokens/s.
  
  zozbot234 1 month ago
  
  Could you (or anyone with the same hardware) try antirez's ds4 and report how gracefully it degrades with only the 64GB RAM? Obviously it's going to be dog slow at best for any single inference flow, but can you meaningfully improve on that by running many sessions in parallel? (Ideally you'd need roughly on the order of model sparsity in order to get meaningful sharing of MoE weights, but whether that's genuinely achievable is anyone's guess!)
thot_experiment 1 month ago

Gemma 4 IS good, I've literally had it get a thing right that Opus 4.7 missed, the edges are ragged and I'm reliably finding use cases where it's basically equivalent. Ultimately the metric is "what can I RELY on it to do". Opus definitely knows a lot more and can sometimes do much more complex tasks, but especially when you're good about feeding the context Gemma is amazing. The difference between the sets of things I trust the two models to do is surprisingly small. I've had some insanely good runs recently working on my personal tooling as well as random projects. It's the first local model that can reliably be left to implement features in agentic mode on non-trivial projects.
https://thot-experiment.github.io/gradient-gemma4-31b/
This is a relatively complex piece of tooling built entirely by Gemma 4 inside OpenCode where I manually intervened maybe only 4 times over the course of a few hours.
running Q6_K_XL, 128k context @ q8 ~ 800tok/s read 16tok/sec write
eagerly awaiting turboquant and MTP in llama.cpp, should take me to 256k and 25-30tok/s if the rumors are true
- thot_experiment 1 month ago
  
  Re-posting this from a buried comment for visibility because it's just so fucking impressive to me.
  I went to the store to buy mixers and while I was out Gemma 4 31b got pretty far along with reverse engineering the bluetooth protocol of a desk thermometer I have. I forgot to turn on the web search tool, so it just went at it, writing more and more specific diagnostic logging/probing tools over the course of like 8 turns. It connected to the thermometer, scanned the characteristics and had made a dump of the bluetooth notification data. When I got back it was theorizing about how the data might be encoded in the bluetooth characteristics and it got into an infinite loop. (local models aren't perfect and i never said they were) I turned on the websearch tool and told it to "pick up the project where it left off", it read the directory, did a couple googles and had a working script to print temperature, humidity and battery state in like 3 turns. Reading back through its chain of thought I'm pretty sure it would have been able to get it eventually without googling.
  idk, I thought I was a cool and smart engineer type for being able to do stuff like this, if my GPUs being able to do this more or less unsupervised isn't impressive I guess fuck me lol.
  
  AntiUSAbah 1 month ago
  
  It definitely is, and just a few years ago it was unheard of.
  And we progress on so many different frontiers in parallel: Agent harness, Agent model, hardware etc.
  
  hparadiz 1 month ago
  
  A technology indistinguishable from magic.
  
  AdamConwayIE 1 month ago
  
  Had a very similar experience recently.
  Built a basic authentication handler for this test just so it wouldn't be in the training data of either model. It had deliberately planted bugs. One was a hardcoded secret, another was a wrap-on-0xFFFFFFFF bug as a result of a malloc(length+1).
  Qwen 3.6 found both, alongside two other issues I hadn't even considered, and the location of the magic value. GPT-5.4, though, missed the malloc issue (flagging memory exhaustion as the only risk), it missed a separate timing bug (it explicitly said the function was safe), and it hallucinated the location of the magic value. Qwen correctly identified the integer overflow. GPT-5.4 did not.
  I then compared basic research between them using SearXNG for web search. For example, the current status of MTP in llama.cpp. Qwen 3.6 27B found the current PR, but flagged a related issue that shows the current implementation can be slower than just using a draft model right now. GPT-5.5 Thinking found the same PR, but didn't flag the downsides.
  In a similar comparison, I asked both models how I should get started with ESPHome as a total beginner. ChatGPT suggested an ESP32-S3 and a BME280, which is... just not a good idea. It also talked about the ESP32-P4 not having Wi-Fi, and installing with HA or Docker. Meanwhile, Qwen3.6 27B said regular ESP32, DHT22, and mentioned HA, Docker, and pip as installation methods. While GPT was good, it was just throwing out jargon for a prompt that explicitly requested it for a beginner.
  It kind of blew my mind that in all three of these, Qwen landed it better.
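
  For anyone who hasn't hit this class of bug: assuming `length` is a 32-bit unsigned value (my reading of the description above, the actual handler code isn't shown), `length + 1` wraps to 0 when `length` is 0xFFFFFFFF, so the allocation ends up tiny while later code still believes it has `length` bytes. A small Python sketch that emulates the 32-bit arithmetic; the function names are made up for illustration:

```python
UINT32_MAX = 0xFFFFFFFF


def alloc_size_u32(length: int) -> int:
    """Emulate `length + 1` computed in 32-bit unsigned arithmetic."""
    return (length + 1) & UINT32_MAX


def checked_alloc_size(length: int) -> int:
    """Same computation, but reject the value that would wrap."""
    if not 0 <= length < UINT32_MAX:
        raise OverflowError("length + 1 does not fit in 32 bits")
    return length + 1


print(alloc_size_u32(16))          # ordinary case: 17
print(alloc_size_u32(UINT32_MAX))  # wraps around to 0
```

  The checked variant is the usual shape of the fix: validate the length before doing arithmetic on it.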
plufz 1 month ago

Does gemma work better than qwen3 in your experience?
- 2ndorderthought 1 month ago
  
  Not in mine. I see a lot of people talking about Gemma on here but in my circles pretty much everyone else is running qwen.
pdyc 1 month ago

i use a smaller model, gemma e2b, for most of my editing and it works surprisingly well. Workflow is planning with sota models and execution via small models. If you plan properly and don't leave ambiguity for the smaller model, it works well.
- 2ndorderthought 1 month ago
  
  Out of curiosity have you tried other small models? The e2b for me was unusable. Llama3.2 3b was better and that thing is a year old and I rarely use it now too.
  
  pdyc 1 month ago
  
  yes i keep on trying small models, i have also tried qwen 3.5 0.8B, 2B, 4b and gemma4 e4B models but they either did not work reliably (thinking loops, issues following instructions) or there were performance issues (prompt speed, tg speed, too much ram). e2b was the sweet spot where i could give it a plan and it could edit files properly.
  
  2ndorderthought 1 month ago
  
  That makes sense, it sounds like your computer isn't super powerful. Whatever works for you.
  
  Melatonic 1 month ago
  
  How did e2b compare to e4b ?
  
  pdyc 1 month ago
  
  i did not see much improvement for my use case i.e. file editing tasks but with e4b tg/s is lower so i stick with e2b.
gertlabs 1 month ago

The small Qwen 3.6 models handle context a little better than Gemma 4, but Gemma 4 26B in particular has such small and efficient solutions which are really smart for its weight class. I was so impressed with its performance in our benchmark upon release that I wrote a blog post about it [0], although its position on the leaderboard later fell a bit as we ran it in more long context agentic coding environments.
[0] https://gertlabs.com/blog/gemma-4-economics
- spwa4 1 month ago
  
  Here's a great explanation why:
  https://www.youtube.com/watch?v=_A367W_qvc8
  Google's messing with the context. LOTS of speed for a little worse long-context performance.
alfiedotwtf 1 month ago

What's your opinion with Gemma 4 vs Qwen3.6?
prettyblocks 1 month ago

It's great, but I wish I could use these things without it feeling like my laptop is going to melt through the desk.

quacker 1 month ago

I could have used this article before I spent the weekend arriving at the same conclusion!

Same laptop, and my contrived test was having it fix 50 or so lint errors in a small vibe-coded C++ repo. I wanted it to be able to handle a bunch of small tasks without getting stuck too often.

GPT OSS 20B was usable but slow, and actually frequently made mistakes like adding or duplicating statements unnecessarily, listing things as fixed without editing the code, and so on.

Qwen 3.5 9B with Opencode was much faster and actually able to work through a majority of the lint warnings without getting stuck, even through compaction and it fixed every warning with a correct edit.

I tried 4bit MLX quants of Qwen 3.5 9B but it eventually would crash due to insufficient memory. I switched to GGUF, which I run with llama.cpp, and it runs without crashing.

It is absolutely not comparable to frontier models. It’s way slower and gets basic info wrong and really can’t handle non trivial tasks in one go. I asked it for an architecture summary of the project and it claimed use of a library that isn’t present anywhere in the repo. So YMMV, but it’s still nice to have and hopefully the local LLM story can get much better on modest hardware over time.

solenoid0937 1 month ago

> It is absolutely not comparable to frontier models.
This is not said often enough.
Yes, local LLMs are great! But reading most HN posts on the subject, you'd think they're within reach of Opus 4.7.
There is a very small, very vocal, very passionate crowd that dramatically overstates the capabilities of local LLMs on HN.
- HDBaseT 1 month ago
  
  At least in my experience, local models are very far away from models like Opus 4.7 or ChatGPT 5.5 in coding and problem solving areas.
  I find them useful in basic research and learning and question asking tasks. Although at the same time, a Wikipedia page read or a few Google searches likely could accomplish the same and has been able to for decades.
  
  darkstar_16 1 month ago
  
  I think you're doing it wrong. Use the frontier models for the research, planning etc and once you have a plan give it to a local model for implementation.
- thot_experiment 1 month ago
  
  Very different from my experience, Gemma 31b just solved a physics problem Opus 4.7 gave up on. I definitely don't think they're equivalent in general, Opus for sure is way smarter and way more likely to get things right on the edge, but it's still quite likely to get things wrong too it doesn't make it that useful for a lot of stuff. Conversely there are so many things that you would use an LLM for that they will both reliably oneshot. Especially in agentic mode where you have ground truth feedback between turns the difference gets quite small for a lot of tasks.
  That all being said I've spent hundreds (maybe thousands?) of hours on this stuff over the past few years so I don't see a lot of the rough edges. I really believe capability is there, Gemma 4 31B is a useful agent for all sorts of stuff, and anything you can reasonably expect an LLM to oneshot Qwen 3.6 35b MoE will handle at like 90tok/sec, absolutely fantastic for tasks that don't require a huge amount of precision.
  
  fg137 1 month ago
  
  Sure. Sample size = 1.
  
  thot_experiment 1 month ago
  
  It may surprise you but over thousands of hours I have actually gathered more than one sample.
  EDIT: Here's another sample for ya. I went to the store to buy mixers and while I was out Gemma 4 31b got pretty far along with reverse engineering the bluetooth protocol of a desk thermometer I have. I forgot to turn on the web search tool, so it just went at it, writing more and more specific diagnostic logging/probing tools over the course of like 8 turns. It connected to the thermometer, scanned the characteristics and had made a dump of the bluetooth notification data. When I got back it was theorizing about how the data might be encoded in the bluetooth characteristics and it got into an infinite loop. (local models aren't perfect and i never said they were) I turned on the websearch tool and told it to "pick up the project where it left off", it read the directory, did a couple googles and had a working script to print temperature, humidity and battery state in like 3 turns. Reading back through its chain of thought I'm pretty sure it would have been able to get it eventually without googling.
  idk, I thought I was a cool and smart engineer type for being able to do stuff like this, if my GPUs being able to do this more or less unsupervised isn't impressive I guess fuck me lol.
  
  K0balt 1 month ago
  
  What is your opinion on qwen 35b MoE vs qwen 27b dense?
  
  thot_experiment 1 month ago
  
  Maybe a skill issue but they both feel about the same and the MoE is 3x faster so I barely use the dense model.
  
  latable 1 month ago
  
  Not the person asked, but on a medium bug that would span a few python files, I found the MoE to be too enthusiastic, trying things without first trying to understand the issue, whereas the dense model thought hard and added debug statements to understand how to fix it. But the dense model is quite slow (Q4KM quant, MI50 32GB, llama.cpp, pi)
  
  2ndorderthought 1 month ago
  
  The models op is using are from a year ago. The big breakthroughs happened in April this past month
  
  coldtea 1 month ago
  
  If it works for me it works for me. Sample size of 1 is all I need to tell that.
  
  baq 1 month ago
  
  lots of interesting things happen in anecdotes.
- fg137 1 month ago
  
  This.
  I have seen way too many people who are overly optimistic about local LLMs.
  Having spent a decent amount of time playing with them on consumer nvidia GPUs, I understand well that they are not going to be widely usable any time soon. Unfortunately not many people share that view.
  
  close04 1 month ago
  
  Not this. Let's reframe the problem. How many years behind do you think they are? By all accounts Gemma 4 is better than a frontier model from 3 years ago. Back then we were wowed by frontier models but when the local model reaches the same performance it's no good anymore, because you moved the target?
  Relatively speaking local models might always be behind the curve compared to frontier ones. You can tell by the hardware needed to run each. But in absolute terms they're already past the performance threshold everyone praised in the past.
  Right now in a lab somewhere there's a model that's probably better than anything else. There's a ChatGPT 5.6, an Opus 4.8. Knowing that do you suddenly feel a wave of disappointment at the current frontier models?
  
  2ndorderthought 1 month ago
  
  So the cofounder of hugging face made a post about qwen 3.6 being at claude level of performance for the lols?
  When were you trying local models? The model releases from April 2026 are a serious change in performance.
  
  solenoid0937 1 month ago
  
  It's just not there yet. I have tried all the models from April, including the Gemma 4 variants.
  These are so far from Opus it's not even funny. They are not close to being in the same league. Gemma might be like a frontier model from a couple years ago, but with much worse performance in long context chats.
  
  anon373839 1 month ago
  
  Hm. I think there is a bit of a shifting goalpost dynamic at play here. Those April releases, even the fast MoE versions, are better than big cloud models from 18 months ago. I remember when everyone was gushing about Sonnet 3.7 and what a transformative experience development was using it. So was it useful or wasn’t it? A tool doesn’t lose its usability just because a better one comes along.
  To me, these small local LLMs are highly useful (and thus “usable”) even though they don’t match the output of today’s frontier models.
  
  2ndorderthought 1 month ago
  
  Completely agree. I would even shift the 18months up a bit. I have been impressed with qwen3.6
  
  2ndorderthought 1 month ago
  
  Correct they aren't opus. They are sonnet with a little hand holding. They also run on a single GPU at 40 tps.
  No one is saying a local model will give you anthropics business in a 5min download. People are saying, "hmm, maybe I should do this one locally". People are also saying "this is surprisingly good enough for me given the trade offs"
  
  fg137 1 month ago
  
  > "hmm, maybe I should do this one locally"
  If your time is worth nothing to even triage that question.
  Unless you have fanatic needs for data privacy or really don't have Internet, running local models almost certainly results in negative ROI overall.
  Not to mention that you need to have decent hardware (that is getting expensive by the day) to even have this conversation in the first place.
  People in this post talk as if everyone has a Mac with 24GB or 32GB RAM. When the reality is that most people use a Windows laptop with crappy integrated GPU.
  
  fg137 1 month ago
  
  I'll believe that when Uber deploys local models for developers and ask them to prefer local models over proper Anthropic ones.
- AntiUSAbah 1 month ago
  
  You are missing context.
  A local model is as good as a frontier model at responding in a Signal thread with you, which requires basic tool calling.
  A local model is as good as a frontier model of writing a joke.
  A local model is as good as a frontier model at responding to an email.
  Not sure what needs to be said often enough, no one without a clue would play around with a local model setup and would completely ignore frontier models and their capabilities?!
- ActorNightly 1 month ago
  
  I'm like 50% convinced that these people are paid by Apple to promote their products. Because the conversation is always just about being able to execute models (even larger ones) on mac hardware with unified memory, but nobody ever mentions that inference speed is unusably slow.
  You can have good local LLM performance through agents, but you need fast inference. Generally, 2x 3090 or at the minimum 2x 3080s (you need 2 to speed up prefill processing to build the KV cache). You just ironically need to be good at prompt engineering, which has a lot of analogues in the real world in being able to manage low-skilled people in completing tasks.
- 2ndorderthought 1 month ago
  
  The guy is running potato models!
- OtomotO 1 month ago
  
  That's totally fine and dandy as there is a very big, very vocal, very brainwashed crowd that dramatically overstates the capabilities of remote LLMs on HN as well.
layoric 1 month ago

Honestly surprised to hear that GPT OSS 20B runs slow on mac hardware. It's absolutely one of the fastest models I've run on local GPUs for its size, but only tried Nvidia cards.
Edit: TIL it is MoE and only has 3.6B active, explains a lot.
- quacker 1 month ago
  
  Yeah, I'm probably wrong there. GPT OSS 20B is certainly much faster than some other models I've tried. I actually gave GPT OSS 20B a few prompts just now and it seems to respond as fast or faster than Qwen 3.5 9B. But I needed many more prompts for GPT OSS 20B to complete my contrived task, so progress felt much slower.
2ndorderthought 1 month ago

Try qwen3.6 35b a3b not qwen3.5 9b. It's completely different.

ChrisMarshallNY 1 month ago

> The longer you let it drive without constraints, the worse the wreckage gets. The velocity makes you think you're winning right up until the moment everything collapses simultaneously.

In my experience (so far), I can’t let the LLM write too much in one go.

I need to test the hell out of what it gives me, and I can’t ask for too much, at one time.

I tend to ask it to “flesh out” functions, where I have a signature, and a detailed headerdoc comment. I will provide a lot of guidance about the context, often attaching relevant files.

Even then, it often doesn’t give me what I need, first time, unless it’s a small function, with extremely limited scope.

That said, it’s been extremely helpful. It has accelerated my development greatly.

I have found that it gives me much better PHP, than Swift.

I suspect that may be because PHP is extremely mature, and there’s millions and millions of lines of high-quality code out there, in open-source repos, while Swift is probably mostly in closed repos, with open stuff not really provided by experienced developers (it’s a proprietary language used for shipping commercial software, so that may also apply to other languages).

What it gives me in Swift, most closely resembles stuff that enthusiastic newer folks would do, and want to show off.

hugmynutus 1 month ago

> What it gives me in Swift, most closely resembles stuff that enthusiastic newer folks would do, and want to show off.
The same is true for rust-lang. Code that will immediately clone/re-allocate anything passed by reference and collect everything to the heap that is passed by `Iterator`/`IntoIterator`.
It is a massive performance anti-pattern and the hallmark of somebody "struggling" with the borrow checker. Naturally a lot of 1st & 2nd 'I just learned rust' projects lean on it. Which is totally fine for humans, you're learning. But with LLMs that pattern is now burned into their eigenvectors with the heat of a billion hours of H100 training time.
It has gotten to the point that for all code I generate with Opus or Codex, if there is an iterator or reference in the arguments, I start a fresh context, with a sort of `remove unnecessary clones, collections, and copies from the following code: {{code}}`
- krferriter 1 month ago
  
  > It has gotten to the point that for all code I generate with Opus or Codex, if there is an iterator or reference in the arguments, I start a fresh context, with a sort of `remove unnecessary clones, collections, and copies from the following code: {{code}}`
  What does it do if you put "Avoid unnecessary clones, collections, and copies" in your CLAUDE.md/AGENTS.md?
  
  hugmynutus 1 month ago
  
  It makes no difference at all.
  Edit: With Opus prior to the context nerf, it worked more often than not. Current Opus 4.7 is practically unusable.
OtomotO 1 month ago

No, there are millions upon millions of mediocre lines of code out there.
And LLMs tend to converge on mediocrity. Which is totally fine.
- ChrisMarshallNY 1 month ago
  
  I find that there’s a surprising amount of really good stuff out there.
  PHP has come of age. Actually, it’s been a backbone technology for millions of professional sites and apps for many years, and people tend to work in the open. Sort of the nature of the language.
  There’s a popular perception that PHP programmers are bad programmers, but that’s a dated point of view. Pros have been using it to make serious money, and create serious infrastructure, for many years.
  
  OtomotO 1 month ago
  
  I am not dissing PHP, I am saying that the absolute majority of any code out there is mediocre and not super good and not super bad.
  
  ChrisMarshallNY 1 month ago
  
  I get the feeling that there’s a lot more “mediocre-to-good” PHP code (probably C, as well), than Swift.
  I see terrible Swift code, frequently. It’s the kind of language that encourages “clever” approaches (as do most new languages), and these are favorites of less-experienced devs.
onlyrealcuzzo 1 month ago

> In my experience (so far), I can’t let the LLM write too much in one go.
Seconded, but I've found a cheat code to get much farther with minimal intervention.
Step 1: tell them your goal, have them generate a doc, include design principles, system invariants, and acceptance criteria.
No amount of CLAUDE.md or skills beats re-iterating the focus points directly in the prompt.
Step 2: tell them to summarize the doc (pay close attention here). Have them save it somewhere (I use docs/agents) once you're happy with it.
Step 3: tell them to build a detailed plan to meet the objectives of the doc.
Step 4: let them go wild.
Step 5: once they declare "done", feed their progress to another LLM (Gemini is quite decent for review, and free) -> mindlessly feed the feedback back to the implementing LLM.
Step 6: Say the magic words: https://github.com/cuzzo/clear/blob/master/docs/retrospectiv...
Again, I've found no amount of skills or CLAUDE.md beats slightly modifying a prompt to meet your exact goals specific to the design and what you know of the implementation so far.
Step 7: Have them rebuild a plan to address feedback.
Step 8: Let them go wild. Loop back to Step 5 until the LLMs tell you there's no major action items.
Step 9: Tell them to remove anything from the commit that's not strictly necessary, get rid of comment changes that aren't strictly necessary, etc.
Step 10: here and only here do you invest your time (worth 100x what you're paying them) to look at what they did. Here you can give them feedback to address anything you saw.
Step 11: Review.
Step 12: Profit $$$
I got a quite decent implementation of Finite State Machine and Thunk + Trampoline transformation of code in custom language I'm building in about 1 day, barely checking in while commuting to and from work on the train...
Occasionally, at step 11, you will find a gigantic turd and wonder how the LLMs converged on this. But, typically, it's at least good enough at that stage.
I don't even waste my time looking at anything they've done until they've converged on a good design and implementation with no holes, no feedback, no notes that does what a minimal, summarized doc clearly states and follows the design principles. Because they DEFINITELY haven't in a one-shot.
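
Steps 4 through 8 are basically a loop, which could be sketched like this. `implement` and `review` are hypothetical stand-ins for however you call the implementing and reviewing LLMs; nothing here is a real API, and the round cap is my own addition so the loop can't run forever.

```python
from typing import Callable, List


def converge(implement: Callable[[str, List[str]], str],
             review: Callable[[str], List[str]],
             plan: str,
             max_rounds: int = 10) -> str:
    """Alternate implement/review until the reviewer has no major action items."""
    feedback: List[str] = []
    work = ""
    for _ in range(max_rounds):
        work = implement(plan, feedback)  # steps 4 and 8: let them go wild
        feedback = review(work)           # step 5: a second LLM reviews
        if not feedback:                  # no major action items left
            break
    return work


# Toy stand-ins so the sketch runs: the "reviewer" complains until revision 3.
def fake_implement(plan: str, feedback: List[str]) -> str:
    fake_implement.rounds += 1
    return f"{plan} (revision {fake_implement.rounds})"


fake_implement.rounds = 0


def fake_review(work: str) -> List[str]:
    return [] if "revision 3" in work else ["address the holes"]


print(converge(fake_implement, fake_review, "FSM transform"))
```

The human review in steps 10 and 11 happens only after this loop exits, which is the whole point of the workflow.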

nl 1 month ago

I think it's useful to be realistic about what you can do with a local model, especially something as small as the 9B the author is using. A 9B model is around the level of Sonnet 3.6 - it can do autocomplete and small functions but it loses track trying to understand large problems.

But they are interesting and fun to play with! I do a LOT of work on local agent harnesses etc, mostly for fun.

My current project is a zero install agent: https://gemma-agent-explainer.nicklothian.com/ - Python, SQL and React all run completely in browser. Gemma E4B is recommended for the best experience!

This is under heavy development, needs Chrome for both HTML5 Filesystem API support and LiteRT (although most Chromium based browsers can be made to work with it)

It's different to most agents because it is zero install: the model runs in the browser using LiteRT/LiteLLM (which gives better performance than Transformers.js), and Filesystem API gives it optional sandbox access to a directory to read from.

It is self documenting - you can ask questions like "How is the system prompt used" in the live help pane and it has access to its own source code.

There's quite a lot there: press "Tour" to see it all.

Will be open source next week.

furyofantares 1 month ago

But I was doing a lot more than autocomplete and small functions with Sonnet 3.5.
- Xeoncross 1 month ago
  
  I agree, earlier Sonnet wasn't that great, but Sonnet 3.5 is where things really came together. The difference was night-and-day. Sonnet 3.7, 4.0, 4.5, etc... didn't have as drastic of a change to me.
  
  walthamstow 1 month ago
  
  I remember even after 3.7 was released I kept using 3.5 in Cursor because it just did exactly what I wanted
potatoman22 1 month ago

Not to be nitpicky, but many of the 4-12b models are somewhere between GPT-3.5 and GPT-4o-mini. It's hard to find a good comparison though, because the benchmarks people score models against change so often. For reference, Sonnet 3.6 came out about a year after GPT 3.5
- nl 1 month ago
  
  Don't worry about being nitpicky! I'm going to out-nitpick you....
  Actually....
  I write and publish my own benchmark for this stuff. It's an agentic SQL benchmark which isn't in the training data yet and I've found can separate frontier models from close-followers (the only models to get 100% are Opus 4.6 and GPT 5.5).
  The best small model I've found is a fine-tune of Qwen 3.5 9B which scores 18/25: https://sql-benchmark.nicklothian.com/?highlight=Jackrong_Qw...
  Haiku 4.5 scores 20/25, and Haiku is certainly better than Sonnet 3.6. GPT 3.5 scores 13/25.
  
  potatoman22 1 month ago
  
  Neat! It seems like Qwen 9b took the same amount of time as gemma4-e4b too, which is interesting. I haven't been able to get Qwen to stop thinking so much

PAndreew 1 month ago

Critics are (rightly) pointing to the fact that these models are not on par with SOTA for complex coding tasks. But many seem to forget that a large part of white collar office work is Excel crushing, file moving, translating dry legal documents, e-mail drafting, PPT drudgery, etc. These are absolutely doable with 30-35b+ models with the added benefit of keeping company data private.

tjoff 1 month ago

Arguably excel and legal are much worse than code because catching the mistakes can be much harder.
Case in point, JPMorgan London Whale incident, $6 billion loss caused by an excel error...
- PAndreew 1 month ago
  
  Yes... I mean organisations have to adapt to this new working scheme. First they need new processes (maybe borrowed from SW development) that enable them to triage work products on a risk/reward scale. For example my wife works on medical device tenders. It is obligatory to translate every frikkin Word document to our native language which in the end no one will read. Do we use LLMs to do the translation? Hell yeah. For a critical legal document? Eeee. Also I think enablers like special harnesses should be developed/improved with these folks in mind, for example by building hooks into the harness that force the LLM to test/review/sample its output. So yes it's a complex topic, but my point was rather that the inherent capabilities of medium-large-ish open LLMs are sufficient for let's say 70-80% of such office work, and it's a huge market.
2ndorderthought 1 month ago

I think the conclusion is flawed here? Sure qwen3.5 9b is nowhere near the sota models. It's 9b and was made a year ago? Everyone talking about local models is pumped about the models released in April this year. Qwen 3.6 27b and qwen 35b a3b if you have a sad GPU. Those are comparable to sota models, seriously.

sourc3 1 month ago

I am running qwen 3.6 9b quantized model on my m4 pro 48gb and it is barely useful to do some basic pi.dev/cc driven development. I think 128gb desktops are the sweet setup to actually get meaningful work done. However, getting your hands on one of these machines is difficult at the moment.

As much fun as it is to run these things locally don’t forget that your time is not free. I am slowly migrating my use cases to openrouter and run the largest qwen model for < $2-3/day with serious use for personal projects.

hparadiz 1 month ago

How does it (the openrouter version) compare to ChatGPT 5.5 or Claude Opus 4.6?
- sourc3 1 month ago
  
  Good enough. It gets 60-70% of the work I need done for a lot less $ (keep in mind I am using these for personal projects that don’t generate revenue). If I was using it with the hopes of making money I think I would just use Codex at this point.
carbocation 1 month ago

Was the choice of such a small model driven by a desire for high tok/sec? I ask because an m4 pro 48gb machine can run larger models (if model intelligence is the thing that would make it more useful).
- sourc3 1 month ago
  
  Yes that was my goal. Also noticed a huge performance gain going from ollama to mlx. Your mileage may vary.
elij 1 month ago

I'm using the 30b MoE model on the same spec with 65k tokens as a sub agent with tooling and it absolutely writes decent code. The dense 9b I agree wasn't great.
sjones671 1 month ago

Thanks for saying this. There's so much nonsense out there online about local models being better than Opus 4.7 and the like. It's just not true for regular users.
I have a brand new M5 MacBook Pro - top end with all the specs and I've tried local models and they're barely functional.
- Yukonv 1 month ago
  
  What models and quantizations have you been trying? I've had great success with the larger Qwen 3.x models at 6-bit levels. Using 6 bit quantization is really the bare minimum to give local models a fair shot at agentic flows. Once you start pushing below that the models become more "dumb" from the limited bit space.
- SecretDreams 1 month ago
  
  The main benefits for local are:
  1) control 2) privacy 3) transparent cost model
  Cloud has tremendous value for speed, plug and play, and performance. You need to decide how those compete with the benefits of local - both today, and a year from now, e.g.
Casteil 1 month ago

Why not 35b-a3b? ...or gemma4:26b-a4b? Both will be more capable than 9b and run at roughly similar (perhaps faster) speeds

ionwake 1 month ago

I have an M4 MacBook Air with 32GB.

These are my current results for my models:

  ┌──────────────────────┬───────────┬─────────────┐
  │        Model         │   Size    │ Tokens/sec  │
  ├──────────────────────┼───────────┼─────────────┤
  │ gemma-4-e4b-it-mlx   │ ~4B (MLX) │ ~10.5 tok/s │
  ├──────────────────────┼───────────┼─────────────┤
  │ qwen3-8b-uncensor-v2 │ 8B        │ ~6.3 tok/s  │
  ├──────────────────────┼───────────┼─────────────┤
  │ qwen3-14b-uncensored │ 14B       │ ~3.5 tok/s  │
  └──────────────────────┴───────────┴─────────────┘

I seem to be doing ok with the Gemma model for file parsing / handling.

ActorNightly 1 month ago

<=10 tok/sec is unusable. You are faster writing the code yourself.

rapatel0 1 month ago

I got qwen3.6:27B running on my 4090 (24GB) with ~128K context leveraging some of the recent turboquant/rotorquant memory optimizations for activations. Highly suggest going up to that. the q4_xl+rotorquant combo is pretty good.

Some reference code if you want to throw your agent at it. https://github.com/rapatel0/rq-models

dmichulke 1 month ago

Forgive my ignorance but aren't they already on huggingface?
I assumed turboquant optimizations are already everywhere - in llama-cpp, or the quantization machinery of unsloth and the likes.
- rapatel0 4 weeks ago
  
  I forked it to also add rotorquant. This is a specific optimization that uses Clifford rotors instead of a static compile-time random permutation to store the activations. Reduces space and parameter count for the storage.
altruios 1 month ago

What is your experience with performance past 40k tokens? I've not gone past that, as I've heard reports that that's where problems start to arise. I'd be happy to know your experience in that regard.
- rapatel0 4 weeks ago
  
  I'm super happy with the performance, I generally run with 2 parallel slots so I only get about 128K context window. My experience with all llms is that they get more forgetful if you use the full window. (256-512K is the sweet spot for frontier models, 128k works for me with this current qwen)

isaisabella 1 month ago

I'd rather spend thousands of dollars on a Mac than subscribe to an API. The local model allows me to do my work any time and anywhere, without worrying about privacy leaks.

claysmithr 1 month ago

me too. plus, I don't like the idea of needing massive datacenters, it's not good for anybody
- losvedir 1 month ago
  
  Datacenters are more efficient, though, because of batching.

canpan 1 month ago

Recent models (Qwen 3.6 and Gemma) can really do coding locally. Feels like SOTA from maybe a year ago? But you would want about 32-40GB total memory. 24GB is just a bit short of that. A gaming PC with 16GB graphics card and 32GB RAM brings you very close to a usable coding system.
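
A rough way to sanity-check memory figures like these is to estimate the size of the weights alone as parameters times bits per weight, divided by 8. This is only a back-of-the-envelope sketch: it ignores the KV cache, runtime overhead, and the fact that real quantization formats use slightly more bits per weight than their nominal label. The function name is made up for illustration.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GB (1 GB = 1e9 bytes).

    Assumption: every weight is stored at the same nominal bit width.
    KV cache, activations, and OS overhead are NOT included.
    """
    return params_billion * bits_per_weight / 8


# Example: a 27B dense model at a few common nominal bit widths.
for bits in (4, 6, 8):
    print(f"27B at {bits}-bit: ~{weight_gb(27, bits):.1f} GB of weights")
```

By this estimate a 27B model is about 13.5 GB of weights at 4-bit and 27 GB at 8-bit, before any context is allocated, which is consistent with 24GB feeling tight.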

DrBenCarson 1 month ago

How are you using that RAM with the GPU?
- canpan 1 month ago
  
  Llama.cpp with automatic offload to main memory. You can also use Ollama, it is easier, but slower.
  
  reverius42 1 month ago
  
  For those who want a GUI, LM Studio does this too (with llama.cpp as the backend I think). I'm getting great (albeit slow) results with Qwen3.6-35B MoE on 8GB GPU RAM, 40GB system RAM.
solenoid0937 1 month ago

> Feels like SOTA from maybe a year ago?
Agree but only for small projects. SOTA from a year ago still wins on larger projects
wktmeow 1 month ago

That’s the exact ram/vram combo of my desktop - what model would you suggest for that gaming pc setup?
- canpan 1 month ago
  
  I would recommend starting with Qwen 3.6 35B at maybe Q5; it should be fast in that setup. For intelligence, Qwen 3.7 27b is smarter but will run much more slowly. Others also mention gemma 4, which might be worth a try.

ThomasBb 1 month ago

Beyond the models getting better; there are still huge gains available in the inference engine side with new tricks like Dflash, MRT, turboquant - for some usecases these can multiply the speeds. There are even some model specific optimized kernels like for DeepSeek 4 flash that seem wild.

Makes me feel we are nowhere near the optimum yet.

Examples: https://dasroot.net/posts/2026/05/gemma-4-speed-hacks-mtp-df...

https://x.com/bindureddy/status/2052982206344409242?s=46

brrrrrm 1 month ago

what's MRT?
- ThomasBb 1 month ago
  
  Sorry, autocorrect got me there: MTP is what I meant.

adam_arthur 1 month ago

I recently found Gemma 4 e4b surprisingly effective for small "classification" style tasks for something I'm doing at work.

In this case, picking out "semantic" css classes on single dom nodes.

Was able to run it on my 4(?) year old M2 mbp with 16GB of ram and it runs in only 100ms or so per query. Probably it can run much faster, but haven't experimented with batching etc

With tight and targeted context control, you can use extremely small models for useful things. Ideally with problems where the harness can be mostly deterministic and you have known bounds on what you're trying to do
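
A minimal sketch of what a "mostly deterministic" harness with known bounds could look like for this kind of classification: the model only ever picks from a fixed label set, and the harness rejects anything else. The `classify` function, the prompt wording, and `fake_model` are all hypothetical; `model` stands in for whatever call reaches your local model.

```python
from typing import Callable, Optional, Sequence


def classify(
    text: str,
    labels: Sequence[str],
    model: Callable[[str], str],
    retries: int = 2,
) -> Optional[str]:
    """Ask `model` to pick one of `labels` for `text`; reject any other reply."""
    # Map lowercase label -> original label so matching is case-insensitive.
    allowed = {label.lower(): label for label in labels}
    prompt = (
        "Pick exactly one label for the input and reply with the label only.\n"
        f"Labels: {', '.join(labels)}\n"
        f"Input: {text}\n"
        "Label:"
    )
    for _ in range(retries + 1):
        reply = model(prompt).strip().lower()
        if reply in allowed:
            return allowed[reply]
    # Known bounds: if the model never produced a valid label, say so explicitly.
    return None


# Stub standing in for a real local model call (assumption, not a real API).
def fake_model(prompt: str) -> str:
    return " Button \n"


print(classify('<div class="btn primary">', ["button", "card", "nav"], fake_model))
```

The point of the design is that the caller only ever sees a valid label or `None`, so everything downstream of the model stays deterministic.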

rtpg 1 month ago

What kinda harness do people use with these local models? I am quite happy with the Claude Code permission model and interface in general for coding stuff (For chat-y interfaces I have no real opinion)

nu11ptr 1 month ago

Still trying to understand if a Macbook Pro M5 Max with 128GB is likely going to be able to run coding models well enough that I can cancel my Codex, or even go down to the $20/month plan.

guessmyname 1 month ago

A 128GiB MacBook Pro in Canada is what, north of CAD $11k after tax? That’s around USD $7k. At $20/month for a cloud AI subscription, you’re looking at almost 30 years of service for the same money.
How long do people realistically expect a laptop to stay competitive with SOTA local models? Especially in a space where model sizes, context windows, and inference requirements keep moving every year.
And even if the hardware lasts, the local experience usually doesn’t. A heavily quantized local model running at tolerable speeds on consumer hardware is still nowhere near frontier hosted models in reasoning, coding, multimodal capability, tool use, or reliability.
The economics just don’t make sense to me unless you specifically need offline inference, privacy guarantees, or low latency for a niche workflow. Otherwise you’re tying up $10k upfront to run an approximation of what you can already access through a subscription that continuously improves over time.
You could literally put the difference into index funds and probably cover the subscription indefinitely from the returns alone, even accounting for gradual price increases.
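
The break-even arithmetic above can be written out as a small sketch. The function name is made up, and the calculation deliberately ignores resale value, electricity, and subscription price changes.

```python
def breakeven_months(hardware_cost: float, monthly_fee: float) -> float:
    """Months of subscription that the same money would buy."""
    return hardware_cost / monthly_fee


# Figures from the comment: roughly USD $7k of hardware vs. a $20/month plan.
months = breakeven_months(7000, 20)
print(f"{months:.0f} months, about {months / 12:.1f} years")
# prints "350 months, about 29.2 years"
```

The answer is very sensitive to the monthly figure: plugging in a heavier usage level shortens the break-even period proportionally.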
- nu11ptr 1 month ago
  
  You are assuming I'd only get it for that. That would probably just be the straw that broke the camel's back, but I'm already thinking about a purchase even if that doesn't work out.
- tom_ 1 month ago
  
  But what if you were going to buy a laptop anyway? Obviously you can't do anything with less than 64 GBytes these days, so the question is just whether you go for the jump to 128.
  In the UK, it's currently an extra £800 to get a 128 GB vs the 64 GB equivalent. So that's more like 3 years of Claude - I think? - assuming current prices stay the same.
  Or: you might just feel like £800 isn't an unjustifiable amount of money (one way or another), and tick the box, on the basis that it might just work out. As the saying goes, in for 459,900 pennies, in for £5,399...
  
  gabagool 1 month ago
  
  > Obviously you can't do anything with less than 64 GBytes these days
  I don't think that's true. Plenty of people can run basic workflows at 8GB on the MacBook Neo and most others are fine at 16 GB.
  
  nu11ptr 1 month ago
  
  I am a developer, as many of us on here are. I currently have 32GB of RAM and am constantly fighting swap. 64GB would be min even w/o local model.
  
  jval43 1 month ago
  
  Realistically it's 48 M5 Pro vs 128 M5 Max due to constraints on how you can configure them. So a more substantial difference of ~2k US.
  
  tom_ 1 month ago
  
  I didn't click through the full UI to get the lead time or anything; I just looked at the options presented on the UK site. Maybe there's a stock of laptop types here that have all sold out elsewhere? Or maybe they were just teasing me, and I'd have been hit with a 6+ month delivery time if I'd gone all the way.
  
  winrid 1 month ago
  
  I rebuilt the entire fastcomments moderation UI 2yrs ago with webstorm on my 16gb thinkpad. 64gb is nice but not needed. I wonder whether software would be so resource hungry if every dev didn't use an M4 Pro...
- brcmthrowaway 1 month ago
  
  This is one of the best takedowns of local models I've ever seen.
  I just hate paying money for cloud subscriptions, and work has given me a decent laptop
- knollimar 1 month ago
  
  You have to use the item a lot, to the point where you'd be exceeding subs a lot
- 2ndorderthought 1 month ago
  
  You can buy a used GPU for under 400 dollars if you already have a desktop and run qwen 3.6 a3b and for a majority of frontier tasks get by just fine. Why do you need to spend 10k on a laptop, we are swimming in ewaste.
- dale_glass 1 month ago
  
  Buy a Framework Desktop with 128 GB instead. It's half the price, though I bought mine for even less before RAM prices went crazy.
Yukonv 1 month ago

Have been using Qwen 3.6 27b recently, along with various other models over the last month, and it is very capable at writing code, to the point where I haven't needed a subscription for 95% of what I throw at it. Been using it to write extensions for Pi to expand the tool kit without much fuss, as one example. Is it as fast or SOTA? No, but you can't ignore how functional it is on hardware you own. Where it can begin to struggle is with too open-ended prompts or investigating complex technical issues. At that level its knowledge is not high enough to solve those problems on its own.

Xeoncross 1 month ago

Is it better to have an M4-M5 Pro with 32GB of ram or an M1-M2 Max with 64GB of ram? They seem about the same price.

It seems like cache layers like https://omlx.ai make more RAM better than more GPU cores or faster CPU cores, but I'm curious if someone has tested both.

threetonesun 1 month ago

When I was considering a local setup the M1 Ultra Studios with 128 GB of RAM seemed to be the best price:performance at the time. I think RAM always wins out.
Also minor note: the M4/5 Pros come in multiples of 12, so it's a 24/36 or 48GB set up.

y42 1 month ago

Having an M3 with 36 GByte, I was under the assumption that I could use Qwen and similar models. It's quite easy to set up: you can use pi or hermes for CLI access, or "Continue" to use it in VS Code. You can choose between omlx, Ollama and even more to run the model itself. It's no rocket science, but the results are also not satisfying.

I use it occasionally for very easy tasks: fixing typos or updating metadata in blog posts. So yeah, it improves productivity. But coding-wise it's far away from Codex, Claude et al.

noashavit 1 month ago

Gemma4 is a huge improvement and it's fast. Qwen 3.5 really slows down my machine though. LMK if there is a better model to use for the code assist aspect- performance wise

Casteil 1 month ago

Qwen3.5/3.6 are really prone to looping and 'overthinking'. Gemma4 doesn't seem to have the same problems.
- altruios 1 month ago
  
  Gemma also doesn't have the same 'agentic' capabilities as qwen3.6.
  Simple test failed: sending "1","2","3" as separate messages using an openclaw harness.
  I tested a few other "follow these instructions" tests. Qwen3.5/6 were able to follow along, gemma was not able to.

dizlexic 1 month ago

Thanks for sharing. I made a post earlier on bluesky describing my random setup on 32gb M2 studio. I'd love feedback. I'm a monkey and if I don't see I can't do.

https://bsky.app/profile/mooresolutions.io/post/3mliilyf2i22...

neverrroot 1 month ago

A few interesting undertone points in the article, for those who care:

- reliance on US technologies is not so good, but on Chinese is not discussed, just chosen

- environmental cost is of concern

- so are the energy costs

In the end, there are some clear tips on how to configure the LLM, but overall the article is a bit thin and rather biased.

claysmithr 1 month ago

This is very good, local models run well on my M3 Air 24GB, to the point where I may prefer it even if it takes longer. The benefits are

  - private

  - local

  - no internet required

  - works well enough for most tasks

  - "free"

  - will pop this "AI" bubble as word spreads

I got pretty good results with the model in the article on my machine. Sure, it took forever, but that doesn't matter to me as much, and it's kind of cool just watching it do its thing through LM studio. The result was also impressive enough for me that I would actually use it.

Why pay $20/mo when local is good enough?

abalashov 1 month ago

I have an M4 Max MBP w/ 128 GB of RAM, and have been very into local models. Qwen3.6-35B-A3B has been very kind to me for almost any purpose. No, it's no Opus 4.7, but it's shockingly good, and I don't have to share my code with Dario.

mrdependable 1 month ago

I've been thinking about getting the M5 Max MacBook Pro when it comes out to run local models on, but I'm worried it would make the computer sluggish to work with. Is it better to just get some Strix Halo mini PC to be a dedicated local machine?

busfahrer 1 month ago

I am considering a M5 Pro (18/20C) Macbook with 64GB of RAM, but I'm having a really hard time finding benchmarks of real world models:

Could somebody please provide some tokens-per-second numbers for example for Qwen 3.6 35B/A3B, specifically for Q4 and Q6 quants?

Galanwe 1 month ago

My advice: don't just look at tokens per second, but also at time to first token (TTFT).
The local inference space is leaning to MoE models, and a lot of them have decent tokens / second, but horrible TTFT.
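
To see why both numbers matter, here is a rough sketch of total wall-clock time for one response: time to first token plus output length divided by generation speed. The function name and the example figures are illustrative assumptions, not measurements from any machine in this thread.

```python
def response_seconds(ttft_s: float, tokens_per_s: float, output_tokens: int) -> float:
    """Approximate wall-clock time for one response.

    Assumption: generation speed is constant after the first token.
    """
    return ttft_s + output_tokens / tokens_per_s


# Two hypothetical setups producing a 200-token answer:
fast_decode_slow_start = response_seconds(ttft_s=20.0, tokens_per_s=50.0, output_tokens=200)
slow_decode_fast_start = response_seconds(ttft_s=1.0, tokens_per_s=12.5, output_tokens=200)
print(fast_decode_slow_start, slow_decode_fast_start)
```

In this made-up case the setup with four times the tokens per second still finishes later (24 s vs 17 s) because of its TTFT, and the gap widens for short answers or long prompts.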
Casteil 1 month ago

You can expect around 55-60t/s with Qwen3.5:35b-a3b or gemma4:26b-a4b Q4

MinimalAction 1 month ago

Well, but if I have a MacBook Air M4 with 16GB, I don't know what useful models I can run.

jen20 1 month ago

`brew install llmfit`
prettyblocks 1 month ago

If you use lmstudio it will tell you which models will fit in what you've got available.

sourcecodeplz 1 month ago

Running LLMs locally is fun and powerful, but if you want to get work done... it is a big headache. You have to pre-plan and plan, and make specs, etc... The big OpenAI and Claude models just get you with a few sentences.

rvnx 1 month ago

It's actually technically easy now to run a large model at home for offline use (thanks to the Chinese who release their top-notch models).
The main problem is finding the money :/
Farmadupe 1 month ago

Yup, especially when for a lot of us, the price of the frontier subscription has become a cost of doing business over the last 6 months.
If you're already doing big boy stuff with big boy models, then... just carry on trucking!
Only place I'd differ is for vision/OCR tasks. Small/medium open weights models are as good as SoTa, and token prices for prefill are kinda very not worth it for larger batch tasks.
Other thing that people forget is, if you want to have even a smallish LLM as a reliable personal service, you've got to carve out 16-24 GB of (V)RAM and leave it permanently running.

BubbleRings 1 month ago

People do use SOTA LLM’s for other things besides computer programming.

For instance, if you are an independent inventor trying to write a patent while keeping your patent lawyer expenses to a minimum, you want to write as much of the first draft(s) of the patent as possible yourself. (You’ll save billable hours with your patent lawyer, and you’ll end up with a better patent because you’ll communicate your innovations more clearly to your lawyer.)

However, and this is the big thing, you absolutely do not want to be asking a SOTA LLM for help with the language in your patent application. This is because describing your invention to a web based LLM could be considered a public “disclosure” of your invention, which, (after a one year grace period goes by), could put your invention in the public domain, basically… and thereby prevent you (or anyone else) from being able to ever patent the invention. Plus, you know, a random unscrupulous employee at the SOTA company could be reviewing logs and notice your great idea, and file a patent on it before you do. Remember, the United States patent office went to “first inventor to file” in 2013.

Oh and don’t take legal advice from random people on the internet by the way.

dempedempe 1 month ago

It takes people years to learn how to write a good patent. If you gave your lawyer your attempt at writing your own patent, they might use the info to understand what you want (you're right about that), but a good lawyer would probably just start from scratch.
Imagine you're a contractor. You have a client who knows nothing about software development that wants you to write some software for them. They give you some code they generated with an LLM to get you started. Would you use the code or start over?

amelius 1 month ago

I'd pick a much more open system with more capabilities for a little bit more money, e.g. a Jetson Orin 64GB (unified memory). Runs Linux out of the box.

spike021 1 month ago

I'll have to try some more. I've been playing with gpt-oss 20b on my M4 24GB but it hasn't been the best experience.

reillyse 1 month ago

so, interested how many people are running higher end AI models locally? Figure if I'm spending $800/month on tokens I can build a pretty beefy local machine for the cost of a few months spend - what is people's experience with say a $5k server custom built (and only for) running an AI model.

entrope 1 month ago

You will likely have to compromise on memory bandwidth or capacity under a $10k price. The Radeon R9700 has 32 GB of VRAM and is pretty cheap (~$1500 right now), which is what I primarily use. My home desktop has 128 GB RAM and my laptop has 96 GB RAM, but bandwidth limits make most models slow on those CPUs. Models with multi-token prediction are somewhat usable on them: Nemotron 3 Super runs reasonably well on my desktop but does poorly on agentic coding that I've given it; my laptop can run Qwen3.6-27B reasonably well with a version of llama.cpp that is patched for MTP support; but usually I run Qwen3.6-27B on my R9700. vLLM might support two or three R9700s on some OS, but I've not been able to get it to run at all with Ubuntu 26.04: system ROCm version is apparently different than what's in the container images, and system OpenMPI v5.0 finally removed C++ bindings that were deprecated in 2005 but are linked from some Python wheel that vLLM (probably indirectly) imports.
If you are spending $800/month on tokens you are likely to notice degradation for local models compared to near-frontier models. The models I can run locally are consistently worse than Claude Sonnet 4.6 (again for the work I give them), although Qwen3.6 does feel almost like magic for its size because it can do a lot. The really big open-weight models should be better, but they want 200+GB RAM, which will need a correspondingly expensive CPU.
adornKey 1 month ago

I'm running a server in the 5K-league. And the results are very good. I get about 150 Tokens/s from Qwen3 for coding. And about 50 Tokens/s from the newer non-MoE Qwens.
I wouldn't bother with less than 32GB of VRAM. With 16GB you can already run something usable, but 32GB gives you much more power. 9B and 14B are only interesting if you want to tune models yourself. The sweet spot now seems to be around 27B-35B.
2ndorderthought 1 month ago

Check in with /r/localllama. There are setups ranging from 100gb of vram built from complete ewaste to single 8gb GPU inference machines. Depends on what you want and can afford.

compiler-devel 1 month ago

I don’t understand the bipolar nature on hacker news towards LLMs. On the one hand, they’re destroying the art of software development and we shouldn’t use them. But on the other hand, there’s a lot of excitement around running them locally.

I understand that multiple things can be true at the same time. Is the concern for centralized AI monopolization? Or is the concern for the art of software engineering?

ahoy 1 month ago

hn is not a single person, there are a variety of people here with a variety of experiences and opinions and biases.
Havoc 1 month ago

>I don’t understand the bipolar nature on hacker news towards LLMs.
There are actually multiple people here. It's not just one person with many accounts...

lazylizard 1 month ago

i thought it was a small typo of

https://www.techpowerup.com/gpu-specs/tesla-m40-24-gb.c3838

and wanted to ask what version of nvidia driver and cuda...

ionwake 1 month ago

"burn your thighs without getting anything out of it." what a phrase. love it.

tjpnz 1 month ago

How about a M4 with 16GB of memory?

adamsb6 1 month ago

I realized that should I end up getting laid off soon I won't have an unlimited token budget and for the workflows I've settled into it would be quite expensive. So I was exploring what it would take to run open models at home.

Was quite disappointed to see that the PC side hasn't kept up. The unified architecture on Macs makes it very hard to justify spending money on a Linux machine for inference workloads.

rs38 1 month ago

my latest experiments with local LLMs (mistral coder variations) fitting in an older 6 GB GTX1060 were disappointing once you try to hook Copilot (CLI or VScode) to it and are used to providing a lot of tooling. this seems to bloat the initial prompt to 20k and more, which seems to be the bottleneck if I did not completely misconfigure things. output tokens/s are more than fine, but PP is frustrating / unusable.

stuaxo 1 month ago

"What does work is a more interactive workflow where you’re clearly communicating with the model step by step, and giving it a lot of guidance. I’m sure that sounds pointless to many of you, why use a model where you have to babysit it as it works, but I actually found that it encouraged me to be more engaged. "

This sort of thing is key to knowing what's going on and not having your brain fully atrophy.

NBJack 1 month ago

I'm puzzled. The M4, as far as I know, doesn't have 24GB. Did the author mean a M40?

spoonyvoid7 1 month ago

M4 = M4 Macbook Pro
- teaearlgraycold 1 month ago
  
  Or Air
sertsa 1 month ago

M4 Mac Mini w/24GB sitting right here on my desk.
- NBJack 1 month ago
  
  Thanks; I assumed the author was talking about an Nvidia Tesla M4 (hence my confusion and assumption that they meant the M40 series, which has 24GB of VRAM).
tra3 1 month ago

There’s definitely an option with 24 gigs of ram: https://support.apple.com/en-ca/121552
- NBJack 1 month ago
  
  Ah, thank you. I was assuming a Nvidia Tesla M4.

kristianpaul 1 month ago

Good to keep the hideThinkingBlock default; it is on purpose, to be able to steer the model.

sbassi 1 month ago

A useful data point to know about this setup is how many tokens/sec it generates.

JBorrow 1 month ago

It’s stated in TFA
- NDlurker 1 month ago
  
  You can't expect someone to read 4 paragraphs into an article before commenting
  
  kennywinker 1 month ago
  
  @grok is this true?
  
  DrBenCarson 1 month ago
  
  Sorry, @grok is offline after declaring himself MechaMussolini earlier today

bluequbit 1 month ago

The site does not have ssl. Please can you enable it so that I can read the article?