I've been using GLM-4.6 since its release this month. It's my new fav. I use it via Claude Code and the simpler Octofriend https://github.com/synthetic-lab/octofriend
Hosting through z.ai and synthetic.new. Both good experiences. z.ai even answers their support emails!! 5-stars ;)
My experience using GLM-4.6 with Charm Crush has been absolutely incredible, especially with high thinking. This is on pretty hard tasks too, e.g. proving small lemmas with Lean.
I've had much less luck with other agentic software, including Claude Code. For these kinds of tasks, only Codex seems to come close.
Z.ai team is awesome and very supportive. I have yet to try synthetic.new. What's the reason for using multiple? Is it mainly to try different models or are you hitting some kind of rate limit / usage limit?
I tried synthetic.new prior to GLM-4.6...Starting in August...So I already had a subscription.
When z.ai launched GLM-4.6, I subscribed to their Coding Pro plan. Although I haven't been coding as heavily this month as in the prior two months, I used to hit Claude limits almost daily, often twice a day. That was with both the $20 and $100 plans. I have yet to hit a limit with z.ai, and the server response is at least as good as Claude's.
I mention synthetic.new as it's good to have options, and I do appreciate them sponsoring the development of Octofriend. z.ai is a Chinese company and, I think, hosts in Singapore. That could be a blocker for some.
Do you find yourself sticking with GLM 4.6 over Claude for some tasks? Or do you find yourself still wanting to reach for Claude?
I have been subscribing to both Claude and ChatGPT for over two years. I spent several months on Claude's $100 plan and a couple of months on ChatGPT's $200 plan, but otherwise I've used their $20/month plans.
I cancelled Claude two weeks ago. Pure GLM-4.6 now, plus a tad of Codex with my ChatGPT Pro subscription. I sometimes use ChatGPT for extended research stuff and non-tech.
$3 a month, and using it in Claude Code is a matter of changing a few env vars, which you copy and paste from their docs. Cost-benefit-wise, there is nothing better.
$6/month. It's $3 for the first month (or, on longer subscription cycles, $3/month for the first billing period; only the first unit of the cycle is at half price).
At $6/month it's still pretty reasonable, IMO, and chucking less than $10 at it for three months probably gets you to the next pop-up token retailer offering introductory pricing, so long as the bubble doesn't burst before then.
How do you use a non Anthropic model with Claude Code?
they have a Claude Code-specific endpoint...see the excellent docs https://docs.z.ai/devpack/tool/claude
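For reference, here's a minimal sketch of that setup, assuming the variable names and endpoint from z.ai's docs at the time of writing (check the link above for current values):

    # Point Claude Code at z.ai's Anthropic-compatible endpoint
    export ANTHROPIC_BASE_URL="https://api.z.ai/api/anthropic"
    export ANTHROPIC_AUTH_TOKEN="your-zai-api-key"  # from your z.ai dashboard
    claude

Claude Code reads ANTHROPIC_BASE_URL and ANTHROPIC_AUTH_TOKEN on startup, so no config file changes should be needed.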
For those interested in building Ollama locally: as of a few hours ago, experimental Vulkan Compute support (not yet in official binary releases) has been merged on the GitHub main branch, and you can test it on your hardware!
This one is exciting. It'll enable and accelerate a lot of devices on Ollama - especially AMD GPUs not fully supported by ROCm, Intel GPUs, and iGPUs across different hardware vendors.
Interesting to see more people mentioning GLM-4.6 lately — I’ve tried it briefly and it’s surprisingly strong for reasoning tasks. Curious how it compares to Claude 3.5 in coding throughput though?
> For users with more than 300GB of VRAM, qwen3-coder:480b is also available locally.
I haven't really stayed up on all the AI specific GPUs, but are there really cards with 300GB of VRAM?
In addition to the already mentioned Apple Mac Studio, NVIDIA sells the GH200 with up to 480GB of VRAM.
My local HPC went for the 120GB version, though, with 4 per node.
You can buy an M3 Ultra Mac Studio and configure it with 512 GB of memory shared between the CPU and the GPU. Will set you back about $9500.
No, you need multiple GPUs. These models are not intended to be run by the average user.
Not necessarily. You need either multiple GPUs or unified memory. There are a handful of UM platforms out there nowadays (mainly Macs, but AMD has some as well, albeit none with 300GB of RAM).
I've been disappointed to see Ollama list models that are supported by the cloud product but not the Ollama app. It's becoming increasingly hard to deny that they're interested in model inference only as a way to turn a quick buck.
Aren't these models consistently quite large and hard to run locally? It's possible that future Ollama releases will manage VRAM dynamically in a way that lets these models run with acceleration even on modest GPU hardware (for example, by loading the layers for a single 'expert' into VRAM on demand, and opportunistically batching computations that happen to rely on the same 'expert' parameters - essentially doing manually what mmap does for you in CPU-only inference), but such tricks will nonetheless come at a non-trivial cost in performance.
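To make that concrete, here's a minimal Python sketch of the expert-paging idea: an LRU cache of experts "resident" in VRAM, plus grouping tokens by their routed expert so each expert is paged in at most once per layer pass. All names here are hypothetical; real inference engines implement this quite differently.

    from collections import OrderedDict

    class ExpertCache:
        """LRU cache of expert weights 'resident' in VRAM (hypothetical)."""
        def __init__(self, capacity, load_fn):
            self.capacity = capacity        # max experts resident at once
            self.load_fn = load_fn          # pages one expert in from disk/RAM
            self.resident = OrderedDict()   # expert_id -> weights

        def get(self, expert_id):
            if expert_id in self.resident:
                self.resident.move_to_end(expert_id)   # mark recently used
            else:
                if len(self.resident) >= self.capacity:
                    self.resident.popitem(last=False)  # evict least recently used
                self.resident[expert_id] = self.load_fn(expert_id)
            return self.resident[expert_id]

    def run_moe_layer(tokens, routes, cache, apply_expert):
        """Group tokens by routed expert so each expert loads at most once."""
        by_expert = {}
        for i, expert_id in enumerate(routes):
            by_expert.setdefault(expert_id, []).append(i)
        out = [None] * len(tokens)
        for expert_id, idxs in by_expert.items():
            weights = cache.get(expert_id)             # one page-in per expert
            for i in idxs:
                out[i] = apply_expert(weights, tokens[i])
        return out

    # Toy usage: room for 2 experts in "VRAM" at a time.
    cache = ExpertCache(capacity=2, load_fn=lambda eid: eid * 10)  # fake weights
    print(run_moe_layer([1, 2, 3, 4], [0, 1, 0, 3], cache,
                        lambda w, tok: w + tok))  # -> [1, 12, 3, 34]

The batching is what makes or breaks this: without it, a token stream bouncing between experts would thrash the cache exactly the way the comment above predicts.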
Qwen3-coder:30b is in the blog post. This is one that most users will be able to run locally.
We are in this together! Hoping for more models to come from the labs in varying sizes that will fit on devices.
How does Qwen3-Coder:30B compare to Instruct-2507 as a coding agent backend? I was under the impression that Instruct was intended to supersede Coder?
I'm looking forward to future Ollama releases that might attempt parity with the cloud offerings. I've since moved on to the Ollama compatibility API on KoboldCPP, since they don't have any such limits with their inference server.
I am super hopeful! Hardware is improving, inference costs will continue to decrease, models will only improve...
I know this is disappointing, but what business model would be best here for Ollama?
1. Donationware - Let's be real: tokens are expensive, and if they asked everyone to chip in voluntarily, people wouldn't, and Ollama would go bust quickly.
2. Subscriptions (bootstrapped, no VCs) - As with 1, people would have to pay for the cloud service as a subscription for it to be sustainable (would you?), or it goes bust.
3. Ads - Ollama could put ads in the free version and let users pay for a higher tier to remove them; a somewhat good compromise, except developers don't like ads and don't like paying for their tools unless their company does it for them. No users = Ollama goes bust.
4. VCs - This is the current model, which is why they have a cloud product, and it keeps the main product free (for now). Again, if they cannot make money or sell to another company, Ollama goes bust.
5. Fully Open Source (and 100% free) with Linux Foundation funding - Ollama could also go this route, but it means they would no longer be a business for investors and would rely on the Linux Foundation's sponsors (Google, IBM, etc.) to keep the LF's funding sustainable. The cloud product might stay for enterprises.
Ollama has already taken money from investors, so they need to produce a return for them; 5 isn't an option in the long term.
6. Acquisition by another company - Ollama could get acquired and the product wouldn't change* (*until the acquirer jacks up prices or messes with the product), which ultimately kills it anyway as the community moves on.
I don't see any other way Ollama can avoid being enshittified without making a quick buck.
You just need to avoid VC backed tools and pay for bootstrapped ones without any ties to investors.
> I don't see any other way Ollama can avoid being enshittified without making a quick buck.
Me neither. The mistake they made was taking outside investment: now they're no longer in full control, and eventually they're gonna have to at least give the impression they give a shit about the investors, and it'll come at the cost of the users one way or another.
Please pay for tools that are independently developed; we really need more community funding of projects so we can avoid this never-ending spiral of VC-fueled-then-killed tools.
They got the investment before the company was even Ollama. They exist because their VC was OK with them pivoting to build the current product. It likely wouldn't exist without the funding.
Just a paste of llama.cpp without attribution.
Ollama is more than a paste. But the support for GLM 4.6 is indeed coming from llama.cpp: https://github.com/ollama/ollama/issues/12505#issuecomment-3...
I don’t know how much Ollama contributes to llama.cpp
> I don’t know how much Ollama contributes to llama.cpp
If nothing else, Ollama is free publicity for llama.cpp, at least when they acknowledge they're mostly using llama.cpp's work, which has happened at least once! I found llama.cpp by first finding Ollama; then I figured I'd rather avoid the lock-in of Ollama's registry, so I ended up using llama.cpp for everything.
By the way, you can use Hugging Face with Ollama, and local Modelfiles too.
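For example, pulling a GGUF straight from Hugging Face (the repo and quant tag here are just an illustration of the documented hf.co syntax):

    ollama run hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF:Q4_K_M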
You're saying that like you cannot do that with llama.cpp? I feel like most Ollama users have no idea which features/benefits come directly from llama.cpp rather than from Ollama itself...
I read the opposite: that you don't have to be locked in to Ollama's registry if you don't want to be.
Could you share a bit more of what you do with llama.cpp? I'd rather use llama-server, but it seems to require a good amount of fiddling with the parameters to get good performance.
The answer is 0
https://github.com/ollama/ollama?tab=readme-ov-file#supporte...
I mean, they have attributed it, but it's also open source software. I guess the more meaningful question is: why didn't ggerganov build Ollama if it was that easy? Or what is his company working on now?
>what is gg working on
supporting models so ollama can then 'support' them too
If you use llama.cpp server, it's quite a nice experience. You can even directly download stuff from Hugging Face.
e.g. https://nitter.poast.org/ggerganov/status/197849119484267774...
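To give a sense of the "fiddling" mentioned upthread, a typical llama-server invocation looks something like this. The model repo is an arbitrary example, and the right flag values depend entirely on your hardware:

    # -hf pulls the GGUF straight from Hugging Face; -ngl offloads layers
    # to the GPU, -c sets the context size, -t sets CPU threads
    llama-server -hf ggml-org/gemma-3-1b-it-GGUF -ngl 99 -c 8192 -t 8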
Sorry, I don't use 4chan, so I don't know what's said there.
May I ask what system you are using where you are getting memory estimations wrong? This is an area Ollama has been working on and has improved quite a bit.
The latest version of Ollama is 0.12.5, with a pre-release of 0.12.6.
0.7.1 is 28 versions behind.
I recently tested every version from 0.7 to 0.11.1 trying to run q5 mistral-3.1 on a system with 48GB of available VRAM across 2 GPUs. Everything past 0.7.0 gave me OOM or other errors. Now that I've migrated back to llama.cpp, I'm not particularly interested in fucking around with Ollama again.
As for 4chan, they've hated Ollama for a long time because they built on top of llama.cpp and then didn't contribute upstream or give credit to the original project.
Ah! This must be downloaded from elsewhere and not from Ollama? So sorry about this.
To help future optimizations for given quantizations, we have been trying to limit the quantizations to ones that fit for the majority of users.
In the case of mistral-small3.1, Ollama supports ~4bit (q4_k_m), ~8bit (q8_0) and fp16.
https://ollama.com/library/mistral-small3.1/tags
I'm hopeful that in the future, more and more model providers will help optimize for given model quantizations - 4-bit (e.g. NVFP4, MXFP4), 8-bit, and a 'full' model.
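For anyone unfamiliar with these formats, here's a toy Python sketch of the idea they share: small blocks of weights with one scale factor per block. To be clear, this is not the actual NVFP4/MXFP4 encoding (those use FP4 elements with their own block-scale formats); it just illustrates why block-wise 4-bit loses so little quality for a roughly 4x size reduction.

    import numpy as np

    def quantize_blocks_4bit(x, block=32):
        """Toy block-wise 4-bit quantization: one scale per block of 32."""
        x = x.reshape(-1, block)
        scales = np.abs(x).max(axis=1, keepdims=True) / 7.0
        scales[scales == 0] = 1.0                  # avoid divide-by-zero
        q = np.clip(np.round(x / scales), -8, 7).astype(np.int8)
        return q, scales                           # ~4 bits/weight + scales

    def dequantize_blocks_4bit(q, scales, shape):
        return (q.astype(np.float32) * scales).reshape(shape)

    w = np.random.randn(1024).astype(np.float32)
    q, s = quantize_blocks_4bit(w)
    w_hat = dequantize_blocks_4bit(q, s, w.shape)
    print("max abs error:", np.abs(w - w_hat).max())  # small vs. weight scale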
Yeah, I think the idea that models that don't come from ollama.com are second-class citizens was what first made me start to think about migrating back to llama.cpp, and then the memory stuff broke the camel's back. I don't want to use a project that editorializes about what models and quants I should be using; if I wanted a product I don't have control over, I'd just use a commercial provider. For what it's worth, for completeness' sake I actually did download the full fp16 and quantize it using Ollama, and still had the memory error.
I truly don't understand the reasoning behind removing support for all the other quants. It's really baffling to me, considering how much more useful running a 70b-parameter model at q3 is than not being able to run a 70b-parameter model at all, etc. Not to mention forcing me to download hundreds of gigabytes of fp16 because compatibility with other quants is apparently broken, and forcing me to quantize models myself.
So what do we do here?
What are some 100% free and open source alternatives that won't grift and won't enshittify that we can use instead?
Man, I don't know, Rick. I'm just reading comments on Hacker News, but maybe the llama.cpp one called out by GP could be a place to look? Not sure, Rick.
but that is VC funded
> but that is VC funded
The difference between Ollama and llama.cpp boils down to "venture-backed product company" vs "community OSS project; creator’s separate company has angel/VC-style pre-seed". I hope even you could squint and see the difference :)
Btw, I feel like it's in somewhat poor taste to comment on something that is effectively a competitor to you (even though you base your own product on it) without disclosing that you work full-time at Ollama Inc. At the very least, put the info in your profile.