> When regular software declines in quality, it’s usually noticeable through UI changes, release notes, or other signals.
Counterexample: 99% of average Joes have no idea how incredibly enshittified Google Maps has become, to just name one app. These companies intentionally boil the frog very slowly, and most people are incredibly bad at noticing gradual changes (see global warming).
Sure, they could know by comparing, but you could also know whether models are changing behind the scenes by having sets of evals.
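A pinned prompt set run on a schedule is usually enough to catch silent swaps. A minimal sketch with the google-genai Python SDK (the probes and the pass/fail rule here are just placeholders):

```python
import datetime
import json

from google import genai
from google.genai import types

client = genai.Client()  # assumes GEMINI_API_KEY is set in the environment

# Tiny frozen probe set; in practice you'd want dozens of task-specific prompts.
PROBES = [
    {"prompt": "What is 17 * 24? Answer with the number only.", "expect": "408"},
    {"prompt": "Name the capital of Australia in one word.", "expect": "Canberra"},
]

def run_probes(model: str) -> dict:
    results = {}
    for probe in PROBES:
        resp = client.models.generate_content(
            model=model,
            contents=probe["prompt"],
            config=types.GenerateContentConfig(temperature=0.0),  # keep runs comparable
        )
        answer = (resp.text or "").strip()
        results[probe["prompt"]] = {
            "answer": answer,
            "pass": probe["expect"].lower() in answer.lower(),
        }
    return results

snapshot = {"date": str(datetime.date.today()), "results": run_probes("gemini-2.5-pro")}
print(json.dumps(snapshot, indent=2))
# Persist each snapshot and alert when the pass rate or the answers drift.
```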
This is where switching costs matter. Take Google Maps: many people can't switch to another app. In some areas, it's the only app with accurate data, so Google can degrade the experience without losing users.
We can tell it’s getting worse because of UI changes, slower load times, and more ads. The signs are visible.
With LLMs, it’s different. There are no clear cues when quality drops. If responses seem off, users often blame their own prompts. That makes it easier for companies to quietly lower performance.
That said, many of us on HN use LLMs mainly for coding, so we can tell when things get worse.
Both cases involve the “boiling frog” effect, but with LLMs, users can easily jump to another pot. With traditional software, switching is much harder.
Do you mind explaining how you see this working as a nefarious plot? I don't see an upside in this case, so I'm going with the old "never ascribe to malice" adage.
I have no inside information, but it feels like they quantized it. I've seen patterns that I usually only see in quantized models, like getting stuck repeating a single character indefinitely.
They should just roll back to the preview versions. Those were so much more even keeled and actually did some useful pushback instead of this cheerleader-on-steroids version they GA'd.
Yes, I was very surprised, after the whole "scandal" around ChatGPT becoming too sycophantic, that there was this massive change in tone from the last preview model (05-06) to the 06-05/GA model. The tone is really off-putting. I really liked how the preview versions felt like intelligent conversation partners, and I recognize what you're saying about useful pushback; that set of models (the few preview iterations before this one) was my favorite, and I'm sad to see them disappearing.
Many people on the Google AI Developer forums have also noted either bugs or just performance regression in the final model.
I don't know, but it sure doesn't feel the same. I have been using Gemini 2.5 Pro (preview and now GA) for a while. The difference in tone is palpable. I also noticed that the preview took longer and the GA is faster, so it could be quantization.
Maybe a bunch of people with authority to decide thought that it was too slow/expensive/boring and screwed up a nice thing.
I now find Gemini terrible for coding. I gave it my code blocks and told it what to change, and it added tonnes and tonnes of needless extra code plus endless comments. It turned tight code into a papyrus.
ChatGPT is better but tends to be too agreeable, never trying to disagree with what you say even when it's stupid, so you end up shooting yourself in the foot.
Used to be able to use Gemini Pro free in cline. Now the API limits are so low that you immediately get messages about needing to top up your wallet and API queries just don't go through. Back to using DeepSeek R1 free in cline (though even that eventually stops after a few hours and you have to wait until the next day for it to work again). Starting to look like I need to setup a local LLM for coding - which means it's time to seriously upgrade my PC (well, it's been about 10 years so it was getting to be time anyway)
By the time you breakeven on whatever you spend on a decent LLM capable build, your hardware will be too far behind to run whatever is best locally then. It's something that feels cheaper but with the pace of things, unless you are churning an insane amount of tokens, probably doesn't make sense. Never mind that local models running on 24 or 48GB are maybe around flash-lite in ability while being slower than SOTA models.
Local models are mostly for hobby and privacy, not really efficiency.
Same for me. I've been using Gemini 2.5 Pro for the past week or so because people said Gemini is the best for coding! Not at all my experience with Gemini 2.5 Pro: on top of being slow and flaky, the responses are kind of bad. Claude Sonnet 4 is much better IMO.
They nerfed Pro 2.5 significantly in the last few months. Early this year, I had genuinely insightful conversations with Gemini 2.5 Pro. Now they are mostly frustrating.
I also have a personal conspiracy theory, i.e., that once a user exceeds a certain use threshold of 2.5 Pro in the Google Gemini app, they start serving a quantized version. Of course, I have no proof, but it certainly feels that way.
I think you are right and this is probably the case.
Although, given that I rapidly went from +4 to 0 karma, a few other comments in this topic are grey, and at least one is missing, I am getting suspicious. (Or maybe it is just lunch time in MTV.)
There was a significant nerf of the Gemini 2.5 Pro 03-25 checkpoint a little while ago, so much so that I detected it without even knowing there was a new release.
Totally convinced they quantized the model quietly and improved on the coding benchmark to hide that fact.
I’m frankly quite tired of LLM providers changing the model I’m paying for access to behind the scenes, often without informing me, and in Gemini’s case on the API too—at least last time I checked, they had updated the 03-25 checkpoint to the May update.
One of the early updates improved agentic coding scores while lowering other general benchmark scores, which may have impacted those kind of conversations.
I am very impressed with Gemini and stopped using OpenAI. Sometimes, I ping all three major models on OpenRouter but 90% is on Gemini now. Compare that to 90% ChatGPT last year.
Also me. I still pay for OpenAI; I use GPT-4 for Excel work, and it's super fast and able to handle the Excel-related tasks, like combining files, that come up often for projects I work on.
I don't like the thinking time, but for coding, journaling, and other stuff I've often been impressed with Gemini Pro 2.5 out of the box.
Possibly I could do much more prompt fine-tuning to nudge openai/anthropic in the direction I want, but with the same prompts Gemini often gives me answers/structure/tone I like much better.
Example: I had Claude 3.7 generating embedded images and captions along with responses. The same prompt in Gemini gave much more varied and flavorful pictures.
Love to see it, this takes Flash Lite from "don't bother" territory for writing code to potentially useful. (Besides being inexpensive, Flash Lite is fast -- almost always sub-second, to as low as 200ms. Median around 400ms IME.)
Brokk (https://brokk.ai/) currently uses Flash 2.0 (non-Lite) for Quick Edits, we'll evaluate 2.5 Lite now.
ETA: I don't have a use case for a thinking model that is dumber than Flash 2.5, since thinking negates the big speed advantage of small models. Curious what other people use that for.
Curious to hear what folks are doing with Gemini outside of the coding space and why you chose it. Are you building your app so you can swap the underlying GenAI easily? Do you "load balance" your usage across other providers for redundancy or cost savings? What would happen if there was ever some kind of spot market for LLMs?
In my experience, Gemini 2.5 Pro really shines in some non-coding use cases such as translation and summarization via Canvas. The gigantic context window and large usage limits help in this regard.
I also believe Gemini is much better than ChatGPT in generating deep research reports. Google has an edge in web search and it shows. Gemini’s reports draw on a vast number of sources, thus tend to be more accurate. In general, I even prefer its writing style, and I like the possibility of exporting reports to Google Docs.
One thing that I don’t like about Gemini is its UI, which is miles behind the competition. Custom instructions, projects, temporary chats… these things either have no equivalent in Gemini or are underdeveloped.
If you're a power user, you should probably be using Gemini through AI studio rather than the "basic user" version. That allows you to set system instructions, temperature, structured output, etc. There's also NotebookLM. Google seems to be trying to make a bunch of side projects based on Gemini and seeing what sticks, and the generic gemini app/webchat is just one of those.
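For what it's worth, the same knobs AI Studio exposes are also available through the API; a rough sketch with the google-genai Python SDK (model name and values are just examples):

```python
from google import genai
from google.genai import types

client = genai.Client()  # assumes GEMINI_API_KEY is set

resp = client.models.generate_content(
    model="gemini-2.5-pro",
    contents="Summarize the following notes in three bullet points: ...",
    config=types.GenerateContentConfig(
        system_instruction="Be terse. No preamble, no follow-up questions.",
        temperature=0.3,
        response_mime_type="application/json",  # structured (JSON) output, if you want it
    ),
)
print(resp.text)
```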
My complaint is that any data within AI Studio can be kept by Google and used for training purposes — even if using the paid tier of the API, as far as I know. Because of that, I end up only using it rarely, when I don’t care about the fate of the data.
Can you elaborate on “paid” ? Because I honestly still have no idea if my usage of AI Studio is used for training purposes.
I have google workspace business standard, which comes with some pro AI features. Eg, Gemini chat clearly shows “Pro”, and says something like “chats in your organization won’t be used for training”.
On AI Studio it’s not clear at all. I do have some version of paid AI services through Google, but no idea if it applies to AI studio. I did create some dummy Google cloud project which allowed me to generate api key, but afaik I still haven’t authorized any billing method.
Thank you for clarifying that. I’ve researched this once again and confirmed that Google treats all AI Studio usage as private if there’s at least one API project with billing enabled in an account.
Yes. I haven't had problems with the output limit so far, as I do translations iteratively, over each section of longer texts.
What I like the most about translating with Gemini is that its default performance is already good enough, and it can be improved via the one million tokens of the context window. I load to the context my private databases of idiomatic translations, separated by language pairs and subject areas. After doing that, the need for manually reviewing Gemini translations is greatly diminished.
I tried swapping for my project which involves having the LLM summarize and critique medical research and didn’t have great results. The prompt I found works best with the main LLM I use fucks up the intended format when fed to other LLMs. Thinking about refining prompts for each different llm but haven’t gotten there.
My favorite personal use of Gemini right now is basically as a book club. Of course it's not as good as my real one, but I often can't get them to read the books I want, and Gemini is always ready when I want to explore themes. It's often more profound than the book club too, and seems a bit less likely to tunnel vision. Before LLMs I found exploring book themes pretty tedious; often I would have to wait a while to find someone who had read the book, but now I can get into it as soon as I'm done reading.
I can throw a pile of NDAs at it and it neatly pulls out relevant stuff from them within a few seconds. The huge context window and excellent needle in a haystack performance is great for this kind of task.
The NIAH performance is a misleading indicator for performance on the tasks people really want the long context for. It's great as a smoke/regression test. If you're bad on NIAH, you're not gonna do well on the more holistic evals.
But the long context eval they used (MRCR) is limited. It's multi-needle, so that's a start, but it's not evaluating long-range dependency resolution nor topic modeling, which are the things you actually care about beyond raw retrieval for downstream tasks. Better than nothing, but not great for just throwing a pile of text at it and hoping for the best. Particularly for out-of-distribution token sequences.
I do give google some credit though, they didn't try to hide how poorly they did on that eval. But there's a reason you don't see them adding RULER, HELMET, or LongProc to this. The performance is abysmal after ~32k.
EDIT: I still love using 2.5 Pro for a ton of different tasks. I just tend to have all my custom agents compress the context aggressively for any long context or long horizon tasks.
Huh. We've not seen this in real-world use. 2.5 pro has been the only model where you can throw a bunch of docs into it, give it a "template" document (report, proposal, etc), even some other-project-example stuff, and tell it to gather all relevant context from each file and produce "template", and it does surprisingly well. Couldn't reproduce this with any other top tier model, at this level of quality.
We're a G-suite shop so I set aside a ton of time trying to get 2.5 Pro to work for us. I'm not entirely unhappy with it, it's a highly capable model, but the long context implosion significantly limits it for the majority of task domains.
We have long context evals using internal data that are leveraged for this (modeled after longproc specifically) and the performance across the board is pretty bad. Task-wise for us, it's about as real world as it gets, using production data. Summarization, Q&A, coding, reasoning, etc.
But I think this is where the in-distribution vs out-of-distribution distinction really carries weight. If the model has seen more instances of your token sequences in training and thus has more stable semantic representations of them in latent space, it would make sense that it would perform better on average.
In my case, the public evals align very closely with performance on internal enterprise data. They both tank pretty hard. Notably, this is true for all models after a certain context cliff. The flagship frontier models predictably do the best.
MRCR does go significantly beyond multi-needle retrieval - that's why the performance drops off as a function of context length. It's still a very simple task (reproduce the i^th essay about rocks), but it's very much not solved.
Sure. I didn't imply (or didn't mean to imply at least) that I thought MRCR was solved, only pointing out that it's closer to testing raw retrieval than it is testing long range dependency resolution like Longproc does. If retrieval is great but the model still implodes on the downstream task, the benchmark doesn't tell you the whole story. The intent/point of my original comment was that even the frontier models are nowhere near as good at long context tasks than what I see anecdotally claimed about them in the wild.
> The other evals you mention are not necessarily harder than this relatively simple one.
If you're comparing MRCR to, for example, Longproc, I do think the latter is much harder. Or at least, much more applicable to long-horizon task domains where long context accumulates over time. But I think it's probably more accurate to say it's a more holistic, granular eval by comparison.
The tasks require the model to synthesize and reason over information that is scattered throughout the input context and across previously generated output segments. Additionally, the required output is lengthy (up to 8K tokens) and must adhere to a specific, structured format. The scoring is also more flexible than MRCR: you can use row-level F1 scores for tables, execution-based checks for code, or exact matches for formatted traces.
Just like NIAH, I don't think MRCR should be thrown out wholesale. I just don't think it can be pressed into the service of representing a more realistic long context performance measure.
EDIT: also wanted to note that using both types of evals in tandem is very useful for research and training/finetuning. If Longproc tanks and you don't have the NIAH/MRCR context, it's hard to know which capabilities are regressing. So using both in a hybrid eval approach is valuable in certain contexts. For end users only trying to gauge the current inference-time performance, I think evals like RULER and Longproc have a much higher value.
Right, the way I see it, MRCR isn't a retrieval task in the same vein as RULER. It’s less about finding one (or multiple) specific facts and more about piecing together scattered information to figure out the ordering of a set of relevant keys. Of course, it’s still a fairly simple challenge in the grand scheme of things.
LongProc looks like a fantastic test for a different but related problem, getting models to generate long answers. It seems to measure a skill the others don't. Meanwhile, RULER feels even more artificial than MRCR, since it's almost entirely focused on that simple "find the fact" skill.
But I think you're spot-on with the main takeaway, and the best frontier models are still struggling with long context. The DeepMind team points this out in the paper with that Pokemon example and the MRCR evaluation scores themselves.
I’ve found the 2.5 pro to be pretty insane at math. Having a lot of fun doing math that normally I wouldn’t be able to touch. I’ve always been good at math, but it’s one of those things where you have to do a LOT of learning to do anything. Being able to breeze through topics I don’t know with the help of AI and a good CAS + sympy and Mathematica verification lets me chew on problems I have no right to be even thinking about considering my mathematical background. (I did minor in math.. but the kinds of problems I’m chewing on are things people spend lifetimes working on. That I can even poke at the edges of them thanks to Gemini is really neat.)
Gemini Flash 2.0 is an absolute workhorse of a model at extremely low cost. It's obviously not going to measure up to frontier models in terms of intelligence but the combination of low cost, extreme speed, and highly reliable structured output generation make it really pleasant to develop with. I'll probably test against 2.5 Lite for an upgrade here.
We use it by having a Large Model delegate to Flash 2.0. Let's say you have a big collection of objects and a SOTA model identifies the need to edit some properties of one of them. Rather than have the Large Model perform a tool call or structured output itself (potentially slow/costly at scale), it can create a small summary of the context and change needed.
You can then provide this to Flash 2.0 and have it generate the full object or diffed object in a safe way using the OpenAPI schema that Gemini accepts. The controlled generation is quite powerful, especially if you create the schema dynamically. You can generate an arbitrarily complex object with full typing, restrict valid values by enum, etc. And it's super fast and cheap and easily parallelizable. Have 100 objects to edit? No problem, send 100 simultaneous Flash 2.0 calls. It's Google, they can handle it.
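A rough sketch of that delegation pattern, assuming the google-genai Python SDK (the object shape and prompts are made up for illustration):

```python
import asyncio
import json

from pydantic import BaseModel
from google import genai
from google.genai import types

# Hypothetical object shape; in practice you'd build the schema dynamically.
class Widget(BaseModel):
    name: str
    color: str
    priority: int

client = genai.Client()  # assumes GEMINI_API_KEY is set

async def apply_edit(summary: str) -> Widget:
    # `summary` is the small context + change description produced by the large model.
    resp = await client.aio.models.generate_content(
        model="gemini-2.0-flash",
        contents=summary,
        config=types.GenerateContentConfig(
            response_mime_type="application/json",
            response_schema=Widget,  # constrained generation against this schema
        ),
    )
    return Widget(**json.loads(resp.text))

async def main():
    summaries = [
        f"Current widget: name='w{i}', color='red', priority=1. "
        f"Change requested: set priority to {i % 3}, keep everything else."
        for i in range(100)
    ]
    # 100 edits fan out as 100 parallel Flash calls.
    widgets = await asyncio.gather(*(apply_edit(s) for s in summaries))
    print(widgets[0])

asyncio.run(main())
```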
I use it extensively for https://lexikon.ai - in particular one part of what Lexikon does involves processing large amounts of images, and the way Google charges for vision is vastly cheaper compared to the big alternatives (OpenAI, Anthropic)
I mean I've copy pasted conversations and emails into ChatGPT as well, it often gives good advice on tricky problems (essentially like your own personalized r/AmITheAsshole chat). This service seems to just automate that process.
I use Gemini 2.5 Flash (non thinking) as a thought partner. It helps me organize my thoughts or maybe even give some new input I didn't think of before.
I really like to use it also for self reflection where I just input my thoughts and maybe concerns and just see what it has to say.
It basically made a university physics exam for me. It almost one-shot it as well. Just uploaded some exams from previous years together with a latex template and told it to make me a similar one. Worked great. Also made it do the solutions.
It's very good at automatically segmenting and recognizing handwritten and badly scanned text. I use it to make spreadsheets out of handwritten petitions.
Web scraping - creating semi-structured data from a wide variety of horrific HTML soups.
Absolutely do swap out models sometimes, but Gemini 2.0 Flash is the right price/performance mix for me right now. Will test Gemini 2.5 Flash-Lite tomorrow though.
I've yet to run out of free image gen credits with Gemini, so I use it for any low-effort image gen like when my kids want to play with it or for testing prompts before committing my o4 tokens for better quality results.
Yes, we implemented a separate service internally that interfaces with an LLM and so the callers can be agnostic as to what provider or model is being used. Haven't needed to load balance between models though.
I had a great-ish result from 2.5 Pro the other day. I asked it to date an old photograph, and it successfully read the partial headline on a newspaper in the background (which I had initially thought was too small/blurry to make out) and identified the 1980s event it was reporting. Impressive. But then it confidently hallucinated the date of the article (which I later verified by checking in an archive).
I run a batch inference/LLM data processing service and we do a lot of work around cost and performance profiling of (open-weight) models.
One odd disconnect that still exists in LLM pricing is the fact that providers charge linearly with respect to token consumption, but compute costs grow quadratically with sequence length (self-attention scales with the square of the context).
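A toy illustration of the disconnect; the numbers are purely illustrative, not a real cost model:

```python
# Billing scales linearly with tokens, while attention compute scales
# roughly with the square of the sequence length.
def billed_dollars(tokens: int, price_per_million: float = 0.30) -> float:
    return tokens / 1e6 * price_per_million   # linear in tokens

def attention_cost(tokens: int) -> float:
    return float(tokens) ** 2                 # ~quadratic in sequence length

base = 8_000
for n in (8_000, 32_000, 128_000):
    print(
        f"{n:>7} tokens: bill x{billed_dollars(n) / billed_dollars(base):.0f}, "
        f"attention x{attention_cost(n) / attention_cost(base):.0f}"
    )
# 8k -> 32k: the bill goes up 4x, the attention work goes up 16x.
```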
At this point, since a lot of models have converged around the same model architecture, inference algorithms, and hardware - the chosen costs are likely due to a historical, statistical analysis of the shape of customer requests. In other words, I'm not surprised to see costs increase as providers gather more data about real-world user consumption patterns.
Considering moving from Groq Llama 3.3 70b to Gemini 2.5 Flash Lite for one of my use cases. Results are coming in great, and it's very fast (important for my real-time user perception needs).
What kind of rate limits do these new Gemini models have?
I'm using it from their HTTP API. I can't remember what the limits were initially, tbh; I had to reach out through backchannels to get them increased to 300,000 tokens per minute.
It feels to me like properly instrumented, these diffusion models are going to be really powerful coding tools. Imagine a “smart” model carving out a certain number of tokens in a response for each category of response output, then diffusing the categories.
Gemini 2.5 doesn’t get enough credit for the quality of its writing in non-code (eg law) topics. It’s definitely a notch below Claude 4, but well ahead of ChatGPT 4o, 4.5, o3.
For anyone who was expecting more news: the GA models benchmark basically the same as the last preview models. It's really just Google telling us that we get fewer API errors and that this model will have a checkpoint that sticks around for a longer time.
I'm glad that they standardized pricing for the thinking vs non-thinking variant. A couple weeks ago I accidentally spent thousands of extra dollars by forgetting to set the thinking budget to zero. Forgetting a single config parameter should not automatically raise the model cost 5X.
[edit] I'm less excited about this because it looks like their solution was to dramatically raise the base price on the non-thinking variant.
I switched to 2.5 Flash (non-think) for most of my projects because it was such a good model with good pricing.
Cost is an important factor, so I'm hoping that Flash-Lite is sufficient, even though it's sometimes more than 50% worse in relevant benchmarks, which sucks.
I was also just looking at 4.1-mini, but that's more expensive and often scores around the same as Flash-Lite in benchmarks (except coding, which I don't care about).
Crazy to think that even after this move by Google, OpenAI is still the worse option for me, at least regarding the API. Outside the API, I'm actually using ChatGPT (o3/o4-mini; 4o is a joke) a lot more again lately, after 2.5 Pro got nerfed.
I really wish all the AI companies would down tools on all development until they work out file downloads, ftp, sftp, git ANY way to access the files other than copy paste and “download file”.
The workflow is crushingly tedious.
And no I don’t want to use an AI IDE or some other tool. I like the UI of Gemini chat and AI Studio and I want them improved.
Blended price (assuming 3:1 for input:output tokens) is 3.24x of what was stated before [1], and now nearly 5x of 2.0 Flash. Makes 2.0 Flash a still competitive option for many use-cases, particularly ones that aren't coding-heavy I think. A slightly poorer performing model can net perform better through multiple prompt passes. Bummer, was hoping 2.5 Flash would be a slam dunk choice.
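For anyone checking the arithmetic, here it is with the preview/GA/2.0 prices quoted elsewhere in the thread (non-thinking output for the old preview):

```python
def blended(input_price: float, output_price: float, ratio: int = 3) -> float:
    """Blended $/M tokens, assuming `ratio` input tokens per output token."""
    return (ratio * input_price + output_price) / (ratio + 1)

old_preview = blended(0.15, 0.60)   # $0.2625/M (2.5 Flash Preview, non-thinking)
new_ga      = blended(0.30, 2.50)   # $0.85/M   (2.5 Flash GA)
flash_2_0   = blended(0.10, 0.40)   # $0.175/M  (2.0 Flash)

print(round(new_ga / old_preview, 2))  # ~3.24x
print(round(new_ga / flash_2_0, 2))    # ~4.86x, i.e. "nearly 5x"
```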
I have about 500,000 news articles I am parsing. OpenAI models work well but found Gemini had fewer mistakes.
Problem is: they give me a terrible 10k RPD limit. To increase to the next tier, they then require a minimum amount of spending, but I can't reach that amount even when maxing the RPD limit for multiple days in a row.
I emailed them twice and completed their forms but everyone knows how this works. So now I'm back at OpenAI, with a model with a bit more mistakes but that won't 403 me after half an hour of using it due to their limits.
The rate limits apply only to the Gemini API. There is also Vertex from GCP, which offers the same models (and even more, such as Claude) at the same pricing, but with much higher rate limits (basically none, as long as they don't need to cut anyone off to protect provisioned-throughput customers, iiuc) and with a process to get guaranteed throughput.
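If you're already on the google-genai SDK, switching to Vertex is roughly a one-line change (project and location below are placeholders, and billing then goes through GCP):

```python
from google import genai

# Same SDK, pointed at Vertex AI instead of the Gemini API.
client = genai.Client(
    vertexai=True,
    project="my-gcp-project",   # placeholder
    location="us-central1",     # placeholder
)

resp = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Extract the publication date from this article: ...",
)
print(resp.text)
```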
I have a huge background.js file from a now-removed browser extension that the devs made into a single line. Around 800KB in a single-line file, I think...
I tried a lot of free tools to refactor it, but they all lose the context window quickly.
Anyone else unable to access 2.5-pro via api? I'm currently getting "Publisher Model `projects/349775993245/locations/us-west4/publishers/google/models/gemini-2.5-pro` was not found or your project does not have access to it. Please ensure you are using a valid model version."
Not sure where else to post this, but when attempting to use any of the Gemini 2.5 models via API, I receive an "empty content" response about 50% of the time. To be clear, the API responds successfully, but the `content` returned by the LLM is just an empty string.
Has anyone here had any luck working around this problem?
What finish reason are you getting? Perhaps your code sets a low max_tokens, so the generation stops while the model is still thinking, without giving any actual output.
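Something like this should show where the tokens are going (google-genai SDK; model and prompt are just examples):

```python
from google import genai
from google.genai import types

client = genai.Client()  # assumes GEMINI_API_KEY is set

resp = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Summarize this paragraph: ...",
    config=types.GenerateContentConfig(
        # Thinking tokens count against this limit, so a small value can be
        # exhausted before any visible text is produced.
        max_output_tokens=8192,
    ),
)

print(resp.candidates[0].finish_reason)  # MAX_TOKENS here would explain the empty text
print(resp.usage_metadata)               # token accounting for the request
print(repr(resp.text))
```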
The finish reason is `length`. I have tried setting minimal token budgets, really small prompts, and max lengths of various sizes from 100-4000 and nothing seems to make a consistent dent in the behavioral pattern.
2.5 Flash Lite seems better at everything compared to 2.0 Flash Lite, with the only exception being SimpleQA, so there is probably a small tradeoff of pop-culture knowledge for coding, math, science, reasoning and multimodal tasks.
Classic bait-and-switch to make developers build things on top of models for 2 months, and then raise the input price by 2x and output by 4x. But hey, it's Google, wouldn't expect anything else from an advertising company.
I am always disappointed when I compare the answers to the same queries on 2.5 Pro vs. o4-mini/o3. But trying out the same query in AI Studio gives much better results, closer to OpenAI's models.
What is wrong with 2.5 Pro in the Gemini app? I can't believe that the model in their consumer app would produce the same benchmark results as 2.5 Pro in the API or AI Studio.
The models in the Gemini app are nerfed in comparison to those in AI Studio: they have less thinking budget, output fewer tokens, and have various safety filters. There’s certainly a trade-off between using AI Studio for its better performance and using the API or the Gemini app in a way that doesn’t involve Google keeping your data for training purposes.
I don't have any inside information, but I'm sure there are different system prompts used in the Gemini chat interface vs the API. On OpenAI/ChatGPT they're sometimes dramatically different.
been testing gemini flash lite. latency is good, responses land under 400ms most times. useful for low-effort rewrites or boilerplate filler. quality isn’t stable though: context drifts after 4-5 turns, especially with anything recursive or structured. tried tagging it into prompt chains but fallback logic ends up too aggressive. good for assist, not for logic, wouldn't anchor anything serious on it yet
I tried using the three new models to transcribe the audio of this morning's Gemini Twitter Space.
I got very strong results from 2.5 Pro and 2.5 Flash, but 2.5 Flash Lite sadly got stuck in a loop until it ran out of output tokens:
> Um, like, what did the cows bring to you? Nothing. And then, um, and then, uh, and then, uh, and then, uh, and then, uh, and then, uh, and then, uh, and then, uh, and...
If everyone searches through Gemini, then there are no click conversions. Without click conversions there is no incentive for many websites to make new content. Without content there is nothing new to learn for Gemini. It's an Ouroboros problem.
So you mean only people who want to post something for the fun of posting it will post and we won't get as much corporate crap/SEO? Darn.
Of course practically this means people who post will have a reason to post and that reason might be to influence Gemini, so double darn.
I said a majority. Say Stack Overflow, Medium articles, online newspapers: they live off of click conversions. The internet is more than people just posting for fun. Of course everyone only posting for fun would be ideal.
But even if you only post for fun, if most people only find your content by seeing it re-hashed by an LLM that hallucinates some other stuff in the middle, I'd probably also lose the fun in posting for posting's sake.
For me, I actually feel much more motivated to write good and accurate documentation knowing there will be at least one reader who is going to look at it very closely and will attempt to synthesize useful information from it.
Same with my old open-source projects, it's kinda cool knowing that all the old stuff that nobody would have ever looked at anymore is now part of a humanity-wide corpus of useful knowledge on how to do X with language Y.
Yeah, a lot of the web is at-risk if searchers just read LLM summaries and don't click through, but we will have Skynet before that becomes a real issue and then this will all be irrelevant.
Stack overflow makes a profit off click conversions. It does not exist because of it. People would have built a version regardless of the profit margin involved.
Running the site costs money. So you either run ads or make users pay. Either way, usage of LLMs for searching would decrease the income for these sites.
They don't mention it in the post, but it looks like this includes a price increase for the Gemini 2.5 Flash model.
For 2.5 Flash Preview https://web.archive.org/web/20250616024644/https://ai.google...
$0.15/million input text / image / video
$1.00/million audio
Output: $0.60/million non-thinking, $3.50/million thinking
The new prices for Gemini 2.5 Flash ditch the difference between thinking and non-thinking and are now: https://ai.google.dev/gemini-api/docs/pricing
$0.30/million input text / image / video (2x more)
$1.00/million audio (same)
$2.50/million output - significantly more than the old non-thinking price, less than the old thinking price.
The blog post has more info about the pricing changes
https://developers.googleblog.com/en/gemini-2-5-thinking-mod...
The real news is that non-thinking output is now 4x more expensive, which they of course carefully avoid mentioning in the blog, only comparing the thinking prices.
How cute they are with their phrasing:
> $2.50 / 1M output tokens (*down from $3.50 output)
Which should be "up from $0.60 (non-thinking)/down from $3.50 (thinking)"
I have LLM fatigue, so I'm not paying attention to headlines... but LLMs are thinking now? That used to be a goal post. "AI can't do {x} because it's not thinking." Now it's part of a pricing chart?
How did I miss this?
"Thinking" means spamming a bunch of stream-of-consciousness bs before it actually generates the final answer. It's kind of like the old trick of prompting to "think step by step". Seeding the context full of relevant questions and concepts improves the quality of the final generation, even though it's rarely a direct conclusion of the so-called thinking before it.
"Thinking" really just means "write on some scratch paper" for llms.
Is it possible to get non-thinking only now, though? If not, why would that matter, since it's irrelevant?
Yes, by setting the thinking budget to 0. Which is very common when a task doesn't need thinking.
In addition, it's also relevant because for the last 3 months people have built things on top of this.
To be fair, the point of preview models and stable releases is so you know what is stable to build on.
The moment you start charging for preview stuff I think you give a tacit agreement that you can expect the price to not increase by a factor of 4.
that’s a somewhat naïve viewpoint.
I think the fact that everyone is like ‘wtf’ now kind of reinforces my viewpoint?
Doesn’t mean you can’t do it, but people won’t be happy.
Gmail was in beta for what, 2 decades? Did you never use it during that time? They've been using these "Preview" models in their non-technical, user-facing Gemini app and product for months now. Like, Google themselves have been using them in production, on their main apps. And gemini-1.5-pro is 2 months from deprecation and there was no production alternative.
They told everyone to build their stuff on top of it, and then jacked up the price by 4x. Just pointing to some fine print doesn't change that.
I'd be more worried about Google just discontinuing another product. For example Stadia was similarly high profile, but it's gone now.
More examples here: https://killedbygoogle.com/
interesting - why wouldn't you use dynamic thinking? and yeah, sucks when the price changes.
It makes responses much slower with zero benefit for many tasks. Flash with thinking off is very fast.
one example where non-thinking matters would be latency-sensitive workflows, for example voice AI.
Correct, though pretty much anything end-user facing is latency-sensitive; voice is a tiny percentage. No one likes waiting, and the involvement of an LLM doesn't change this from a user PoV.
I wonder if you can hide the latency, especially for voice?
What I have in mind is to start the voice response with a non-thinking model, say a sentence or two in a fraction of a second. That will take the voice model a few seconds to read out. In that time, you use a thinking model to start working on the next part of the response?
In a sense, very similar to how everyone knows to stall in an interview by starting with 'this is a very good question...', and using that time to think some more.
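A rough sketch of that two-model trick, assuming the google-genai SDK and the GA model names (prompts are illustrative):

```python
import asyncio

from google import genai
from google.genai import types

client = genai.Client()  # assumes GEMINI_API_KEY is set

async def respond(user_text: str):
    # Kick off the slow, thinking answer immediately in the background...
    deep = asyncio.create_task(client.aio.models.generate_content(
        model="gemini-2.5-pro",
        contents=f"Answer thoroughly: {user_text}",
    ))
    # ...and cover its latency with a fast, non-thinking opener.
    opener = await client.aio.models.generate_content(
        model="gemini-2.5-flash",
        contents=f"In one short sentence, acknowledge this question and say you'll walk through it: {user_text}",
        config=types.GenerateContentConfig(
            thinking_config=types.ThinkingConfig(thinking_budget=0),
        ),
    )
    yield opener.text            # hand this to TTS right away
    yield (await deep).text      # the full answer lands while the opener is being spoken

async def main():
    async for chunk in respond("Why does my sourdough starter smell like acetone?"):
        print(chunk)

asyncio.run(main())
```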
They seem to have just rebranded the non-thinking model as Flash-Lite, so it's less expensive than before
Not at all. Non-thinking Flash is... Flash with the thinking budget set to 0 (which you can still run that way, just at 2x input / 4x output pricing). Flash-Lite is far weaker, unusable for the overwhelming majority of use cases of Flash. A quick glance at the benchmarks reveals this.
Yeah, so basically their announcement is "good news, we tripled the price, and will deprecate Gemini Flash 2.0 asap"
The OP says Flash-Lite has thinking and non-thinking, so it’s not that simple.
> Today we are excited to share updates …
They are obviously excited about their price increase
“While we strive to maintain consistent pricing between preview and stable releases to minimize disruption, this is a specific adjustment reflecting Flash’s exceptional value, still offering the best cost-per-intelligence available.”
Anthropic did the same thing with their Haiku model when they released version 3.5. I hate it.
Pricing is a hard problem. Theoretically, if companies occasionally raise prices dramatically once something is useful, they sometimes can create early demand and more testers for future product releases. Ofc they have to be careful to avoid annoying regular users too much. When you sell the harm is limited to late users, but when you rent it is harder to figure out the optimal strategy.
Do you work for google?
"Soon, AI too cheap to meter" "Meantime, price go up".
Not too long ago Google was a bit of a joke in AI and their offerings were uncompetitive. For a while a lot of their preview/beta models had a price of 0.00. They were literally giving it away for free to try to get people to consider their offerings when building solutions.
As they've become legitimately competitive they have moved towards the pricing of their competitors.
There are a lot more price drops, though.
From prices that were already losing services money.
If you aren't making a profit, lowering prices is only about trying to capture market share before you're forced to increase prices to remain solvent.
"will be too cheap to meter" means we're definitely metering it now.
Just google. They were behind. So they just dumped their prices to get a foot in the door. Now they are popular and can raise it to market prices.
I still don’t think there’s any real stickiness to using a Google model over any other model, with things like openrouter. So maybe for brand recognition alone.
Yeah, but brands have some stickiness. Maybe not for the absolute nerds, but lots of people just stick to what they are already using. Look at all the people just using ChatGPT because that is what they tried first.
No way. AI pricing is going up because people are willing to pay for it.
We have likely seen the cheapest prices already. Once we can’t function without them anymore - go as high as you can!
By then comparable or even better models will easily run on edge.
So if they crank up the prices we could just switch to local and not get lured by bigger and bigger models, rag, Agentic, MCP driven tech as if all of that couldn't run locally either.
I am not as optimistic that locally run models will be able to compete anytime soon. And even if they could, the price to run them means you have to buy compute/gear for a price that is likely equivalent to a lot of 'remote' tokens.
> By then comparable or even better models will easily run on edge.
What are you basing that on?
Presumably your goal is to extract some practical value from this and not just higher benchmark numbers. If you can get the functionality you need from last-gen, there's no point in paying for next-gen. YMMV.
Indeed, the premise was that we would be a step behind when running on edge. That's already the case.
The most meaningful models will run in the future on those trillion-dollar data centers that are currently being built.
Hopefully we get more competition and someone willing to undercut the more expensive options
It's more likely the shareholder zeitgeist will soon shift to demanding returns on the ungodly amounts already invested into AI.
Entering the market and being competitive gets more difficult all the time. People want the best and fastest models - can you compete with trillion dollar datacenters?
You might be right, but there's plenty of deep pocketed companies who are still very excited to compete in this market.
You know that competition is a thing, don't you?
I knew they were undercutting on price a lot, because at first launch Gemini's pricing didn't make sense, seeing it cheaper than the competition (like, a lot cheaper).
Finally we're starting to see the real price.
A cool 2x+ price increase.
And Gemini 2.0 Flash was $0.10/$0.40.
1.5 -> 2.0 was a price increase as well (double, I think, and something like 4x for image input)
Now 2.0 -> 2.5 is another hefty price increase.
4x price increase over preview output for non-thinking.
FWIW: On OpenRouter, the non `:thinking` 2.5 flash endpoint seems to be returning reasoning tokens now.
Good catch, that's a pretty notable change considering this was about to be the GOAT of audio-to-audio
You can also see this difference in open router.
But why is there only thinking flash now?
It might be a bit confusing, but there's no "only thinking flash" - it's a single model, and you can turn off thinking if you set thinking budget to 0 in the API request. Previously 2.5 Flash Preview was much cheaper with the thinking budget set to 0, now the price is the same. Of course, with thinking enabled the model will still use far more output tokens than the non-thinking mode.
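For reference, disabling thinking is a single config field; a minimal sketch with the google-genai Python SDK:

```python
from google import genai
from google.genai import types

client = genai.Client()  # assumes GEMINI_API_KEY is set

# Same gemini-2.5-flash model, thinking disabled via the budget. At the GA
# pricing both modes cost the same per token, but this one stays fast and
# emits far fewer output tokens.
resp = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Classify the sentiment of: 'the battery died after a week'",
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(thinking_budget=0),
    ),
)
print(resp.text)
```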
Interesting design choice, and makes me think of "Thinking, Fast and Slow" by Kahneman.
(I thought of it quickly, not slowly, so the comparison may only be surface deep.)
Apparently, you can make a request to 2.5 Flash not to use thinking, but it will still sometimes do it anyway; this has been an issue for months and hasn't been fixed by model updates: https://github.com/google-gemini/cookbook/issues/722
At one point, when they made Gemini Pro free on AI Studio, Gemini was the model of choice for many people, I believe.
Somehow it's gotten worse since then, and I'm back to using Claude for serious work.
Gemini is like that guy who keeps talking but has no idea what he's actually talking about.
I still use Gemini for brainstorming, though I take its suggestions with several grains of salt. It's also useful for generating prompts that I can then refine and use with Claude.
Not according to the Aider leaderboard: https://aider.chat/docs/leaderboards/
I use only the APIs directly with Aider (so no experience with AI Studio).
My feeling with Claude is that it still performs well with weak prompts; the "taste" is maybe a little better when the direction is kind of unknown to the prompter.
When the direction is known I see Gemini 2.5 Pro (with thinking) on top of Claude with code which does not break. And with o4-mini and o3 I see more "smart" thinking (as if there is a little bit of brain inside these models) at the expense of producing unstable code (Gemini produces more stable code).
I see problems with Claude when complexity increases and I would put it behind Gemini and o3 in my personal ranking.
So far I had no reason to go back to Claude since o3-mini was released.
I just spent $35 for Opus to solve a problem with a hardware side-project (I'm turning an old rotary phone into a meeting handset so I can quit meetings by hanging up, if you must know). It didn't solve the problem, it churned and churned and spent a ton of money.
I was much more satisfied with o3 and Aider, I haven't tried them on this specific problem but I did quite a bit of work on the same project with them last night. I think I'm being a bit unfair, because what Claude got stuck on seems to be a hard problem, but I don't like how they'll happily consume all my money trying the same things over and over, and never say "yeah I give up".
For basically that same price you could get one of these :-)
https://www.amazon.com/Cell2jack-Cellphone-Adapter-Receive-l...
Where's the fun in that?!
Enjoy yourself! Don’t let me spoil your fun :-)
Oh I'm not! I'll post it here when I'm done, it's already hilarious.
wait, you're using a rotary phone ?
I want to!
Give them feedback.
Feedback on what?
When I obtain results from one paid model that are significantly better than what I previously got from another paid model, I'll typically give a thumbs-down to the latter and point out in the comment that it was beaten by a competitor. Can't hurt.
Ah, this wasn't from the web interface, I was using Claude Code. I don't think it has a feedback mechanism.
Using all of the popular coding models pretty extensively over the past year, I've been having great success with Gemini 2.5 Pro as far as getting working code the first time, instruction following around architectural decisions, and staying on-task. I use Aider and write mostly Python, JS, and shell scripts. I've spent hundreds of dollars on the Claude API over time but have switched almost entirely to Gemini. The API itself is also much more reliable.
My only complaint about 2.5 Pro is around the inane comments it leaves in the code (// Deleted varName here).
If you use one of the AI static-instructions methods (e.g., .github/copilot-instructions.md) and tell it not to leave the useless comments, that seems to solve the issue.
I've been intending to try some side by side tests with and without a conventions file instructing it not to leave stupid comments—I'm curious to see if somehow they're providing value to the model, e.g. in multi-turn edits.
it's easier to just make it do a code review with focus on removing unhelpful comments instead of asking it not to do it the first time. I do the cleanup after major rounds of work and that strategy seems to work best for me.
This was not my experience with the earlier preview (03), where its insistence on comment spam was too strong to overcome. Wonder if this adherence improved in the 05 or 06 updates.
can you elaborate on this?
I don't mind the comments, I read them while removing them. It's normal to have to adapt the output, change some variable names, refactor a bit. What's impressive is that the output code actually works (or almost). I didn't give it the hardest of problems to solve/code but certainly not easy ones.
Yeah I've mostly just embraced having to remove them as part of a code review, helps focus the review process a bit, really.
I'm using Pro for backend and Claude for UX work. Claude is so much more thoughtful about how users interact with software, and it can usually do a better job of replicating the mockups that the gpt4o image generator produces, while not being overly fixated on the mockup design itself.
My complaint is that it catches Python exceptions and doesn't log them by default.
And the error handling. God, does it love to insert random try/except statements everywhere.
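A minimal sketch of the pattern I mean, next to what I'd rather see (the config-loading example is made up):

```python
import json
import logging

logger = logging.getLogger(__name__)

# The pattern the model loves to emit: the exception is swallowed silently.
def load_config_silently(path: str) -> dict:
    try:
        with open(path) as f:
            return json.load(f)
    except Exception:
        return {}  # no log, no re-raise; the failure becomes invisible

# What I'd rather see: log the error (or re-raise) so failures stay visible.
def load_config(path: str) -> dict:
    try:
        with open(path) as f:
            return json.load(f)
    except (OSError, json.JSONDecodeError):
        logger.exception("failed to load config from %s", path)
        raise
```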
Your feelings of a little brain in there, and of stable code, are unfounded. All these models collapse pretty fast, if not due to the context limit, then due to their inability to interpret problems.
An LLM is just statistical regression with a plethora of engineering tricks, mostly NLP, to produce an illusion.
I don't mean it's useless. I mean comparing these ever-evolving models is like comparing escort staff in NYC vs. those in L.A.: hard to reach any conclusion. We are getting fooled.
On the price increase: it seems Google was aggressively looking for adoption, and Gemini was for a short window the best value for money of all the LLMs out there. Adoption likely surged, the scaling needs must be astronomical, and keeping up is costing Google billions. The price adjustment could have been expected before they announced it.
Yea, I had similar experiences. At first it felt like it solved complex problems really well, but then I realized I was having trouble steering it for simple things. It was also very verbose.
Overall though my primary concern is the UX, and Claude Code is the UX of choice for me currently.
Check out zen MCP server https://github.com/BeehiveInnovations/zen-mcp-server Lets you use Gemini and OpenAI models in Claude Code.
Ooh this seems nice. Most similar solutions monkeypatch the npm package, which is a bit icky
Same experience here. I even built a Gem with an elaborate prompt instructing it how to be concise, but it still gives annoyingly long-winded responses and frequently expands the scope of its answer far beyond the prompt.
I feel like this is part of the AI playbook now. Launch a really strong, capable model (expensive to run inference on), and once users think it's SOTA, neuter it so it's cheaper to serve, and most users won't notice.
The same happened with GPT-3.5. It was so good early on and got worse as OpenAI began to cut costs. I feel like when GPT-4.1 was cloaked as Optimus on Openrouter, it was really good, but once it launched, it also got worse.
That is capitalism's playbook all along. It's just much faster here because it's just software. But they do it for everything, all the time.
I disagree with the comparison between LLM behavior and traditional software getting worse. When regular software declines in quality, it’s usually noticeable through UI changes, release notes, or other signals. Companies often don’t bother hiding it, since their users are typically locked into their ecosystem.
LLMs, on the other hand, operate under different incentives. It’s in a company’s best interest to initially release the strongest model, top the benchmarks, and then quietly degrade performance over time. Unlike traditional software, LLMs have low switching costs, users can easily jump to a better alternative. That makes it more tempting for companies to conceal model downgrades to prevent user churn.
> When regular software declines in quality, it’s usually noticeable through UI changes, release notes, or other signals.
Counterexample: 99% of average Joes have no idea how incredibly enshittified Google Maps has become, to just name one app. These companies intentionally boil the frog very slowly, and most people are incredibly bad at noticing gradual changes (see global warming).
Sure, they could know by comparing, but you could also know whether models are changing behind the scenes by having sets of evals.
This is where switching costs matter. Take Google Maps, many people can’t switch to another app. In some areas, it’s the only app with accurate data, so Google can degrade the experience without losing users.
We can tell it’s getting worse because of UI changes, slower load times, and more ads. The signs are visible.
With LLMs, it’s different. There are no clear cues when quality drops. If responses seem off, users often blame their own prompts. That makes it easier for companies to quietly lower performance.
That said, many of us on HN use LLMs mainly for coding, so we can tell when things get worse.
Both cases involve the “boiling frog” effect, but with LLMs, users can easily jump to another pot. With traditional software, switching is much harder.
Do you mind explaining how you see this working as a nefarious plot? I don't see an upside in this case so I'm going with the old "never ascribe to malice" etc
I have no inside information but feels like they quantized it. I've seen patterns that I usually only see in quantized models like getting stuck repeating a single character indefinitely
They should just roll back to the preview versions. Those were so much more even keeled and actually did some useful pushback instead of this cheerleader-on-steroids version they GA'd.
Yes I was very surprised after the whole "scandal" around ChatGPT becoming too sycophantic that there was this massive change in tone from the last preview model (05-06) to the 06-05/GA model. The tone is really off-putting, I really liked how the preview versions felt like intelligent conversation partners and recognize what you're saying about useful pushback - it was my favorite set of models (the few preview iterations before this one) and I'm sad to see them disappearing.
Many people on the Google AI Developer forums have also noted either bugs or just performance regression in the final model.
But they claim it's the same model and version?
I don't know but it sure doesn't feel the same. I have been using Gemini 2.5 pro (preview and now GA) for a while. The difference in tone is palpable. I also noticed that the preview took longer time and the GA is faster so it could be quantization.
Maybe a bunch of people with authority to decide thought that it was too slow/expensive/boring and screwed up a nice thing.
They made it talk like buzzfeed articles for every single interaction. It's absolutely horrible
I found Gemini now terrible for coding. I gave it my code blocks and told it what to change, and it added tonnes and tonnes of needless extra code plus endless comments. It turned tight code into a papyrus scroll.
ChatGPT is better but tends to be too agreeable, never trying to disagree with what you say even if it's stupid so you end up shooting yourself in the foot.
Claude seems like the best compromise.
Just my two kopecks.
Used to be able to use Gemini Pro free in Cline. Now the API limits are so low that you immediately get messages about needing to top up your wallet, and API queries just don't go through. Back to using DeepSeek R1 free in Cline (though even that eventually stops after a few hours and you have to wait until the next day for it to work again). Starting to look like I need to set up a local LLM for coding, which means it's time to seriously upgrade my PC (well, it's been about 10 years, so it was getting to be time anyway).
By the time you breakeven on whatever you spend on a decent LLM capable build, your hardware will be too far behind to run whatever is best locally then. It's something that feels cheaper but with the pace of things, unless you are churning an insane amount of tokens, probably doesn't make sense. Never mind that local models running on 24 or 48GB are maybe around flash-lite in ability while being slower than SOTA models.
Local models are mostly for hobby and privacy, not really efficiency.
When I ask it to do something in Cursor, it goes full Sherlock, thinking about every possible outcome.
Claude 4 Sonnet with thinking just has a quick think and then does it.
Same for me. I've been using Gemini 2.5 Pro for the past week or so because people said Gemini is the best for coding! Not at all my experience with Gemini 2.5 Pro: on top of being slow and flaky, the responses are kind of bad. Claude Sonnet 4 is much better IMO.
The context window on AI Studio feels endless.
All other AIs seem to give me errors when working with large bodies of code.
They nerfed Pro 2.5 significantly in the last few months. Early this year, I had genuinely insightful conversations with Gemini 2.5 Pro. Now they are mostly frustrating.
I also have a personal conspiracy theory, i.e., that once a user exceeds a certain use threshold of 2.5 Pro in the Google Gemini app, they start serving a quantized version. Of course, I have no proof, but it certainly feels that way.
Maybe they've been focusing so much on improving coding performance with RL for the new versions/previews that other areas degraded in performance
I think you are right and this is probably the case.
Although, given that I rapidly went from +4 to 0 karma, a few other comments in this topic are grey, and at least one is missing, I am getting suspicious. (Or maybe it is just lunch time in MTV.)
There was a significant nerf of the Gemini 2.5 Pro 03-25 checkpoint a little while ago, so much so that I detected it without knowing there was even a new release.
Totally convinced they quantized the model quietly and improved on the coding benchmark to hide that fact.
I'm frankly quite tired of LLM providers changing the model I'm paying for access to behind the scenes, often without informing me, and in Gemini's case on the API too: at least last time I checked, they had updated the 03-25 checkpoint to the May update.
I wonder how smart they are about quantizing. Do they look at feedback to decide which users won't mind?
One of the early updates improved agentic coding scores while lowering other general benchmark scores, which may have impacted those kind of conversations.
I am very impressed with Gemini and stopped using OpenAI. Sometimes, I ping all three major models on OpenRouter but 90% is on Gemini now. Compare that to 90% ChatGPT last year.
I love to hate on google, but yeah their models are really good. The larger context window is huge
Doesn't OpenAI's GPT 4.1 also have 1 million context length?
Same. For now I have canceled my claude subscription. Gemini has been catching up.
Also me. I still pay for OpenAI; I use GPT-4 for Excel work, and it's super fast and able to do more Excel-related tasks, like combining files, that come up often for the projects I work on.
I don't like the thinking time, but for coding, journaling, and other stuff I've often been impressed with Gemini Pro 2.5 out of the box.
Possibly I could do much more prompt fine-tuning to nudge openai/anthropic in the direction I want, but with the same prompts Gemini often gives me answers/structure/tone I like much better.
Example: I had Claude 3.7 generating embedded images and captions along with responses. The same prompt in Gemini gave much more varied and flavorful pictures.
Curious. How do you use gemini for journaling? What is your workflow?
Love to see it, this takes Flash Lite from "don't bother" territory for writing code to potentially useful. (Besides being inexpensive, Flash Lite is fast -- almost always sub-second, to as low as 200ms. Median around 400ms IME.)
Brokk (https://brokk.ai/) currently uses Flash 2.0 (non-Lite) for Quick Edits, we'll evaluate 2.5 Lite now.
ETA: I don't have a use case for a thinking model that is dumber than Flash 2.5, since thinking negates the big speed advantage of small models. Curious what other people use that for.
To me, if it thinks fast enough, I don't care how much thinking it does.
Curious to hear what folks are doing with Gemini outside of the coding space and why you chose it. Are you building your app so you can swap the underlying GenAI easily? Do you "load balance" your usage across other providers for redundancy or cost savings? What would happen if there was ever some kind of spot market for LLMs?
In my experience, Gemini 2.5 Pro really shines in some non-coding use cases such as translation and summarization via Canvas. The gigantic context window and large usage limits help in this regard.
I also believe Gemini is much better than ChatGPT in generating deep research reports. Google has an edge in web search and it shows. Gemini’s reports draw on a vast number of sources, thus tend to be more accurate. In general, I even prefer its writing style, and I like the possibility of exporting reports to Google Docs.
One thing that I don’t like about Gemini is its UI, which is miles behind the competition. Custom instructions, projects, temporary chats… these things either have no equivalent in Gemini or are underdeveloped.
If you're a power user, you should probably be using Gemini through AI studio rather than the "basic user" version. That allows you to set system instructions, temperature, structured output, etc. There's also NotebookLM. Google seems to be trying to make a bunch of side projects based on Gemini and seeing what sticks, and the generic gemini app/webchat is just one of those.
My complaint is that any data within AI Studio can be kept by Google and used for training purposes — even if using the paid tier of the API, as far as I know. Because of that, I end up only using it rarely, when I don’t care about the fate of the data.
This is only true for the free tier. Paid AI Studio users have strong privacy protections.
Can you elaborate on “paid” ? Because I honestly still have no idea if my usage of AI Studio is used for training purposes.
I have google workspace business standard, which comes with some pro AI features. Eg, Gemini chat clearly shows “Pro”, and says something like “chats in your organization won’t be used for training”. On AI Studio it’s not clear at all. I do have some version of paid AI services through Google, but no idea if it applies to AI studio. I did create some dummy Google cloud project which allowed me to generate api key, but afaik I still haven’t authorized any billing method.
Thank you for clarifying that. I’ve researched this once again and confirmed that Google treats all AI Studio usage as private if there’s at least one API project with billing enabled in an account.
for translation you'll still be limited for longer texts by the 65K output limit though I suppose?
Yes. I haven't had problems with the output limit so far, as I do translations iteratively, over each section of longer texts.
What I like the most about translating with Gemini is that its default performance is already good enough, and it can be improved via the one million tokens of the context window. I load to the context my private databases of idiomatic translations, separated by language pairs and subject areas. After doing that, the need for manually reviewing Gemini translations is greatly diminished.
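As a rough sketch of what I mean (the file layout and names are just my own convention), the context assembly is nothing more than concatenating the right glossary in front of the text:

```python
from pathlib import Path

# Hypothetical layout: glossaries/<source>-<target>/<subject>.txt,
# each file holding "source phrase -> preferred translation" lines.
def build_translation_prompt(text: str, src: str, tgt: str, subject: str) -> str:
    glossary = Path(f"glossaries/{src}-{tgt}/{subject}.txt").read_text(encoding="utf-8")
    return (
        f"Translate the following text from {src} to {tgt}.\n"
        "Follow the idiomatic choices in this glossary whenever they apply:\n\n"
        f"{glossary}\n\n"
        f"Text to translate:\n{text}"
    )
```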
I tried swapping for my project which involves having the LLM summarize and critique medical research and didn’t have great results. The prompt I found works best with the main LLM I use fucks up the intended format when fed to other LLMs. Thinking about refining prompts for each different llm but haven’t gotten there.
My favorite personal use of Gemini right now is basically as a book club. Of course it's not as good as my real one, but I often can't get them to read the books I want, and Gemini is always ready when I want to explore themes. It's often more profound than the book club too, and seems a bit less likely to tunnel-vision. Before LLMs I found exploring book themes pretty tedious; often I would have to wait a while to find someone who had read the book, but now I can get into it as soon as I'm done reading.
I can throw a pile of NDAs at it and it neatly pulls out relevant stuff from them within a few seconds. The huge context window and excellent needle in a haystack performance is great for this kind of task.
The NIAH performance is a misleading indicator for performance on the tasks people really want the long context for. It's great as a smoke/regression test. If you're bad on NIAH, you're not gonna do well on the more holistic evals.
But the long context eval they used (MRCR) is limited. It's multi-needle, so that's a start, but it's not evaluating long-range dependency resolution nor topic modeling, which are the things you actually care about beyond raw retrieval for downstream tasks. Better than nothing, but not great for just throwing a pile of text at it and hoping for the best. Particularly for out-of-distribution token sequences.
I do give google some credit though, they didn't try to hide how poorly they did on that eval. But there's a reason you don't see them adding RULER, HELMET, or LongProc to this. The performance is abysmal after ~32k.
EDIT: I still love using 2.5 Pro for a ton of different tasks. I just tend to have all my custom agents compress the context aggressively for any long context or long horizon tasks.
> The performance is abysmal after ~32k.
Huh. We've not seen this in real-world use. 2.5 pro has been the only model where you can throw a bunch of docs into it, give it a "template" document (report, proposal, etc), even some other-project-example stuff, and tell it to gather all relevant context from each file and produce "template", and it does surprisingly well. Couldn't reproduce this with any other top tier model, at this level of quality.
We're a G-suite shop, so I set aside a ton of time trying to get 2.5 Pro to work for us. I'm not entirely unhappy with it, it's a highly capable model, but the long context implosion significantly limits it for the majority of task domains.
We have long context evals using internal data that are leveraged for this (modeled after longproc specifically) and the performance across the board is pretty bad. Task-wise for us, it's about as real world as it gets, using production data. Summarization, Q&A, coding, reasoning, etc.
But I think this is where the in-distribution vs out-of-distribution distinction really carries weight. If the model has seen more instances of your token sequences in training and thus has more stable semantic representations of them in latent space, it would make sense that it would perform better on average.
In my case, the public evals align very closely with performance on internal enterprise data. They both tank pretty hard. Notably, this is true for all models after a certain context cliff. The flagship frontier models predictably do the best.
MRCR does go significantly beyond multi-needle retrieval - that's why the performance drops off as a function of context length. It's still a very simple task (reproduce the i^th essay about rocks), but it's very much not solved.
See contextarena.ai and the original paper https://arxiv.org/abs/2409.12640
It also seems to match up well with evals like https://fiction.live/stories/Fiction-liveBench-Feb-21-2025/o...
The other evals you mention are not necessarily harder than this relatively simple one..
Sure. I didn't imply (or didn't mean to imply, at least) that I thought MRCR was solved; I was only pointing out that it's closer to testing raw retrieval than it is to testing long-range dependency resolution like LongProc does. If retrieval is great but the model still implodes on the downstream task, the benchmark doesn't tell you the whole story. The point of my original comment was that even the frontier models are nowhere near as good at long context tasks as what I see anecdotally claimed about them in the wild.
> The other evals you mention are not necessarily harder than this relatively simple one.
If you're comparing MRCR to, for example, LongProc, I do think the latter is much harder. Or at least, much more applicable to long-horizon task domains where long context accumulates over time. But I think it's probably more accurate to say it's a more holistic, granular eval by comparison.
The tasks require the model to synthesize and reason over information that is scattered throughout the input context and across previously generated output segments. Additionally, the required output is lengthy (up to 8K tokens) and must adhere to a specific, structured format. The scoring is also more flexible than MRCR: you can use row-level F1 scores for tables, execution-based checks for code, or exact matches for formatted traces.
Just like NIAH, I don't think MRCR should be thrown out wholesale. I just don't think it can be pressed into the service of representing a more realistic long context performance measure.
EDIT: also wanted to note that using both types of evals in tandem is very useful for research and training/finetuning. If LongProc tanks and you don't have the NIAH/MRCR context, it's hard to know what capabilities are regressing. So using both in a hybrid eval approach is valuable in certain contexts. For end users only trying to gauge the current inference-time performance, I think evals like RULER and LongProc have a much higher value.
Right, the way I see it, MRCR isn't a retrieval task in the same vein as RULER. It’s less about finding one (or multiple) specific facts and more about piecing together scattered information to figure out the ordering of a set of relevant keys. Of course, it’s still a fairly simple challenge in the grand scheme of things.
LongProc looks like a fantastic test for a different but related problem, getting models to generate long answers. It seems to measure a skill the others don't. Meanwhile, RULER feels even more artificial than MRCR, since it's almost entirely focused on that simple "find the fact" skill.
But I think you're spot-on with the main takeaway, and the best frontier models are still struggling with long context. The DeepMind team points this out in the paper with that Pokemon example and the MRCR evaluation scores themselves.
I’ve found the 2.5 pro to be pretty insane at math. Having a lot of fun doing math that normally I wouldn’t be able to touch. I’ve always been good at math, but it’s one of those things where you have to do a LOT of learning to do anything. Being able to breeze through topics I don’t know with the help of AI and a good CAS + sympy and Mathematica verification lets me chew on problems I have no right to be even thinking about considering my mathematical background. (I did minor in math.. but the kinds of problems I’m chewing on are things people spend lifetimes working on. That I can even poke at the edges of them thanks to Gemini is really neat.)
Gemini Flash 2.0 is an absolute workhorse of a model at extremely low cost. It's obviously not going to measure up to frontier models in terms of intelligence but the combination of low cost, extreme speed, and highly reliable structured output generation make it really pleasant to develop with. I'll probably test against 2.5 Lite for an upgrade here.
I want to know what use cases you're using it for, if it's not confidential.
We use it by having a Large Model delegate to Flash 2.0. Let's say you have a big collection of objects and a SOTA model identifies the need to edit some properties of one of them. Rather than have the Large Model perform a tool call or structured output itself (potentially slow/costly at scale), it can create a small summary of the context and change needed.
You can then provide this to Flash 2.0 and have it generate the full object, or a diffed object, in a safe way using the OpenAPI schema that Gemini accepts. The controlled generation is quite powerful, especially if you create the schema dynamically. You can generate an arbitrarily complex object with full typing, restrict valid values by enum, etc. And it's super fast, cheap, and easily parallelizable. Have 100 objects to edit? No problem, send 100 simultaneous Flash 2.0 calls. It's Google, they can handle it.
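A rough sketch of the shape of this, as I understand the public generateContent REST API (the schema, summary text, and env var name are made up; the fan-out is plain client-side threading):

```python
import concurrent.futures
import json
import os
import requests

URL = ("https://generativelanguage.googleapis.com/v1beta/"
       "models/gemini-2.0-flash:generateContent")

# Schema built dynamically per object type; an enum restricts valid values.
def edit_schema(allowed_statuses: list[str]) -> dict:
    return {
        "type": "OBJECT",
        "properties": {
            "id": {"type": "STRING"},
            "status": {"type": "STRING", "enum": allowed_statuses},
            "notes": {"type": "STRING"},
        },
        "required": ["id", "status"],
    }

def flash_edit(summary: str, schema: dict) -> dict:
    payload = {
        "contents": [{"parts": [{"text": summary}]}],
        "generationConfig": {
            "responseMimeType": "application/json",
            "responseSchema": schema,
        },
    }
    r = requests.post(URL, params={"key": os.environ["GEMINI_API_KEY"]},
                      json=payload, timeout=30)
    r.raise_for_status()
    # The constrained JSON comes back as text in the first candidate.
    text = r.json()["candidates"][0]["content"]["parts"][0]["text"]
    return json.loads(text)

# The big model writes one short change summary per object; fan out to Flash.
summaries = ["Set object 42 to 'archived'; leave the notes field unchanged."]
schema = edit_schema(["active", "archived", "deleted"])
with concurrent.futures.ThreadPoolExecutor(max_workers=20) as pool:
    edited = list(pool.map(lambda s: flash_edit(s, schema), summaries))
```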
I use it extensively for https://lexikon.ai - in particular one part of what Lexikon does involves processing large amounts of images, and the way Google charges for vision is vastly cheaper compared to the big alternatives (OpenAI, Anthropic)
Wow, if I knew that someone was using your product on my conversation with them I'd probably have to block them.
I mean I've copy pasted conversations and emails into ChatGPT as well, it often gives good advice on tricky problems (essentially like your own personalized r/AmITheAsshole chat). This service seems to just automate that process.
I use Gemini 2.5 Flash (non thinking) as a thought partner. It helps me organize my thoughts or maybe even give some new input I didn't think of before.
I really like to use it also for self reflection where I just input my thoughts and maybe concerns and just see what it has to say.
It basically made a university physics exam for me. It almost one-shot it as well. Just uploaded some exams from previous years together with a latex template and told it to make me a similar one. Worked great. Also made it do the solutions.
Simple unstructured to structured data transformation.
I find Flash and Flash Lite are more consistent than others as well as being really fast and cheap.
I could swap to other providers fairly easily, but don't intend to at this point. I don't operate at a large scale.
I use it for https://toolong.link Youtube summaries with images because only Gemini has easy access to YouTube and it has a gigantic context window
It's very good at automatically segmenting and recognizing handwritten and badly scanned text. I use it to make spreadsheets out of handwritten petitions.
Turning local real estate agents' websites into RSS to get new properties on the market before they get uploaded to real estate marketplace platforms.
I give it the HTML, it finds the appropriate selector for the property item, and then I use an HTML-to-RSS tool to publish the feed.
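Roughly what that looks like (selector, field choices, and names are invented for illustration; in practice the selector comes from asking Gemini once per site):

```python
import xml.etree.ElementTree as ET
from bs4 import BeautifulSoup

# Suppose Gemini has already looked at the page and suggested this selector
# (I only re-ask when extraction starts returning nothing).
LISTING_SELECTOR = "div.property-card"   # hypothetical

def listings_to_rss(html: str, site_url: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = "New listings"
    ET.SubElement(channel, "link").text = site_url
    for card in soup.select(LISTING_SELECTOR):
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = card.get_text(" ", strip=True)[:120]
        link = card.find("a")
        if link and link.get("href"):
            ET.SubElement(item, "link").text = link["href"]
    return ET.tostring(rss, encoding="unicode")
```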
Web scraping - creating semi-structured data from a wide variety of horrific HTML soups.
Absolutely do swap out models sometimes, but Gemini 2.0 Flash is the right price/performance mix for me right now. Will test Gemini 2.5 Flash-Lite tomorrow though.
I've yet to run out of free image gen credits with Gemini, so I use it for any low-effort image gen like when my kids want to play with it or for testing prompts before committing my o4 tokens for better quality results.
Yes, we implemented a separate service internally that interfaces with an LLM and so the callers can be agnostic as to what provider or model is being used. Haven't needed to load balance between models though.
Low-latency LLM for my home automation. Anecdotally, Gemini was much quicker than OpenAI in responding to simple commands.
In general, when I need "cheap and fast" I choose Gemini.
A roughly 6.7x increase in the price of audio processing compared to 2.0 Flash-Lite:
Gemini 2.5 Flash Lite (Audio Input) - $0.5/million tokens
Gemini 2.0 Flash Lite (Audio Input) - $0.075/million tokens
Wonder what led to such a high bump in Audio token processing
I had a great-ish result from 2.5 Pro the other day. I asked it to date an old photograph, and it successfully read the partial headline on a newspaper in the background (which I had initially thought was too small/blurry to make out) and identified the 1980s event it was reporting. Impressive. But then it confidently hallucinated the date of the article (which I later verified by checking in an archive).
I run a batch inference/LLM data processing service and we do a lot of work around cost and performance profiling of (open-weight) models.
One odd disconnect that still exists in LLM pricing is the fact that providers charge linearly with respect to token consumption, but costs are actually quadratic with an increase in sequence length.
At this point, since a lot of models have converged around the same model architecture, inference algorithms, and hardware - the chosen costs are likely due to a historical, statistical analysis of the shape of customer requests. In other words, I'm not surprised to see costs increase as providers gather more data about real-world user consumption patterns.
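A toy illustration of the mismatch, looking only at the self-attention term (so ignoring the linear MLP cost and any caching tricks):

```python
# Attention compute scales roughly with n^2 in sequence length n; billing scales with n.
def relative_cost(n_tokens: int, base: int = 8_000) -> tuple[float, float]:
    billed = n_tokens / base              # what the provider charges
    attention = (n_tokens / base) ** 2    # what self-attention roughly costs
    return billed, attention

for n in (8_000, 32_000, 128_000):
    billed, attn = relative_cost(n)
    print(f"{n:>7} tokens: billed {billed:4.0f}x, attention compute ~{attn:6.0f}x")
# A 128k-token request is billed at 16x the 8k price, but the attention term is ~256x.
```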
Aren't advances in KV caching making compute cost not quite quadratic?
Considering moving from Groq Llama 3.3 70b to Gemini 2.5 Flash Lite for one of my use cases. Results are coming in great, and it's very fast (important for my real-time user perception needs).
What kind of rate limits do these new Gemini models have?
Are you using Groq Llama 3.3 70b from something like cline? Is it free and what are the API query limits?
I'm using it from their HTTP API. Limits I can't remember what they were initially tbh, I had to reach out through backchannels to get it increased to 300,000 tokens per minute.
Wishing they release the Gemini Diffusion model. It'll quickly replace the default model for Aider.
It feels to me like properly instrumented, these diffusion models are going to be really powerful coding tools. Imagine a “smart” model carving out a certain number of tokens in a response for each category of response output, then diffusing the categories.
Why do you think so? I've played with the Diffusion model a bit and it makes a lot of mistakes
Gemini 2.5 doesn’t get enough credit for the quality of its writing in non-code (eg law) topics. It’s definitely a notch below Claude 4, but well ahead of ChatGPT 4o, 4.5, o3.
For anyone who was expecting more news: the GA models benchmark basically the same as the last preview models. It's really just Google telling us that we'll get fewer API errors and that this model will have a checkpoint for a longer time.
I'm glad that they standardized pricing for the thinking vs non-thinking variant. A couple weeks ago I accidentally spent thousands of extra dollars by forgetting to set the thinking budget to zero. Forgetting a single config parameter should not automatically raise the model cost 5X.
[edit] I'm less excited about this because it looks like their solution was to dramatically raise the base price on the non-thinking variant.
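For reference, the parameter in question is the thinking budget in the generation config. A minimal sketch of how I set it via the REST API (field names per my reading of the public docs; model name and prompt are just examples):

```python
import os
import requests

url = ("https://generativelanguage.googleapis.com/v1beta/"
       "models/gemini-2.5-flash:generateContent")
payload = {
    "contents": [{"parts": [{"text": "Summarize: ..."}]}],
    "generationConfig": {
        # Forgetting this block used to silently put you on thinking pricing.
        "thinkingConfig": {"thinkingBudget": 0},
    },
}
resp = requests.post(url, params={"key": os.environ["GEMINI_API_KEY"]},
                     json=payload, timeout=30)
print(resp.json()["candidates"][0]["content"]["parts"][0]["text"])
```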
I switched to 2.5 Flash (non-think) for most of my projects because it was such a good model with good pricing.
Cost is an important factor, so I'm hoping that Flash-Lite is sufficient, even though it's sometimes more than 50% worse in the relevant benchmarks, which sucks.
I was also just looking at 4.1-mini, but that's more expensive and often scores around the same as Flash-Lite in benchmarks (except coding, which I don't care about).
Crazy to think that even after this move by Google, OpenAI is still the worse option for me, at least regarding the API. Other than the API, I'm actually using ChatGPT (o3/o4-mini; 4o is a joke) a lot more again lately, after 2.5 Pro got nerfed.
I really wish all the AI companies would down tools on all development until they work out file downloads, ftp, sftp, git ANY way to access the files other than copy paste and “download file”.
The workflow is crushingly tedious.
And no I don’t want to use an AI IDE or some other tool. I like the UI of Gemini chat and AI Studio and I want them improved.
Blended price (assuming 3:1 for input:output tokens) is 3.24x of what was stated before [1], and now nearly 5x of 2.0 Flash. Makes 2.0 Flash a still competitive option for many use-cases, particularly ones that aren't coding-heavy I think. A slightly poorer performing model can net perform better through multiple prompt passes. Bummer, was hoping 2.5 Flash would be a slam dunk choice.
[1] - https://web.archive.org/web/20250616024644/https://ai.google...
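The arithmetic behind those multiples, using the per-million-token prices as I understand them (preview 2.5 Flash non-thinking $0.15 in / $0.60 out, GA $0.30 / $2.50, 2.0 Flash $0.10 / $0.40; treat the exact figures as assumptions and check them against [1]):

```python
def blended(inp: float, out: float, ratio: float = 3.0) -> float:
    # Weighted average price per 1M tokens at a 3:1 input:output mix.
    return (ratio * inp + out) / (ratio + 1)

preview_25 = blended(0.15, 0.60)   # ~0.26
ga_25      = blended(0.30, 2.50)   # ~0.85
flash_20   = blended(0.10, 0.40)   # ~0.175

print(ga_25 / preview_25)  # ~3.24x what was stated before
print(ga_25 / flash_20)    # ~4.86x, i.e. nearly 5x 2.0 Flash
```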
Good luck using 2.5 for anything non-trivial.
I have about 500,000 news articles I am parsing. OpenAI models work well, but I found Gemini had fewer mistakes.
Problem is, they give me a terrible 10k RPD limit. To increase to the next tier, they require a minimum amount of spending, but I can't reach that amount even when maxing out the RPD limit for multiple days in a row.
I emailed them twice and completed their forms, but everyone knows how this works. So now I'm back at OpenAI, with a model that makes a few more mistakes but won't 403 me after half an hour of use due to their limits.
The rate limits apply only to the Gemini API. There is also Vertex from GCP, which offers the same models (and even more, such as Claude) at the same pricing, but with much higher rate limits (basically none, as long as they don't need to cut anyone off with provisioned throughput iiuc) and with a process to get guaranteed throughput.
Had no idea... always thought Vertex was just a way to do enterprise offering!
I'm guessing now that it is GA this won't be a problem.
I wish! The tier-based limits are still the same!
At least it's more expensive now so I guess I will be able to hop to the next tier sooner? ¯\_(ツ)_/¯
It's a bummer that 2.5 Pro is still removed from the free tier of the API.
I have a huge background.js file from a now-removed browser extension that the devs turned into a single line. Around 800KB on a single line, I think.
I tried many free tools to refactor it, but they all lose the context window quickly.
There are myriad non-LLM tools that can deobfuscate and prettify JS. I used them with success long before LLMs were en vogue.
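For example, one non-LLM route, assuming the jsbeautifier Python package (prettier or your editor's formatter would do the same job):

```python
import jsbeautifier

opts = jsbeautifier.default_options()
opts.indent_size = 2

with open("background.js", encoding="utf-8") as f:
    pretty = jsbeautifier.beautify(f.read(), opts)

with open("background.pretty.js", "w", encoding="utf-8") as f:
    f.write(pretty)
```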
Link?
Which extension is it?
Gstzen: peaceful compliance. Not in the store right now.
Anyone else unable to access 2.5-pro via api? I'm currently getting "Publisher Model `projects/349775993245/locations/us-west4/publishers/google/models/gemini-2.5-pro` was not found or your project does not have access to it. Please ensure you are using a valid model version."
Not sure where else to post this, but when attempting to use any of the Gemini 2.5 models via API, I receive an "empty content" response about 50% of the time. To be clear, the API responds successfully, but the `content` returned by the LLM is just an empty string.
Has anyone here had any luck working around this problem?
What finish reason are you getting? Perhaps your code sets a low max_tokens, so the generation stops while the model is still thinking, without giving any actual output.
The finish reason is `length`. I have tried setting minimal token budgets, really small prompts, and max lengths of various sizes from 100-4000 and nothing seems to make a consistent dent in the behavioral pattern.
This can happen if the prompt or response is blocked by a safety filter. Check some of the other fields in the response.
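When debugging this, the fields worth dumping from the raw response (names per my reading of the REST response shape) are roughly:

```python
def explain_empty_response(resp: dict) -> None:
    # resp is the parsed JSON returned by generateContent.
    feedback = resp.get("promptFeedback", {})
    if feedback.get("blockReason"):
        print("prompt blocked:", feedback["blockReason"])
    for cand in resp.get("candidates", []):
        print("finishReason:", cand.get("finishReason"))
        print("safetyRatings:", cand.get("safetyRatings"))
        parts = cand.get("content", {}).get("parts", [])
        print("parts returned:", len(parts))
```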
2.5 Flash Lite seems better at everything compared to 2.0 Flash Lite, with the only exception being SimpleQA, so there is probably a small tradeoff of pop culture knowledge for coding, math, science, reasoning, and multimodal tasks.
Classic bait-and-switch: get developers to build things on top of the models for two months, then raise the input price by 2x and the output price by 4x. But hey, it's Google, wouldn't expect anything else from an advertising company.
I am always disappointed when I compare the answers to the same queries on 2.5 Pro vs. o4-mini/o3. But trying out the same query in AI Studio gives much better results, closer to OpenAI's models. What is wrong with 2.5 Pro in the Gemini app? I can't believe that the model in their consumer app would produce the same benchmark results as 2.5 Pro in the API or AI Studio.
The models in the Gemini app are nerfed in comparison to those in AI Studio: they have a smaller thinking budget, output fewer tokens, and have various safety filters. There's certainly a trade-off between using AI Studio for its better performance and using the API or the Gemini app in a way that doesn't involve Google keeping your data for training purposes.
I don't have any inside information, but I'm sure there are different system prompts used in the Gemini chat interface vs the API. On OpenAI/ChatGPT they're sometimes dramatically different.
Gemini strangely says you cannot upload all sorts of file types.
But it accepts them just fine if you upload a zip file... which you can only do in AI Studio.
been testing gemini flash lite. latency is good, responses land under 400ms most times. useful for low-effort rewrites or boilerplate filler. quality isn't stable though: context drifts after 4-5 turns, especially with anything recursive or structured. tried tagging it into prompt chains but fallback logic ends up too aggressive. good for assist, not for logic, wouldn't anchor anything serious on it yet
Is there a Codex/Claude Code competitor on the way?
Jules?
I dream of a day when LLM naming follows a convention.
I need an AI model to be able to keep track of the AI model names.
I cancelled ChatGPT early this year and switched to Gemini. With Gemini making progress rapidly, I wonder if OpenAI has already lost the battle.
I tried using the three new models to transcribe the audio of this morning's Gemini Twitter Space.
I got very strong results from 2.5 Pro and 2.5 Flash, but 2.5 Flash Lite sadly got stuck in a loop until it ran out of output tokens:
> Um, like, what did the cows bring to you? Nothing. And then, um, and then, uh, and then, uh, and then, uh, and then, uh, and then, uh, and then, uh, and then, uh, and...
Notes on my results (including the transcripts which worked, which included timestamps and guessed speaker names) here: https://simonwillison.net/2025/Jun/17/gemini-2-5/#transcribi...
I mean, the model names are always a bit odd, but flash-lite is particularly good!
kinda sounds like something else
Which are what?
[flagged]
in a very real sense, Gemini (etc) _are_ fixed search.
If everyone searches through Gemini, then there are no click conversions. Without click conversions there is no incentive for many websites to make new content. Without content there is nothing new to learn for Gemini. It's an Ouroboros problem.
So you mean only people who want to post something for the fun of posting it will post and we won't get as much corporate crap/SEO? Darn. Of course practically this means people who post will have a reason to post and that reason might be to influence Gemini, so double darn.
I said a majority. Say stack overflow, medium articles, online newspapers, they live off of click conversion. The internet is more than people just posting for fun. Of course everyone only posting for fun would be ideal.
But even if you only post for fun, if most people only find your content by seeing it re-hashed by an LLM that hallucinates some other stuff in the middle, I'd probably also lose the fun in posting for posting's sake.
For me, I actually feel much more motivated to write good and accurate documentation knowing there will be at least one reader who is going to look at it very closely and will attempt to synthesize useful information from it.
Same with my old open-source projects, it's kinda cool knowing that all the old stuff that nobody would have ever looked at anymore is now part of a humanity-wide corpus of useful knowledge on how to do X with language Y.
Yeah, a lot of the web is at-risk if searchers just read LLM summaries and don't click through, but we will have Skynet before that becomes a real issue and then this will all be irrelevant.
Fortunately it was trained on reddit comments.
Oh no.
Stack overflow makes a profit off click conversions. It does not exist because of it. People would have built a version regardless of the profit margin involved.
Running the site costs money. So you either run ads or make users pay. Either way, usage of LLMs for searching would decrease the income for these sites.
You seem to be assuming that there will be no paid ads in Gemini output. It's hard for me to believe that is going to be true in five years.
Nobody cares about search anymore
Because it sucked and was filled with SEO spam generated by AI.
No, because AI is more concise, accurate and chat-able.