lairv 21 hours ago

Out of curiosity, I gave it the latest Project Euler problem, published on 11/16/2025 and very likely not in its training data

Gemini thought for 5m10s before giving me a Python snippet that produced the correct answer. The leaderboard says the 3 fastest humans to solve this problem took 14min, 20min and 1h14min respectively

Even though I expect this sort of problem to be very much in the distribution of what the model has been RL-tuned on, it's wild that frontier models can now solve in minutes what would take me days

  • thomasahle 20 hours ago

    I also used Gemini 3 Pro Preview. It finished in 271s = 4m31s.

    Sadly, the answer was wrong.

    It also returned 8 "sources", like stackexchange.com, youtube.com, mpmath.org, ncert.nic.in, and kangaroo.org.pk, even though I specifically told it not to use websearch.

    Still a useful tool though. It definitely gets the majority of the insights.

    Prompt: https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%...

    • nomilk 7 hours ago

      Terence Tao claims [0] contributions by the general public are counter-productive, since the energy required to check a contribution outweighs its benefit:

      > (for) most research projects, it would not help to have input from the general public. In fact, it would just be time-consuming, because error checking

      Since frontier LLMs make clumsy mistakes, they may fall into this category of 'error-prone' mathematician whose net contributions are actually negative, despite being impressive some of the time.

      [0] https://www.youtube.com/watch?v=HUkBz-cdB-k&t=2h59m33s

      • red75prime 4 hours ago

        Unlike the general public, the models can be trained. I mean, if you train a member of the general public, you've got a specialist, who is no longer a member of the general public.

        • jacquesm 18 minutes ago

          Unlike the general public though, these models have advanced dementia when it comes to learning from corrections, even within a single session. They keep regressing and I haven't found a way to stop that yet.

          What boggles the mind: we strove for correctness for so long, and suddenly being right 70% of the time and wrong the remaining 30% is fine. The parallel with self-driving is pretty strong here: solving 70% of the cases is easy, the remaining 30% are hard or maybe even impossible. Statistically speaking these models do better than most humans, most of the time. But they do not do better than all humans, they can't do it all of the time, and when they get it wrong they make such tremendously basic mistakes that you have to wonder how they manage to get things right.

          Maybe it's true that with ever increasing model size and more and more data (proprietary data; the public sources are exhausted by now, so private data is the frontier where model owners can still gain an edge) we will reach a point where the models are right 98% of the time or more, but the killer feature for me would be an indication of the confidence level of the output. Because no matter whether it's junk or pearls, it all looks the same, and that is more dangerous than having nothing at all.

      • jmmcd 2 hours ago

        But he actually uses frontier LLMs in his own work. That's probably stronger evidence.

    • pennaMan 16 hours ago

      > It also returned 8 "sources"

      Well, there's your problem. It behaves like a search summary tool and not like a problem solver if you enable Google Search.

      • factsaresacred 15 hours ago

        Exactly this - and it's how ChatGPT behaves too. After a few conversations with search enabled you figure this out, but they really ought to make the distinction clearer.

    • JBiserkov 20 hours ago

      The requested prompt does not exist or you do not have access. If you believe the request is correct, make sure you have first allowed AI Studio access to your Google Drive, and then ask the owner to share the prompt with you.

      • junon 20 hours ago

        I thought this was a joke at first. It actually needs drive access to run someone else's prompt. Wild.

        • jacquesm 17 minutes ago

          Not a chance I'll ever click 'ok'. I'd love to be able to opt-out of anything AI related near my google environment.

        • ashdksnndck 19 hours ago

          On iOS safari, it just says “Allow access to Google Drive to load this Prompt”. When I run into that UI, my first instinct is that the poster of the link is trying to phish me. That they’ve composed some kind of script that wants to read my Google Drive so it can send info back to them. I’m only going to click “allow” if I trust the sender with my data. IMO, if that’s not what is happening, this is awful product design.

        • cavisne 10 hours ago

          After ChatGPT accidentally indexed everyone's shared chats (and had a cache collision in their chat history early on), and Meta built a UI flow that filled a public feed with super private chats... it seems like a good move to use a battle-tested permission system.

        • dormento 19 hours ago

          Imagine the metrics though. "This quarter we've had a 12% increase in people using AI solutions in their Google Drive."

          • hedora 11 hours ago

            Google Drive is one of the bigger offenders in GSuite when it comes to "metrics-driven, user-hostile changes", and Google Meet is right up there with it.

            • edoceo 11 hours ago

              In The Wire they asked Bunny to "juke the stats" - and he was having none of that.

        • tart-lemonade 12 hours ago

          Not really, that's just basic access control. If you've used Colab or Cloud Shell (or even just Google Cloud in general, given the need to explicitly allow the usage of each service), it's not surprising at all.

    • TechDebtDevin 15 hours ago

      Why is this sad? You should be rooting for these LLMs to be as bad as possible.

      • windexh8er 13 hours ago

        If we've learned anything so far it's that the parlor tricks of one-shot efficacy only gets you so far. Drill into anything relatively complex with a few hundred thousand tokens of context and the models all start to fall apart roughly the same. Even when I've used Sonnet 4.5 with 1M token context the model starts to flake out and get confused with a codebase of less than 10k LoC. Everyone seems to keep claiming these huge leaps and bounds, but I really have to wonder how many of these are just shilling for their corporate overlord. I asked Gemini 3 to solve a simple, yet not well documented problem in Home Assistant this evening. All it would take is 3-5 lines of YAML. The model failed miserably. I think we're all still safe.

        • jorvi 12 hours ago

          Same. I've been needing to update a userscript (JS) that takes stuff like "3 for the price of 1", "5 + 1 free", "35% discount!" from a particular site and then converts the price to a % discount and a price per item / 250 grams.

          It's an old userscript, so it is glitchy and only halfway works. I already pre-chewed the work by telling Gemini 3 exactly which new HTML elements it needs to match and which contents it needs to parse. So basically, the scaffolding is already there, the sources are already there; it just needs to put everything in place.

          It fails miserably and produces very convincing-looking but failing code. Even letting it iterate multiple times does nothing, nor does nudging it in the correct direction. Mind you, JavaScript is probably the most trained-on language together with Python, and parsing HTML is one of the most common use cases.

          Another hilarious example is MPV, which has very well-documented settings. I used to think that LLMs would mean you can just tell people to ask Gemini how to configure it, but 9 out of 10 times it will hallucinate a bunch of parameters that never existed.

          It gives me an extremely weird feeling when other people are cheering that it is solving problems at superhuman speeds or that it coded a way to ingest their custom XML format in record time, with relatively little prompting. It seems almost impossible that LLMs can both be so bad and so good at the same time, so what gives?
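
          For what it's worth, the core of what it has to produce is only a little parsing logic. A rough Python sketch of that logic (the promo phrases, regexes and prices are made-up examples, not the site's actual markup, and the real userscript is JS):

              import re

              def promo_to_discount(text):
                  """Turn promo copy like '3 for the price of 1' or '35% discount!' into a fraction off."""
                  text = text.lower()
                  m = re.search(r"(\d+)\s*for the price of\s*(\d+)", text)
                  if m:
                      get, pay = int(m.group(1)), int(m.group(2))
                      return 1 - pay / get            # "3 for the price of 1" -> ~0.667
                  m = re.search(r"(\d+)\s*\+\s*(\d+)\s*free", text)
                  if m:
                      paid, free = int(m.group(1)), int(m.group(2))
                      return free / (paid + free)     # "5 + 1 free" -> ~0.167
                  m = re.search(r"(\d+(?:[.,]\d+)?)\s*%\s*discount", text)
                  if m:
                      return float(m.group(1).replace(",", ".")) / 100
                  return None                         # no recognizable promo

              def price_per_250g(price, grams, discount):
                  """Effective price per 250 g after applying the parsed discount."""
                  return price * (1 - (discount or 0.0)) / grams * 250

              print(promo_to_discount("3 for the price of 1"))  # 0.666...
              print(price_per_250g(4.50, 500, 0.35))            # 1.4625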

          • marcus_holmes 9 hours ago

            1. Coding with LLMs seems to be all about context management. Getting the LLM to deal with the minimum amount of code needed to fix the problem or build the feature, carefully managing token limits and artificially resetting the session when needed so the context handover is managed, all that. Just pointing an LLM at a large code base and expecting good things doesn't work.

            2. I've found the same with Gemini; I can rarely get it to actually do useful things. I have tried many times, but it just underperforms compared to the other mainstream LLMs. Other people have different experiences, though, so I suspect I'm holding it wrong.

            • lan321 5 hours ago

              The problem is by that point it's much less useful in projects. I still like them but when I get to the point of telling it exactly what to do I'm mostly just being lazy. It's useful in that it might give me some ideas I didn't consider but I'm not sure it's saving time.

              Of course, for short one-off scripts, it's amazing. It's also really good at preliminary code reviews. Although if you have some awkward bits due to things outside of your power it'll always complain about them and insist they are wrong and that it can be so much easier if you just do it the naive way.

              Amazon's Kiro IDE seems to have a really good flow, trying to split large projects into bite sized chunks. I, sadly, couldn't even get it to implement solitaire correctly, but the idea sounds good. Agents also seem to help a lot since it can just do things from trial and error, but company policy understandably gets complicated quick if you want to provide the entire repo to an LLM agent and run 'user approved' commands it suggests.

              • rescbr 3 hours ago

                From my experience vibe coding, you spend a lot of time preparing documentation and baseline context for the LLM.

                On one of my projects, I downloaded a library's source code locally and asked Claude to write up a markdown file documenting how to use it, with examples, etc.

                Like, taking your example for solitaire, I’d ask a LLM to write the rules into a markdown file and tell the coding one to refer to those rules.

                I understand it to be a bit like mise en place for cooking.

                • lan321 2 hours ago

                  It's kind of what Kiro does.

                  You tell it what you want and it gives you a list of requirements, which are in that case mostly the rules for Solitaire.

                  You adjust those until you're happy, then you let it generate tasks, which are essentially epics with smaller tickets in order of dependency.

                  You approve those and then it starts developing task by task where you can intervene at any time if it starts going off track.

                  The requirements and tasks it does really well, but the connection of the epics/larger tasks is where it mostly crumbles. I could have made it work with some more messing around, but I've noticed over a couple of projects that, at least in my tries, it always crumbles either at the connection of the epics/large tasks, or when you ask it to do a small modification later down the line and it causes a lot of smaller, subtle changes all over the place. (You could call it a skill issue since I overlooked something in the requirements, but that's kind of how real projects go, so..)

                  It also eats tokens like crazy for private usage, but that's more of a 'playing around' problem. As it stands I'd probably blow $100 a day if I connected it to an actual commercial repo and started experimenting. Still viable with my salary, but still..

        • mlrtime 3 hours ago

          >documented problem in Home Assistant this evening. All it would take is 3-5 lines of YAML. The model failed miserably. I think we're all still safe.

          This is mostly because HA changes so frequently and the documentation is sparse. To get around this and increase the correctness rate, I give it access to the source code of the same version I'm running, then add instructions in CLAUDE.md on where to find the source and that it must use the source code.

          This fixes 99% of my issues.
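
          The CLAUDE.md part is just plain prose instructions; a hypothetical excerpt of the kind of note I mean (the version number and paths are made up):

              # Home Assistant source access
              - The HA core source for the exact version I run (2024.6.x) is checked out at ./vendor/ha-core.
              - For any configuration question, read the relevant integration under
                ./vendor/ha-core/homeassistant/components/ instead of answering from memory.
              - Only suggest YAML options that the source actually parses.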

          • windexh8er 2 hours ago

            For this issue, additional Media Player storage locations, the configuration is actually quite old.

            It does showcase that LLMs don't truly "think" when they're not even able to search for and find the things mentioned. But even then, this configuration has been stable for years and the training data should have plenty of mentions.
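
            For reference, the change being discussed is, if I have it right, just a few lines under homeassistant: in configuration.yaml; something along these lines, with example paths:

                homeassistant:
                  media_dirs:
                    media: /media
                    nas: /mnt/nas/media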

        • DrewADesign 13 hours ago

          It depends on your definition of safe. Most of the code that gets written is pretty simple: basic CRUD web apps, WP theme customization, simple mobile games… stuff that can easily get written by the current gen of tooling. That has already cost a lot of people a lot of money or jobs outright, and most of them probably haven't reached their skill limit as developers.

          As the available work increases in complexity, I reckon more people will push themselves to take jobs further out of their comfort zone. Previously, the choice was to upskill for the challenge and greater earnings, or stay where you are, which is easy and reliable; the current choice is to upskill or get a new career. Rather than switch careers to something they have zero experience in, most will upskill, which puts pressure on the moderately higher-skill job market with far fewer people, and those people start to upskill to outrun the implosion, which puts pressure on the ones above them to move upward, and so on. With even modest productivity gains across the whole industry, it's not hard for me to envision a world where general software development just isn't a particularly valuable skill anymore.

          • windexh8er 13 hours ago

            Everything in tech is cyclical. AI will be no different. Everyone outsourced, realized the pain and suffering and corrected. AI isn't immune to the same trajectory or mistakes. And as corporations realize that nobody has a clue about how their apps or infra run, you're one breach away from putting a relatively large organization under.

            The final kicker in this simple story is that there are many, many narcissistic folks in the C-suite. Do you really think Sam Altman and Co are going to take blame for Billy's shitty vibe coded breach? Yeah right. Welcome to the real world of the enterprise where you still need an actual throat to choke to show your leadership skills.

            • DrewADesign 12 hours ago

              I absolutely don’t think vibe coding or barely supervised agents will replace coders, like outsourcing claimed to, and in some cases did and still does. And outsourcing absolutely affected the job market. If the whole thing does improve and doesn’t turn out to be too wildly unprofitable to survive, what it will do is allow good quality coders— people who understand what can and can’t go without being heavily scrutinized— to do a lot more work. That is a totally different force than outsourcing, which to some extent, assumed software developers were all basically fungible code monkeys at some level.

              • windexh8er an hour ago

                There's a lot to unpack here. I agree - outsourcing did affect the job market. You're just seeing the negative (US) side. If anything, outsourcing was hugely beneficial to the Indian market where most of those contracts landed. My point was that it was sold as a solution that didn't net the value proposition it claimed. And that is why I've said AI is not immune to being cyclical, just like outsourcing. AI is being sold as worker replacement. It's not even close, and if it were then OpenAI, Anthropic and Google would have all replaced a lot of people and wouldn't be allowing you and me to use their tool for $20/month. When it does get that good we will no longer be able to afford using these "enterprise" tools.

                With respect to profitability - there's none in sight. When JP Morgan [0] is saying that $650B in annual revenue is needed to make a paltry 10% return on investment, there is no way any sane financial institution would pump more money into that sunk cost. Yet here we are, building billions of dollars in datacenters for what... mediocre chat bots? Again, these things don't think. They don't reason. They're massive word graphs being used in clever ways with cute, humanizing descriptions. Are they useful for helping a human parse way more information than we can reason about at once? For sure! But that's not worth trillions in investment and won't yield multiples of the input. In fact, I'd argue the AI landscape would be much better off if the dollars stopped flowing, because that would mean real research would need to be done in a much more efficient and effective manner. Instead we're paying individual people hundreds of millions of dollars who, and good for them, have no clue or care about what actually happens with AI because: money in the bank. No, AI in its current form is not profitable, and it's not going to be if we continue down this path. We've literally spent world-changing sums of money on models that are used to create art that will displace the original creators well before they solve any level of useful world problems.

                Finally, and to your last point: "...good quality coders...". How long do you think that will be a thing with respect to how this is all unfolding? Am I writing better code (I'm not a programmer by day) with LLMs? Yes and no. Yes when I need to build a visually appealing UI for something. And yes when it comes to a framework. But what I've found is if I don't put all of the right pieces in the right places before I start I end up with an untenable mess into the first couple thousand lines of that code. So if people stop becoming "good quality programmers" then what? These models only get better with better training data and the web will continue to go insular against these IP stealing efforts. The data isn't free, it never has been. And this is why we're now hearing the trope of "world models". A way to ask for trillions more to provide millionths of a penny on the invested dollar.

                [0] https://www.tomshardware.com/tech-industry/artificial-intell...

      • onoesworkacct 14 hours ago

        That ship has sailed long ago.

        I'm rooting for biological cognitive enhancement through gene editing or whatever other crazy shit. I do not want to have some corporation's AI chip in my brain.

      • drusepth 14 hours ago

        Generally, any expert hopes their tool/paintbrush/etc is as performant as possible.

        • eru 8 hours ago

          And in general I'm all for increasing productivity, in all areas of the economy.

          • elbear 2 hours ago

            To what goal?

      • MaximusLegroom 14 hours ago

        Rooting is useless. We should be taking conscious action to reduce the bosses' manipulation of our lives and society. We will not be saved by hoping to sabotage a genuinely useful technology.

        • TechDebtDevin 14 hours ago

          How is it useful other than for the people making money off token output? Continue to fry your brain.

          • antonvs 13 hours ago

            They’re fantastic learning tools, for a start. What you get out of them is proportional to what you put in.

            You’ve probably heard of the Luddites, the group who destroyed textile mills in the early 1800s. If not: https://en.wikipedia.org/wiki/Luddite

            Luddites often get a bad rap, probably in large part because of employer propaganda and influence over the writing of history, as well as the common tendency of people to react against violent means of protest. But regardless of whether you think they were heroes, villains, or something else, the fact is that their efforts made very little difference in the end, because that kind of technological progress is hard to arrest.

            A better approach is to find ways to continue to thrive even in the presence of problematic technologies, and work to challenge the systems that exploit people rather than attack tools which can be used by anyone.

            You can, of course, continue to flail at the inevitable, but you might want to make sure you understand what you’re trying to achieve.

            • eru 8 hours ago

              Arguably the Luddites don't get a bad enough rep. The lump of labour fallacy was as bad then as it is now or at any other time.

              https://en.wikipedia.org/wiki/Lump_of_labour_fallacy

              • antonvs 7 hours ago

                Again, that may at least in part be a function of how history was written. The Luddite wikipedia link includes this:

                > Malcolm L. Thomas argued in his 1970 history “The Luddites” that machine-breaking was one of the very few tactics that workers could use to increase pressure on employers, undermine lower-paid competing workers, and create solidarity among workers. "These attacks on machines did not imply any necessary hostility to machinery as such; machinery was just a conveniently exposed target against which an attack could be made."[10] Historian Eric Hobsbawm has called their machine wrecking "collective bargaining by riot", which had been a tactic used in Britain since the Restoration because manufactories were scattered throughout the country, and that made it impractical to hold large-scale strikes.

                Of course, there would have been people who just saw it as striking back at the machines, and leaders who took advantage of that tendency, but the point is it probably wasn’t as simple as the popular accounts suggest.

                Also, there’s a kind of corollary to the lump of labor fallacy, which is arguably a big reason the US is facing such a significant political upheaval today: when you disturb the labor status quo, it takes time - potentially even generations - for the economy to adjust and adapt, and many people can end up relatively worse off as a result. Most US factory workers and miners didn’t end up with good service industry jobs, for example.

                Sure, at a macro level an economist viewing the situation from 30,000 feet sees no problem - meanwhile on the ground, you end up with millions of people ready to vote for a wannabe autocrat who promises to make things the way they were. Trying to treat economics as a discipline separate from politics, sociology, and psychology in these situations can be misleading.

                • eru 6 hours ago

                  > [...] undermine lower-paid competing workers, and create solidarity among workers.

                  Nice 'solidarity' there!

                  > Most US factory workers and miners didn’t end up with good service industry jobs, for example.

                  Which people are you talking about? More specifically, when?

                  As long as overall unemployment stays low and the economy keeps growing, I don't see much of a problem. Even if you tried to keep everything exactly as is, you'll always have some people who do better and some who do worse; even if just from random chance. It's hard to blame that on change.

                  See eg how the draw down of the domestic construction industry around 2007 was handled: construction employment fell over time, but overall unemployment was low and flat. Indicating an orderly shuffling around of workers from construction into the wider economy. (As a bonus point, contrast with how the Fed unnecessarily tanked the wider economy a few months after this re-allocation of labour had already finished.)

                  > Sure, at a macro level an economist viewing the situation from 30,000 feet sees no problem - meanwhile on the ground, you end up with millions of people ready to vote for a wannabe autocrat who promises to make things the way they were. Trying to treat economics as a discipline separate from politics, sociology, and psychology in these situations can be misleading.

                  It would help immensely, if the Fed were more competent in preventing recessions. Nominal GDP level targeting would help to keep overall spending in the economy on track.

                  • derangedHorse 2 hours ago

                    The Fed is capable of doing no such thing. They can soften or delay recessions by socializing mistakes and redistributing wealth using interest rates, but an absence of recessions would imply perfect market participants.

            • melagonster 8 hours ago

              Today, we have found better ways to prevent machines from crushing children, e.g. more regulation through democracy.

          • nh23423fefe 14 hours ago

            are you pretending to be confused?

            • ImaLuddite 12 hours ago

              I see millions of kids cheating on their schoolwork, and many adults outsourcing reading and thinking to GPUs. There's like 0.001% of people who use them to learn responsibly. You are genuinely a fool.

              • antonvs 10 hours ago

                Hey, I wrote a long response to your other reply to me, but your comment seems to have been flagged so I can no longer reply there. Since I took the time to write that, I'm posting it here.

                I'm glad I was able to inspire a new username for you. But aren't you concerned that if you let other people influence you like that, you're frying your brain? Shouldn't everything originate in your own mind?

                > They don't provide any value except to a very small percentage of the population who safely use them to learn

                There are many things that only a small percentage of the population benefit from or care about. What do you want to do about that? Ban those things? Post exclamation-filled comments exhorting people not to use them? This comes back to what I said at the end of my previous comment:

                You might want to make sure you understand what you’re trying to achieve.

                Do you know the answer to that?

                > A language model is not the same as a convolutional neural network finding anomalies on medical imaging.

                Why not? Aren't radiologists "frying their brains" by using these instead of examining the images themselves?

                The last paragraph of your other comment was literally the Luddite argument. (Sorry I can't quote it now.) Do you know how to weave cloth? No? Your brain is fried!

                The world changes, and I find it more interesting and challenging to change with it, than to fight to maintain some arbitrary status quo. To quote Ghost in the Shell:

                All things change in a dynamic environment. Your effort to remain what you are is what limits you.

                For me, it's not about "getting ahead" as you put it. It's about enjoying my work, learning new things. I work in software development because I enjoy it. LLMs have opened up new possibilities for me. In that 5 year future you mentioned, I'm going to have learned a lot of things that someone not using LLMs will not have.

                As for being dependent on Altman et al., you can easily go out and buy a machine that will allow you to run decent models yourself. A Mac, a Framework desktop, any number of mini PCs with some kind of unified memory. The real dependence is on the training of the models, not running them. And if that becomes less accessible, and new open weight models stop being released, the open weight models we have now won't disappear, and aren't going to get any worse for things like coding or searching the web.

                > Keep falling for lesswrong bs.

                Good grief. Lesswrong is one of the most misleadingly named groups around, and their abuse of the word "rational" would be hilarious if it weren't sad. In any case, Yudkowsky advocated being ready to nuke data centers, in a national publication. I'm not particularly aware of their position on the utility of AI, because I don't follow any of that.

                What I'm describing to you is based on my own experience, from the enrichment I've experienced from having used LLMs for the past couple of years. Over time, I suspect that kind of constructive and productive usage will spread to more people.

                • TechDebtDevin 9 hours ago

                  Out of respect for the time you put into your response, I will try to respond in good faith.

                  > There are many things that only a small percentage of the population benefit from or care about. What do you want to do about that?

                  ---There are many things in our society that I would like to ban that are useful to a small percentage of the population, or that at least should be heavily regulated. Guns, for example. A more extreme example would be cars. Many people drive 5 blocks when they could walk, to their (and everyone else's) detriment. Forget the climate, it impacts everyone (brake dust, fumes, pedestrian deaths). Some cities create very expensive tolls / parking fees to prevent this; this angers most people and is seen as irrational by the masses, but it is necessary and not done enough. Open, free societies are a scam told to us by capitalists who want to exploit without any consequences.

                  --- I want to air-gap all computers in classrooms. I want students to be expelled for using LLMs to do assignments, as they would have been previously for plagiarism (that's all an llm is, a plagiarism laundering machine).

                  ---During COVID there was a phenomenon where some children did not learn to speak until they were 4-5 years old, and some of those children were even diagnosed with autism. In reality, we didn't understand fully how children learned to speak, and didn't understand the importance of the young brain's need to subconsciously process people's facial expressions. It was Masks!!! (I am not making a statement on masks fyi) We are already observing unpredictable effects that LLMs have on the brain and I believe we will see similar negative consequences on the young mind if we take away the struggle to read, think and process information. Hell I already see the effects on myself, and I'm middle aged!

                  > Why not? Aren't radiologists "frying their brains" by using these instead of examining the images themselves?

                  --- I'm okay with technology replacing a radiologist!!! Just like I'm okay with a worker being replaced in an unsafe textile factory! The stakes are higher in both of these cases, and it's obviously in the best interest of society as a whole. The same cannot be said for a machine that helps some people learn while making the rest dependent on it. It's the opposite of a great equalizer; it will lead to a huge gap in inequality for many different reasons.

                  We can all say we think this will be better for learning, that remains to be seen. I don't really want to run a worldwide experiment on a generation of children so tech companies can make a trillion dollars, but here we are. Didn't we learn our lesson with social media/porn?

                  If Ubers were subsidized and cost only $20.00 a month for unlimited rides, could people be trusted to only use them when it was reasonable, or would they be taking Ubers to go 5 blocks, increasing the risk for pedestrians and deteriorating their own health? They would use them in an irresponsible way.

                  If there was an unlimited pizza machine that cost $20.00 a month to create unlimited food, people would see that as a miracle! It would greatly benefit the percentage of the population that is food insecure, but could they be trusted to not eat themselves into obesity after getting their fill? I don't think so. The affordability of food, and the access to it has a direct correlation to obesity.

                  Both of these scenarios look great on the surface but are terrible for society in the long run.

                  I could go on and on about the moral hazards of LLMs, there are many more outside of just the dangers of learning and labor. We are being told they are game changing by the people who profit off them..

                  In the past, empires bet their entire kingdoms on the words of astrologers and magicians who said they could predict the future. I really don't see how the people running AI companies are any different from those astrologers (they even say they can predict the future LOL!)

                  They are Dunning-Kruger plagiarism-laundering machines as I see it. Text-extruding machines controlled by a cabal of tech billionaires who have proven time and time again that they do not have society's best interests at heart.

                  I really hope this message is allowed to send!

                  • alchemism an hour ago

                    The kids went out and found the “cheating engines” for themselves. There was no plot from Big Tech, and believe me academia does not like them either.

                    They have, believe it or not, very little power to stop kids from choosing to use cheating engines on their personal laptops. Universities are not Enterprise.

                    • mavhc 25 minutes ago

                      They're just exploiting a bug in the Educational System where instead of testing if students know things, we test if they can produce a product that implies they know things. We don't interrogate them in person with questions to see if they understand the topic, we give them multiple choice questions that can be marked automatically to save time

                  • mlrtime 2 hours ago

                    Just replying that I read your post, and don't disagree with some of what you wrote, and I'm glad there are some people that peacefully/respectfully push back (because balance is good).

                    However, I don't agree that AI is a risk to the extreme levels you seem to think it is. The truth is that humans have advanced through the use of technology since the first tool, and we are horrible at predicting what these technologies will bring.

                    So far they have been mostly positive, I don't see a long term difference here.

                  • antonvs 7 hours ago

                    Ok, so there’s a clear pattern emerging here, which is that you think we should do much more to manage our use of technology. An interesting example of that is the Amish. While they take it to what can seem like an extreme, they’re doing exactly what you’re getting at, just perhaps to a different degree.

                    The problem with such approaches is that it involves some people imposing their opinions on others, “for their own good”. That kind of thing often doesn’t turn out well. The Amish address that by letting their children leave to experience the outside world, so that their return is (arguably) voluntary - they have an opportunity to consent to the Amish social contract.

                    But what you seem to be doing is making a determination of what’s good for society as a whole, and then because you have no way to effect that, you argue against the tools that we might abuse rather than the tendencies people have to abuse them. It seems misplaced to me. I’m not saying there are no societal dangers from LLMs, or problems with the technocrats and capitalists running it all, but we’re not going to successfully address those issues by attacking the tools, or people who are using them effectively.

                    > In the past, empires bet their entire kingdoms on the words of astrologers and magicians who said they could predict the future.

                    You’re trying to predict the future as well, quite pessimistically at that.

                    I don’t pretend to be able to predict the future, but I do have a certain amount of trust in the ability of people to adapt to change.

                    > that's all an llm is, a plagiarism laundering machine

                    That’s a possible application, but it’s certainly not all they are. If you genuinely believe that’s all they are, then I don’t think you have a good understanding of them, and it could explain some of our difference in perspective.

                    One of the important features of LLMs is transfer learning: their ability to apply their training to problems that were not directly in their training set. Writing code is a good example of this: you can use LLMs to successfully write novel programs. There’s no plagiarism involved.

      • eru 8 hours ago

        > You should be rooting for these LLMs to be as bad as possible.

        Why?

  • rbjorklin 19 hours ago

    Your post made me curious to try a problem I have been coming back to ever since ChatGPT was first released: https://open.kattis.com/problems/low

    I'd had no success using LLMs to solve this particular problem until trying Gemini 3 just now, despite solutions to it existing in the training data. This has been my personal litmus test of LLM programming capabilities, and a model finally passed.

    • kenjackson 15 hours ago

      ChatGPT solves this problem now as well with 5.1. Time for a new litmus test.

  • qsort 21 hours ago

    To be fair, a lot of the impressive Elo scores models get are simply due to the fact that they're faster: many serious competitive coders could get the same or better results given enough time.

    But seeing these results I'd be surprised if by the end of the decade we don't have something that is to these puzzles what Stockfish is to chess. Effectively ground truth and often coming up with solutions that would be absolutely ridiculous for a human to find within a reasonable time limit.

    • nerdsniper 20 hours ago

      I’d love if anyone could provide examples of such AND(“ground truth”, “absolutely ridiculous”) solutions! Even if they took clever humans a long time to create.

      I’m curious to explore such fun programming code. But I’m also curious to explore what knowledgeable humans consider to be both “ground truth” as well as “absolutely ridiculous” to create within the usual time constraints.

      • qsort 19 hours ago

        I'm not explaining myself right.

        Stockfish is a superhuman chess program. It's routinely used in chess analysis as "ground truth": if Stockfish says you've made a mistake, it's almost certain you did in fact make a mistake[0]. Also, because it's incomparably stronger than even the very best humans, sometimes the moves it suggests are extremely counterintuitive and it would be unrealistic to expect a human to find them in tournament conditions.

        Obviously software development in general is way more open-ended, but if we restrict ourselves to puzzles and competitions, which are closed game-like environments, it seems plausible to me that a similar skill level could be achieved with an agent system that's RL'd to death on that task. If you have base models that can get there, even inconsistently so, and an environment where making a lot of attempts is cheap, that's the kind of setup that RL can optimize to the moon and beyond.

        I don't predict the future and I'm very skeptical of anybody who claims to do so, correctly predicting the present is already hard enough, I'm just saying that given the progress we've already made I would find plausible that a system like that could be made in a few years. The details of what it would look like are beyond my pay grade.

        ---

        [0] With caveats in endgames, closed positions and whatnot, I'm using it as an example.

        • pclmulqdq 19 hours ago

          Yeah, it is often pointed out as a brilliance in game analysis if a GM makes a move that an engine says is bad and turns out to be good. However, it only happens in very specific positions.

          • emodendroket 19 hours ago

            Does that happen because the player understands some tendency of their opponent that will cause them to not play optimally? Or is it genuinely some flaw in the machine’s analysis?

            • nerdsniper 16 hours ago

              Both, but perhaps more often neither.

              From what I've seen, sometimes the computer correctly assesses that the "bad" move opens up some kind of "checkmate in 45 moves" that could technically happen, but requires the opponent to see it 45 moves ahead of time and play something that would otherwise appear to be completely sub-optimal until something like 35 moves in, at which point normal peak grandmasters would finally go "oh okay now I get the point of all of that confusing behavior, and I can now see that I'm going to get mated in 10 moves".

              So, the computer is "right" - that move is worse if you're playing a supercomputer. But it's "wrong" because that same move is better as long as you're playing a human, who will never be able to see an absurd thread-the-needle forced play 45-75 moves ahead.

              That said, this probably isn't what GP was referring to, as it wouldn't lead to an assignment of a "brilliant" move simply for failing to see the impossible-to-actually-play line.

              • travisjungroth 15 hours ago

                This is similar to game theory optimal poker. The optimal move is predicated on later making optimal moves. If you don’t have that ability (because you’re human) then the non-optimal move is actually better.

                Poker is funny because you have humans emulating human-beating machines, but that's hard enough to do that you also have players who don't do this and still win.

              • pclmulqdq 15 hours ago

                I think this is correct for modern engines. Usually, these moves are open to a very particular line of counterplay that no human would ever find because they rely on some "computer" moves. Computer moves are moves that look dumb and insane but set up a very long line that happens to work.

            • bionsystem 16 hours ago

              It does happen that the engine doesn't immediately see that a line is best, but that's getting very rare these days. It was funny in certain positions a few years back to see the engine "change its mind", including in older games where some grandmaster found a line that was particularly brilliant, completely counter-intuitive even for an engine, AND correct.

              But mostly what happens is that a move isn't so good, but it isn't so bad either: the computer will tell you it is sub-optimal, but a human won't be able to refute it in finite time, so the opponent's practical (as opposed to theoretical) chances are reduced. One great recent example of that is Pentala Harikrishna's queen sacrifice in the World Cup: an amazing conception of a move that the computer says is borderline incorrect, but it leads to such complications and such an uncomfortable position for his opponent that it was practically a great choice.

            • pclmulqdq 19 hours ago

              It can be either one. In closed positions, it is often the latter.

            • thomasahle 18 hours ago

              It's only the latter if it's a weak browser engine, and it's early enough in the game that the player had studied the position with a cloud engine.

          • jeswin 13 hours ago

            > Yeah, it is often pointed out as a brilliance in game analysis if a GM makes a move that an engine says is bad and turns out to be good.

            Do you have any links? I haven't seen any such (forget GM, not even Magnus), barring the opponent making mistakes.

            • yunwal 9 hours ago

              Here’s a chess stackexchange of positions that stump engines

              https://chess.stackexchange.com/questions/29716/positions-th...

              It basically comes down to “ideas that are rare enough that they were never programmed into a chess engine”.

              Blockades or positions where no progress is possible are a common theme. Engines will often keep tree searching where a human sees an obvious repeating pattern.

              Here’s also an example where 2 engines are playing, and deep mind finds a move that I think would be obvious to most grandmasters, yet stockfish misses it https://youtu.be/lFXJWPhDsSY?si=zaLQR6sWdEJBMbIO

              That being said, I’m not sure that this necessarily correlates with brilliancy. There are a few of these that I would probably get in classical time and I’m not a particularly brilliant player.

              • TheRoque 6 hours ago

                Stockfish totally dropped hand crafted evaluations in 2023.

            • bluecalm 6 hours ago

              It used to happen way more often with Magnus and classical versions of Stockfish from pre Alpha Zero/Leela Zero days. Since NN Stockfish I don't think it happens anymore.

            • typpilol 11 hours ago

              Maybe he means not the best move but an equally almost strong move?

              Because ya, that doesn't happen lol.

        • pmarreck 15 hours ago

          I would love to examine Stockfish play that seemed extremely counterintuitive but which ended up winning. How can I do so? (I don't inhabit any of the current chess spaces so have no idea where to look, but my son is approaching the age where I can start to teach him...).

          That said, chess is such a great human invention. (Go is up there too. And Texas no-limit hold'em poker. Those are my top 3 votes for "best human tabletop games ever invented". They're also, perhaps not coincidentally, the hardest for computers to be good at. Or, were.)

          • elevatortrim 7 hours ago

            > I would love to examine Stockfish play that seemed extremely counterintuitive but which ended up winning.

            If you want to see this against someone like Magnus, it is rare as super GMs do not spend a lot of time playing engines publicly.

            But if you want to see them against a normal chess master, somewhere between master and international master, it is everywhere. For example, this guy analyses every one of his matches afterwards, and you frequently hear "oh, I would never have seen that line":

            https://www.youtube.com/playlist?list=PLp7SLTJhX1u6zKT5IfRVm...

            (start watching around 1000+ for frequently seeing those moments)

          • qsort 14 hours ago

            The problem is that Stockfish is so strong that the only way to have it play meaningful games is to put it against other computers. Chess engines play each other in automated competitions like TCEC.

            If you look on Youtube there are many channels where strong players analyze these games. As Demis Hassabis once put it, it's like chess from another dimension.

          • mquander 14 hours ago

            I recommend Matthew Sadler's Game Changer and The Silicon Road To Chess Improvement.

        • nerdsniper 17 hours ago

          You explained yourself right. The issue is that you keep qualifying your statements.

          > it suggests are extremely counterintuitive and it would be unrealistic to expect a human to find them...

          > ... in tournament conditions.

          I'm suggesting that I'd like to see the ones that humans have found - outside of tournament conditions. Perhaps the gulf between us arises from an unspoken reference to solutions "unrealistic to expect a human to find" without the window-of-time qualifier?

          • jpadkins 16 hours ago

            I can wreck stockfish in chess boxing. Mostly because stockfish can't box, and it's easy for me to knock over a computer.

            • bionsystem 16 hours ago

              If it runs on a mainframe you would lose both the chess and the boxing.

          • crooked-v 16 hours ago

            The point of that qualifier is that you can expect to see weird moves outside of tournament conditions, because casual games are when people experiment with that kind of thing.

    • vjerancrnjak 19 hours ago

      How are they faster? I don't think any Elo report actually comes from participating in a live coding contest on previously unseen problems.

      • qsort 19 hours ago

        My background is more on math competitions, but all of those things are essentially speed contests. The skill comes from solving hard problems within a strict time limit. If you gave people twice the time, they'd do better, but time is never going to be an issue for a computer.

        Comparing raw Elo ratings isn't very indicative IMHO, but I do find it plausible that in closed, game-like environments models could indeed achieve the superhuman performance the Elo comparison implies, see my other comment in this thread.

  • sedatk 19 hours ago

    Just to clarify the context for future readers: the latest problem at the moment is #970: https://projecteuler.net/problem=970

    • tails4e 15 hours ago

      I just had ChatGPT explain that problem to me (I was unfamiliar with the mathematical background). It showed how to derive closed-form answers for H(2) and H(3) and then numerical solutions using RK4 for higher values. Truly impressive, and it explained the derivations beautifully. There are few maths experts I've encountered who could have hand-held me through it as well.
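
      For anyone unfamiliar with RK4: it's the classic fourth-order Runge-Kutta integrator for ODEs. A minimal generic Python sketch (the actual differential equation behind H(m) in the problem isn't reproduced here, so the right-hand side below is just a placeholder):

          def rk4(f, y0, t0, t1, steps):
              """Integrate dy/dt = f(t, y) from t0 to t1 with classic Runge-Kutta 4."""
              h = (t1 - t0) / steps
              t, y = t0, y0
              for _ in range(steps):
                  k1 = f(t, y)
                  k2 = f(t + h / 2, y + h / 2 * k1)
                  k3 = f(t + h / 2, y + h / 2 * k2)
                  k4 = f(t + h, y + h * k3)
                  y += (h / 6) * (k1 + 2 * k2 + 2 * k3 + k4)
                  t += h
              return y

          # Sanity check on dy/dt = -y: the exact value at t = 1 is e^-1 ~ 0.3678794
          print(rk4(lambda t, y: -y, 1.0, 0.0, 1.0, 1000))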

      • lionkor 3 hours ago

        Was the explanation correct?

  • thomasahle 21 hours ago

    I tried it with gpt-5.1 thinking, and it just searched and found a solution online :p

    • lairv 20 hours ago

      Is there a solution to this exact problem, or just to related notions (renewal equations etc.)? Anyway, it seems like nothing beats training on the test set.

  • irthomasthomas 18 hours ago

    Are you sure it did not retrieve the answer using websearch?

  • id 19 hours ago

    gpt-5.1 gave me the correct answer after 2m 17s. That includes retrieving the Euler website. I didn't even have to run the Python script; it did that too.

  • j2kun 19 hours ago

    Did it search the web?

  • jamilton 16 hours ago

    Yeah, LLMs used to not be up to par for new Project Euler problems, but GPT-5 was able to do a few of the recent ones which I tried a few weeks ago.

  • motbus3 5 hours ago

    We need to wait and see. According to Google they solved AI 10 years ago with Google Duo, yet somehow they keep smashing records despite having had the worst coding tool until Gemini 2.5. Google's internal benchmarks are irrelevant.

  • bumling 17 hours ago

    I asked Grok to write a Python script to solve this and it did it in slightly under ten minutes, after one false start where I'd asked it using a mode that doesn't think deeply enough. Impressive.

  • NaomiLehman 4 hours ago

    definitely uses a lot of tooling. From "thinking":

    > I'm now writing a Python script to automate the summation computation. I'm implementing a prime sieve and focusing on functions for Rm and Km calculation [...]

  • lofaszvanitt 9 hours ago

    The problem is these models are optimized to solve the benchmarks, not real world problems.

  • ivape 13 hours ago

    So when does the developer admit defeat? Do we have a benchmark for that yet?

  • bgwalter 15 hours ago

    Does it matter if it is out of the training data? The models integrate web search quite well.

    What if they have an internal corpus of new and curated knowledge that is constantly updated by humans and accessed in a similar manner? It could be active even if web search is turned off.

    They would surely add the latest Euler problems with solutions in order to show off in benchmarks.

    • NaomiLehman 4 hours ago

      you can disable search.

      just create a different problem if you don't believe it.

  • panarky 15 hours ago

    [flagged]

    • vitorgrs 11 hours ago

      I usually ask a simple question that ALL the models get wrong: the list of mayors of my city [Londrina]. ALL the models (offline) get it wrong. And I mean all of them. The best I got, I believe, was o3 saying it couldn't give a good answer for that and telling me to check the city website.

      Gemini 3 somehow is able to give a list of mayors, including details on who got impeached, etc.

      This should be a simple answer, because all the data is on Wikipedia, which the models are certainly trained on, but somehow most models don't manage to get it right, because... it's just an irrelevant city in a huge dataset.

      But somehow, Gemini 3 did it.

      Edit: Just asked "Cool places to visit in Londrina" (in Portuguese), and it was also 99% right, unlike other models, which just make stuff up. The only thing wrong here: it mentioned sakuras by a lake... Maybe it confused them with Brazilian ipês, which look similar, and indeed the city is full of them.

      It seems to have a visual understanding, imo.

      • guiambros 10 hours ago

        Ha, I just did the same with my hometown (Guaíba, RS), a city 1/6th the size of Londrina, whose English Wikipedia page hasn't been updated in years and still lists the wrong mayor (!).

        Gemini 3 nailed it on the first try, included political affiliation, and added some context on who they competed with and beat in each of the last 3 elections. And I just made a fun application with AI Studio, and it worked on the first shot. Pretty impressive.

        (disclaimer: Googler, but no affiliation with Gemini team)

      • miroljub 2 hours ago

        Funny, I just asked "Ask Brave", which uses a cheap LLM connected directly to its search engine, and it got it right without any issues.

        It shows once again that for common searches, (indexed) data is king, and that's where I expect even a simple LLM directly connected to a huge indexed dataset to win against much more sophisticated LLMs that have to use agents for searching.

      • usef- 10 hours ago

        Pure fact-based, niche questions like that aren't really the focus of most providers any more from what I've heard, since they can be solved more reliably by integrating search tools (and all providers now have search).

        I wouldn't be surprised if the smallest models can answer fewer such (fact-only) questions over time offline as they distill/focus them more thoroughly on logic etc.

      • arach 10 hours ago

        thanks for sharing, very interesting example

      • belter 6 hours ago

        I asked Claude, and had no issues with the answer including mentioning the impeached Antonio Belinati...

    • calmoo 3 hours ago

      Your comment is AI generated

      • tim333 38 minutes ago

        I'll give you the style is like an LLM but the thoughts seem a bit unlike one. I mean the MathArena Apex results indicating a new discovery rather than more data is definitely a hypothesis.

        Also panarky denies it.

    • Xss3 3 hours ago

      This comment was written by an AI specifically instructed to be more concise than usual.

    • red75prime 4 hours ago

      > To succeed this well in math, you can't just do better probabilistic generation, you need verifiable search.

      You say "probabilistic generation" like it's some kind of a limitation. What is exactly the limiting factor here? [(0.9999, "4"), (0.00001, "four"), ...] is a valid probability distribution. The sampler can be set to always choose "4" in such cases.

    • legel 12 hours ago

      Thanks for reporting these metrics and drawing the conclusion of an underlying breakthrough in search.

      In his Nobel Prize winning speech, Demis Hassabis ends by discussing how he sees all of intelligence as a big tree-like search process.

      https://youtube.com/watch?v=YtPaZsasmNA&t=1218

      • derangedHorse 2 hours ago

        The one thing I got out of the MIT OpenCourseWare AI course by Patrick Winston was that all of AI could be framed as a problem of search. Interesting to see Demis echo that here.

    • rafaelmn 10 hours ago

      It tells me that the benchmark is probably leaking into training data, and going to the benchmark site :

      > Model was published after the competition date, making contamination possible.

      Aside from eval on most of these benchmarks being stupid most of the time, these guys have every incentive to cheat - these aren't some academic AI labs, they have to justify hundreds of billions being spent/allocated in the market.

      Actually trying the model on a few of my daily tasks and reading the reasoning traces all I'm seeing is same old, same old - Claude is still better at "getting" the problem.

    • coldtea 3 hours ago

      >This is not an incremental advance. It is a step change. This indicates a new discovery, not just more data or more compute.

      To succeed this well in math, you can't just do better probabilistic generation, you need verifiable search.

      You need to verify what you're doing, detect when you make a mistake, and backtrack to try a different approach.

      Looks like AI slop

    • thomasahle 5 hours ago

      From my understanding, Google put online the largest RL cluster in the world not so long ago. It's not surprising they do really well on things that are "easy" to RL, like math or SimpleQA

    • dbbk 3 hours ago

      Aren't you just describing tool calls?

    • Der_Einzige 6 hours ago

      You clearly AI generated this comment.

    • thejarren 14 hours ago

      [flagged]

      • panarky 14 hours ago

        Hmmm, I wrote those words myself, maybe I've spent too much time with LLMs and now I'm talking like them??

        I'd be interested in any evidence-based arguments you might have beyond attacking my writing style and insinuating bad intent.

        I found this commenter had sage advice about how to use HN well, I try to follow it: https://news.ycombinator.com/item?id=38944467

        • thejarren 9 hours ago

          I’ll take you at your word, sorry for the incorrect callout. Your comment format appeared malicious, so my response wasn’t an attempt at being “snarky”, just acting defensively. I like the HN Rules/Guidelines.

        • momojo 12 hours ago

          You mentioned "step change" twice. Maybe a once-over next time? My favorite Mark Twain quote is (very paraphrased) "My apologies, had I more time, I would have written a shorter letter".

          • versteegen 6 hours ago

            I thought the repetition was intentional.

        • ciberado 8 hours ago

          This is something that is happening to me too, and frankly I'm a little concerned. English is not my first language, so I use AI for checking and writing many things. And I spend a lot of time with coding tools. And now I sometimes need to make a conscious effort to avoid mimicking some LLM patterns...

      • ineedasername 10 hours ago

        1) Models learn these patterns from common human usage. They are in the wild, and as such there will be people who use them naturally.

        2) Now, given that models have, for some reason, chosen it ubiquitously, it is also a phrasing that many more people are exposed to every day.

        Language is contagious. This phrasing is approaching herd levels, meaning models trained from up-to-the-moment web content will start to see it as less distinctly salient. Eventually, there will be some other high-signal novel phrase with high salience, and the attention heads will latch on to it from the surrounding context, and then that will be the new AI shibboleth.

        It's just how language works. We see it in the mix between generations when our kids pick up new lingo, and then it stops being in-group for them when it spreads too far. Skibidi, 6 7, etc.

        It's just how language works, and a generation ago the internet put it on steroids. Now? Even faster.

      • sindriava 14 hours ago

        You seem very comfortable making unfounded claims. I don't think this is very constructive or adds much to the discussion. While we can debate the stylistic changes of the previous commenter, you seem to be discounting the rate at which the writing style of various LLMs has backpropagated into many people's brains.

        • roncesvalles 12 hours ago

          Also discounting the fact that people actually do talk like that. In fact, these days I have to modify my prose to be intentionally less LLM-like lest the reader thinks it's LLM output.

  • orly01 21 hours ago

    Wow. Sounds pretty impressive.

davidpolberger 15 hours ago

This is wild. I gave it some legacy XML describing a formula-driven calculator app, and it produced a working web app in under a minute:

https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%...

I spent years building a compiler that takes our custom XML format and generates an app for Android or Java Swing. Gemini pulled off the same feat in under a minute, with no explanation of the format. The XML is fairly self-explanatory, but still.

I tried doing the same with Lovable, but the resulting app wouldn't work properly, and I burned through my credits fast while trying to nudge it into a usable state. This was on another level.

  • zarzavat 11 hours ago

    This is exactly the kind of task that LLMs are good at.

    They are good at transforming one format to another. They are good at boilerplate.

    They are bad at deciding requirements by themselves. They are bad at original research, for example developing a new algorithm.

    • arnorhs an hour ago

      +/-

      > They are bad at deciding requirements by themselves.

      What do you mean by requirements here? In my experience the frontier models today are pretty good at figuring out requirements, even when you don't explicitly state them.

      > They are bad at original research

      Sure, I don't have any experience with that, so I'll trust you on that.

      > for example developing a new algorithm.

      This is just not correct. I used to think so, but then I was working on a pretty complicated pattern-matching, multi-dimensional algorithm (I can't go into the details). It was something I could figure out on my own, and I was halfway through it, but I decided to write up a description and feed it to Gemini 2.5 Pro a couple of months ago, and I was stunned.

      It came up with a really clever approach, something I had previously been convinced the models weren't very good at.

      In hindsight, since they are getting so good at math in general, there's probably some overlap, but you should revisit your views on this.

      --

      Your 'bad at' list is missing a few things though:

      - Calculations (they can come up with how to calculate or write a program to calculate from given data, but they are not good at calculating in their responses)

      - Even though the frontier models are multi-modal, they are still bad at visualizing html/css - or interpreting what it would look like

      - Same goes for visualizing/figuring out visual errors in graphics programming such as games programming or 3d modeling (z-index issues, orientation etc)

    • codespin 11 hours ago

      > They are good at transforming one format to another. They are good at boilerplate.

      You just described 90% of coding

      • Cthulhu_ 6 hours ago

        Thing is, an LLM doesn't need motivation or self-discipline to start writing, which at this point I'm confident is the main factor slowing down software development, after requirements etc.

        • agumonkey 2 hours ago

          These also have larger memory in a way, or deeper stacks of facts. They seem to be able to explore way more sources rapidly and thus emit a solution with more knowledge. As a human I will explore less before trying to solve a problem, and only dig deeper if that fails.

      • nwienert 11 hours ago

        They’re bad at 90% of coding, but for other reasons. That said if you babysit them incessantly they can help you move a bit faster through some of it.

      • oblio 10 hours ago

        90% of writing code, sure. But most professional programmers write code maybe 20% of the time. A lot of the time is spent clarifying requirements and similar stuff.

        • Agentlien 9 hours ago

          The more I hear about other developers' work, the more varied it seems. I've had a few different roles, from one programmer in a huge org to lead programmer in a small team, with a few stints of technical expert in-between. For each the kind of work I do most has varied a lot, but it's never been mostly about "clarifying requirements". As a grunt worker I mostly just wrote and tested code. As a lead I spent most time mentoring, reviewing code, or in meetings. These days I spend most of my time debugging issues and staring at graphics debugger captures.

          • oblio 6 hours ago

            > As a lead I spent most time

            > mentoring

            Clarifying either business or technical requirements for newer or junior hires.

            > reviewing code

            See mentoring.

            > or in meetings

            So clarifying requirements from/for other teams, including scope, purely financial or technical concerns, etc.

            Rephrase "clarifying requirements" to "human oriented aspects of software engineering".

            Plus, based on the graphics debugger part of your comment, you're a game developer (or at least adjacent). That's a different world. Most software developers are line of business developers (pharmaceutical, healthcare, automotive, etc) or generalists in big tech companies that have to navigate very complex social environments. In both places, developers that are just heads down in code tend not to do well long term.

            • falcor84 4 hours ago

              > human oriented aspects

              The irony is of course that humans in general and software professionals in particular (myself definitely included) notoriously struggle with communication, whereas RLHF is literally optimizing LLMs for clear communication. Why wouldn't you expect an AI that's both a superhuman coder and a superhuman communicator to be decent at translating between human requirements and code?

              • oblio 2 hours ago

                > Why wouldn't you expect an AI that's both a superhuman coder and a superhuman communicator to be decent at translating between human requirements and code?

                At this point LLMs are a superhuman nothing, except in terms of volume, which is a standard computer thing ("To err is human, but to really foul things up you need a computer" - a quote from 60 years ago).

                LLMs are fast, reasonably flexible, but at the moment they don't really raise the ceiling in terms of quality, which is what I would define as "superhuman".

                They are comparatively cheaper than humans and volume matters ("quantity has a quality all its own" - speaking of quotes). But I'm fairly sure that superhuman to most people means "Superman", not 1 trillion ants :-)

                • falcor84 37 minutes ago

                  I wrote that based on my experience comparing my prose writing and code to what I can get from ChatGPT or Claude Code, which I feel are on average significantly higher quality than what I can do on a single pass. The quality still improves when I critique its output and iterate with it, but from what I tried, the quality of the result of it doing the work and me critiquing it is better (and definitely faster) than what I get when I try to do it myself and have it critique my approach.

                  But maybe it's just because I personally am not as good as others, so let me try to offer some examples of tasks where the quality of AI output is empirically better than the human baseline:

                  1. Chess (and other games) - Stockfish has an ELO of 3644[0], compared to Magnus Carlsen at 2882

                  2. Natural Language understanding - AIs surpassed the human expert baseline on SuperGlue a while ago [1]

                  3. General image classification - On Imagenet top-5, facebook's convnext is at 98.55 [2], while humans are at about 94.9% [3]. Humans are still better at poor lighting conditions, but with additional training data, AIs are catching up quickly.

                  4. Cancer diagnosis - on lymph-node whole slide images, the best human pathologist in the study got an AUC of 0.884, while the best AI classifier was at 0.994 [4]

                  5. Competition math - AI is at the level of the best competitors, achieving gold level at the IMO this year [5]. It's not clearly superhuman yet, but I expect it will be very soon.

                  6. Competition coding - Here too AI is head to head with the best competitors, successfully solving all problems at this year's ICPC [6]. Similarly, at the AtCoder World Tour Finals 2025 Heuristic contest, only one human managed to beat the OpenAI submission [7].

                  So summing this up, I'll say that even if AI isn't better at all of these tasks than the best-prepared humans, it's extremely unlikely that I'll get one of those humans to do tasks for me. So while AI is still very flawed, I already quite often prefer to rely on it rather than delegate to another human, and this is as bad as it ever will be.

                  P.S. While not a benchmark, there's a small study from last year that looked at the quality of AI-generated code documentation in comparison to the actual human-written documentation in a variety of code bases and found "results indicate that all LLMs (except StarChat) consistently outperform the original documentation generated by humans." [8]

                  [0] https://computerchess.org.uk/ccrl/4040/

                  [1] https://super.gluebenchmark.com/

                  [2] https://huggingface.co/spaces/Bekhouche/ImageNet-1k_leaderbo...

                  [3] https://cs.stanford.edu/people/karpathy/ilsvrc/

                  [4] https://jamanetwork.com/journals/jama/fullarticle/2665774

                  [5] https://deepmind.google/blog/advanced-version-of-gemini-with...

                  [6] https://worldfinals.icpc.global/2025/openai.html

                  [7] https://arstechnica.com/ai/2025/07/exhausted-man-defeats-ai-...

                  [8] https://arxiv.org/pdf/2312.10349

dwringer a day ago

Well, I tried a variation of a prompt I was messing with in Flash 2.5 the other day in a thread about AI-coded analog clock faces. Gemini Pro 3 Preview gave me a result far beyond what I saw with Flash 2.5, and got it right in a single shot.[0] I can't say I'm not impressed, even though it's a pretty constrained example.

> Please generate an analog clock widget, synchronized to actual system time, with hands that update in real time and a second hand that ticks at least once per second. Make sure all the hour markings are visible and put some effort into making a modern, stylish clock face. Please pay attention to the correct alignment of the numbers, hour markings, and hands on the face.

[0] https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%...

  • kjgkjhfkjf 16 hours ago

    This is quite likely to be in the training data, since it's one of the projects in Wes Bos's free 30 days of Javascript course[0].

    [0] https://javascript30.com/

    • baxtr 16 hours ago

      I was under the impression that for this to work like that, the training data needs to be plentiful. One project is not enough since it’s too "sparse".

      But maybe this example was used by many other people and so it proliferated?

  • stalfie 21 hours ago

    The subtle "wiggle" animation that the second hand makes after moving doesn't fire when it hits 12. Literally unwatchable.

    • apetresc 18 hours ago

      In its defence, the code actually specifically calls that edge case out and justifies it:

          // Calculate rotations
          // We use a cumulative calculation logic mentally, but here simple degrees work because of the transition reset trick or specific animation style.
          // To prevent the "spin back" glitch at 360->0, we can use a simple tick without transition for the wrap-around,
          // but for simplicity in this specific React rendering, we will stick to standard 0-360 degrees.
          // A robust way to handle the spin-back on the second hand is to accumulate degrees, but standard clock widgets often reset.

    • skipnup 18 hours ago

      The Swiss and German railway clocks actually work the same way and stop for (half a?) second while the minute hand advances.

      https://youtu.be/wejbVtj4YR0

      • quickthrowman 16 hours ago

        Station clocks in Switzerland receive a signal from a master clock each minute that advances the minute hand, the seconds hand moves completely independent from the minute hand. This allows them to sync to the minute.

        > The station clocks in Switzerland are synchronised by receiving an electrical impulse from a central master clock at each full minute, advancing the minute hand by one minute. The second hand is driven by an electrical motor independent of the master clock. It takes only about 58.5 seconds to circle the face; then the hand pauses briefly at the top of the clock. It starts a new rotation as soon as it receives the next minute impulse from the master clock.[3] This movement is emulated in some of the licensed timepieces made by Mondaine.

        https://en.wikipedia.org/wiki/Swiss_railway_clock
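
        For the curious, a minimal sketch of that motion as a function (plain Python; the 58.5-second sweep figure comes from the quoted Wikipedia text):

            def swiss_second_hand_angle(t_in_minute, sweep=58.5):
                # Angle (degrees) of a Swiss railway-style second hand:
                # it sweeps the full circle in ~58.5 s, then waits at 12
                # until the next minute impulse resets t_in_minute to 0.
                return 360.0 * min(t_in_minute / sweep, 1.0)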

      • Severian 17 hours ago

        The video shows closer to 2 seconds for it to finally throw itself over in what could only be described as a "Thunk". I figured it would be a little more smooth.

    • scotty79 12 hours ago

      Fixed with the prompt "Second hand doesn't shake when it lands on 12, fix it." in 131 seconds. With a bunch of useState()-s and a useEffect()

  • kldg 15 hours ago

    in defense of 2.5 (Pro, at least), it was able to generate for me a metric UNIX clock as a webpage which I was amused by. it uses kiloseconds/megaseconds/etc. there are 86.4ks/day. The "seconds" hand goes around 1000 seconds, which ticks over the "hour" hand. Instead of saying 4am, you'd say it's 14.

    as a calendar or "date" system, we start at UNIX time's creation, so it's currently 1.76 gigaseconds AUNIX. You might use megaseconds as the "week" and gigaseconds more like an era, e.g. Queen Elizabeth III's reign, persisting through the entire fourth gigasecond and into the fifth. The clock also displays teraseconds, though this is just a little purple speck atm. of course, this can work off-Earth where you would simply use 88.775ks as the "day"; the "dates" a Martian and Earthling share with each other would be interchangeable.
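
    for anyone curious, a minimal sketch of the readout math in plain Python (the ks/Ms/Gs breakdown is just my reading of the scheme above; names are illustrative):

        import time

        def metric_unix(t=None):
            # seconds since the UNIX epoch
            t = int(time.time() if t is None else t)
            giga, rem = divmod(t, 1_000_000_000)  # "era"
            mega, rem = divmod(rem, 1_000_000)    # "week"-ish
            kilo, sec = divmod(rem, 1_000)        # "hour"-ish; 86.4 ks per Earth day
            return f"{giga} Gs {mega:03d} Ms {kilo:03d} ks {sec:03d} s AUNIX"

        print(metric_unix())  # currently a bit over 1.76 Gs AUNIX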

    I can't seem to get anyone interested in this very serious venture, though... I guess I'll have to wait until the 50th or so iteration of Figure, whenever it becomes useful, to be able to build a 20-foot-tall physical metric UNIX clock in my front yard.

  • malfist 16 hours ago

    That is not the same prompt as the other person was using. In particular this doesn't provide the time to set the clock to, which makes the challenge a lot simpler. This also includes javascript.

    The prompt the other person was using is:

    ``` Create HTML/CSS of an analog clock showing ${time}. Include numbers (or numerals) if you wish, and have a CSS animated second hand. Make it responsive and use a white background. Return ONLY the HTML/CSS code with no markdown formatting. ```

    Which is much more difficult.

    For what it's worth, I supplied the same prompt as the OG clock challenge and it utterly failed, not only generating a terrible clock, but doing so with a fair bit of typescript: https://ai.studio/apps/drive/1c_7C5J5ZBg7VyMWpa175c_3i7NO7ry...

  • thegrim33 21 hours ago

    "Allow access to Google Drive to load this Prompt."

    .... why? For what possible reason? No, I'm not going to give access to my privately stored file share in order to view a prompt someone has shared. Come on, Google.

    • LiamPowell 21 hours ago

      You don't want to give Google access to files you've stored in Google Drive? It's also only access to an application specific folder, not all files.

      • tibbar 19 hours ago

        Well, you also have to allow it to train on your data. Although this is not explicitly about your Google Drive data, and probably requires you to submit a prompt yourself, the barriers here are way too weak/fuzzy for me to consider granting access via any account with private info.

    • dragonwriter 11 hours ago

      I'm assuming it's because AI Studio's persisted prompts, including shared ones, are stored in Drive, and prompt sharing is implemented on top of Drive file sharing, so if AI Studio doesn't have access to Drive it doesn't have access to the shared prompt.

    • lxgr 15 hours ago

      Because most likely (at least according to Hanlon's razor) they somehow decided that using Google Drive as the only persistent storage backing AI studio was a reasonable UX decision.

      It probably makes some sense internally in big tech corporation logic (no new data storage agreements on top of the ones the user has already agreed to when signing up for Drive etc.), but as a user, I find it incredibly strange too – especially since the text chats are in some proprietary format I can't easily open on my local GDrive replica, but the images generated or uploaded just look like regular JPEGs and PNGs.

  • skybrian 21 hours ago

    It looks quite nice, though to nitpick, it has “quartz” and “design & engineering” for no reason.

    • wongarsu 20 hours ago

      Just like actual cheap but not bottom of the barrel clocks

  • pmarreck 21 hours ago

    holy shit! This is actually a VERY NICE clock!

  • dyauspitr 21 hours ago

    Having seen the page the other day this is pretty incredible. Does this have the same 2000 token limit as the other page?

    • Taek 11 hours ago

      No, and also the other page was pure HTML and CSS. This clock is using React and Javascript, so it's not a fair comparison.

    • dwringer 21 hours ago

      This isn't using the same prompt or stack as the page from that post the other day; on aistudio it builds a web app across a few different files. It's still fairly concise but I don't think it's that much so.

      • malfist 16 hours ago

        It also includes JavaScript, which was verboten in the original prompt, and doesn't specify the time the clock should be set to.

SXX 20 hours ago

Static Pelican is boring. First attempt:

Generate SVG animation of following:

1 - There is High fantasy mage tower with a top window a dome

2 - Green goblin come in front of tower with a torch

3 - Grumpy old mage with beard appear in a tower window in high purple hat

4 - Mage sends fireball that burns goblin and all screen is covered in fire.

Camera view must be from behind of goblin back so we basically look at tower in front of us:

https://codepen.io/Runway/pen/WbwOXRO

  • SXX 19 hours ago

    After a few more attempts, a longer animation with a story from my gamedev-inspired mind:

    https://codepen.io/Runway/pen/zxqzPyQ

    PS: but yeah, that's attempt #20 or something.

    • arjie 17 hours ago

      This is bloody magical. I cannot believe it.

    • fatty_patty89 19 hours ago

      Seizure warning for the above link

      edit: flashing lights at the end seem to be mostly because of the Dark Reader extension

    • nyantaro1 17 hours ago

      we are so cooked

      • camdenreslink 15 hours ago

        That SVG is impressive, but wouldn’t be usable in a real product as-is.

    • hollowturtle 11 hours ago

      Wow, it looks like total shit and would eventually be very hard to take over and actually improve, given the convoluted code it generated, YET people are impressed. What world are we living in...

      • empath75 2 hours ago

        When people figure out how to make a computer do something that it couldn't do before, that is interesting and impressive. It doesn't need to be useful.

      • Kiro 8 hours ago

        You can criticize the code, but "wow looks like total shit" is such an embarrassing thing to say considering the context. Imagine going back a few years and showing people a tool outputting this from text. No one would believe it.

        • hollowturtle 5 hours ago

          It simply isn't impressive to me at all. We had an industry (games, not web) that was the most innovative and was able to do things, and in part still is, thousands of years ahead of the slop glorified here.

      • hollowturtle 11 hours ago

        Let me double down to get even more downvotes, here's a snippet of the code:

            setTimeout(() => showSub("Ah, Earl Grey.", 2000), 1000);
            setTimeout(() => showSub("Finally some peace.", 2000), 3500);
            // Scene 2
            setTimeout(() => showSub("Armor Clanking", 2000), 7000);
            setTimeout(() => showSub("Heavy Breathing", 2000), 10000);

        If we're going to lose our jobs to this dumb slop, I'd rather be happy doing something else.

        • sejje 10 hours ago

          How would you do it?

          • Timwi 9 hours ago

            I would properly separate data and code so that I can easily change the dialogue and its timing without having to rewrite all of the numbers in all of the code.
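
              Something like this, for instance (a minimal sketch in Python for brevity, with made-up line data; the same shape works in JS): keep the dialogue in one list and derive the start times from the durations, so re-timing one line never means rewriting every number after it.

                  # Dialogue as data: (text, seconds shown). Start times are derived, not hard-coded.
                  subtitles = [
                      ("Ah, Earl Grey.", 2.0),
                      ("Finally some peace.", 2.0),
                      ("Armor Clanking", 2.0),
                      ("Heavy Breathing", 2.0),
                  ]

                  def schedule(lines, initial_delay=1.0, gap=1.5):
                      # Yield (start_time, text, duration) with cumulative start times.
                      t = initial_delay
                      for text, duration in lines:
                          yield t, text, duration
                          t += duration + gap

                  for start, text, duration in schedule(subtitles):
                      print(f"{start:5.1f}s  show {text!r} for {duration}s")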

            • DrSiemer 8 hours ago

              Your desired setup is just a single prompt away...

              • hollowturtle 5 hours ago

                  sure, let's keep piling slop on top of slop; they're not very good at de-spaghettifying code, they're good at piling on further slop

                • SXX 3 hours ago

                    LLMs are only as good at software architecture as you are.

                  • Jensson 2 hours ago

                    Isn't that the issue though?

                    • SXX 2 hours ago

                        If you are good, like, no?

                        I mean, I crafted a complete, complex game prototype using Gemini 2.5 Pro with nearly zero coding. I did it in a week: with client-server architecture, robust networking, AI, acceptance test coverage, replays.

                        It's just a different way to build software. You spend 30% of the time on specification, 30% on testing, and 30% on refactoring, also using AI.

                        Actual slop generation takes like 10% of the time, and the rest of the time you turn it into maintainable code.

                        Of course you can do it manually, but then it will take 5-10 times as long and you won't be as flexible in changing things, because with AI you can do a major refactoring in a day, whereas manually it could take weeks and kill the project.

            • jameshush 7 hours ago

              Why bother doing that when a non-engineer can just change the prompt and output a different result? :shrug:

              • hollowturtle 5 hours ago

                  right, and keep piling slop on top of slop; software will collapse with this mentality. And more importantly, the more convoluted the code gets, the more even the LLM will bail out and be unable to make further adjustments, because of bad code and context rot

      • SXX 9 hours ago

        You are missing the point of this exercise. This is not about code quality - it's about the capacity of the model to generate visuals with no guidance.

        As for the code quality, it can really be as good or as bad as you desire. In this case it is what it is because I put zero effort into it.

        • hollowturtle 5 hours ago

          Not impressive at all to me; the visuals are bad, not even a student just starting out in animation would produce that slop. You're glorifying slop. As for the code quality, it's not about styling or semantics: the techniques used are BAD and won't scale at all, e.g. setTimeout is not designed to fire at exactly that interval, it's just a suggestion. And no, it cannot be as good or bad as you desire, it's just bad. I have YET to see something better than what a first-year animation student would do. You're destroying the software industry with this mentality.

          • mlrtime 2 hours ago

            FWIW I don't agree with anything you're saying but again, I'm glad there is some debate from another side.

            I suck at writing software, like bad. I can't remember syntax at all. I couldn't write working code on a whiteboard if you asked me.

            But I do know how to solve problems very well, and I'm good at understanding what people want and don't want. I do understand logic and pseudocode.

            The code LLMs write is good enough for 99% of the things I need it for, and I'm not writing code that will be used in some life determining situation, and I'd wager that most aren't either.

            We could debate whether my code is usable/supportable long-term, by myself or others. However, I don't see how that debate would be any different if I had written it myself (worse) or somebody else had.

            • agentcoops 2 hours ago

              Yes, it’s a very narrow-minded perspective that cannot understand the second-order implications of this development beyond their own experience as an experienced developer. For argument, let’s imagine that the quality of software at the top valley firms is just phenomenal (a stretch, as we all know, even as a hypothetical). That is obviously not the case for the quality of software at 99% of firms. One could argue that the dominance of SaaS this past decade is an artifact of the software labor market: any vaguely talented engineer could easily get a ridiculously well-paid position in the valley for a firm that sold software at great margins to all the other firms that were effectively priced out of the market for engineers. I think the most interesting case study of this is actually the gaming industry, since it’s a highly technical engineering domain where margins are quickly eroded by paying the actual market wage for enough engineers to ship a good product, leading to the decline of AAA studios. Carmack’s career trajectory from gaming industry to Meta is paradigmatic of the generational shift, here.

              TLDR; in my opinion, the interesting question is less what happens at the top firms or to top engineers than what happens as the rest of the world gains access to engineering skills well above the previous floor at a reasonable price point.

              • alchemism 2 minutes ago

                Business software that is responsible for millions in revenue tends to resemble an ETL shell script more than a 3D game engine.

  • mclau153 19 hours ago

    we are returning to flash animations after 20 years

    • KingMob 4 hours ago

      Nature is healing!

      But seriously, we lost a lot when Flash was killed. It was an era of accessible animation and games like Newgrounds and Homestar Runner, that had no ready replacement.

  • sosodev 20 hours ago

    Wow, that's very impressive

  • Rudybega 20 hours ago

    Holy crap. That's actually kind of incredible for a first attempt.

tim333 16 minutes ago

Hassabis interview on Gemini 3, with Hard Fork (nyt podcast) https://youtu.be/rq-2i1blAlU?t=428 Some points -

Good at vibe coding 10:30 - step change where it's actually useful

AGI still 5-10 years. Needs reasoning, memory, world models.

Is it a bubble - partly 22:00

prodigycorp a day ago

I'm sure this is a very impressive model, but gemini-3-pro-preview is failing spectacularly at my fairly basic python benchmark. In fact, gemini-2.5-pro gets a lot closer (but is still wrong).

For reference: gpt-5.1-thinking passes, gpt-5.1-instant fails, gpt-5-thinking fails, gpt-5-instant fails, sonnet-4.5 passes, opus-4.1 passes (lesser claude models fail).

This is a reminder that benchmarks are meaningless – you should always curate your own out-of-sample benchmarks. A lot of people are going to say "wow, look how much they jumped in x, y, and z benchmark" and start to make some extrapolation about society and what this means for others. Meanwhile, I'm still wondering how they're still getting this problem wrong.

edit: I've a lot of good feedback here. I think there are ways I can improve my benchmark.

  • WhitneyLand a day ago

    >>benchmarks are meaningless

    No they’re not. Maybe you mean to say they don’t tell the whole story or have their limitations, which has always been the case.

    >>my fairly basic python benchmark

    I suspect your definition of “basic” may not be consensus. Gpt-5 thinking is a strong model for basic coding and it’d be interesting to see a simple python task it reliably fails at.

    • NaomiLehman a day ago

      they are not meaningless, but when you work a lot with LLMs and know them VERY well, then a few varied, complex prompts tell you all you need to know about things like EQ, sycophancy, and creative writing.

      I like to compare them using chathub using the same prompts

      Gemini still calls me "the architect" in half of the prompts. It's very cringe.

      • sothatsit 14 hours ago

        It’s very different to get a “vibe check” for a model than to get an actual robust idea of how it works and what it can or can’t do.

        This exact thing is why people strongly claimed that GPT-5 Thinking was strictly worse than o3 on release, only for people to change their minds later when they’ve had more time to use it and learn its strengths and weaknesses. It takes time for people to really get to grips with a new model, not just a few prompt comparisons where luck and prompt selection will play a big role.

      • mpalmer 14 hours ago

            Gemini still calls me "the architect" in half of the prompts. It's very cringe.
        
        Can't say I've ever seen this in my own chats. Maybe it's something about your writing style?

        • NaomiLehman 4 hours ago

          it absolutely does. and human employees don't call me "the architect." that's the point.

          • gregw2 an hour ago

            I wonder if under the covers it uses your word choices to infer your Myers-Briggs personality type and you are INTJ so it calls you "The Architect"?? Crazy thought but conceivable...

      • beepbooptheory 18 hours ago

        I get that one can perhaps have an intuition about these things, but doesn't this seem like a somewhat flawed attitude to have, all things considered? That is, saying something to the effect of "well I know it's not too sycophantic, no measurement needed, I have some special prompts of my own and it passed with flying colors!" just sounds a little suspect on first pass, even if it's not like totally unbelievable I guess.

  • dekhn a day ago

    Using a single custom benchmark as a metric seems pretty unreliable to me.

    Even at the risk of teaching future AI the answer to your benchmark, I think you should share it here so we can evaluate it. It's entirely possible you are coming to a wrong conclusion.

    • prodigycorp 20 hours ago

      After taking a walk for a bit I decided you’re right. I came to the wrong conclusion. Gemini 3 is incredibly powerful in some other stuff I’ve run.

      This probably means my test is a little too niche. The fact that it didn’t pass one of my tests doesn’t speak to the broader intelligence of the model per se.

      While i still believe in the importance of a personalized suite of benchmarks, my python one needs to be down weighted or supplanted.

      my bad to the google team for the cursory brush off.

      • chermi 18 hours ago

        Walks are magical. But also this reads partially like you got sent to a reeducation camp lol.

      • nomel 17 hours ago

        > This probably means my test is a little too niche.

        > my python one needs to be down weighted or supplanted.

        To me, this just proves your original statement. You can't know if an AI can do your specific task based on benchmarks. They are relatively meaningless. You must just try.

        I have AI fail spectacularly, often, because I'm in a niche field. To me, in the context of AI, "niche" is "most of the code for this is proprietary/not in public repos, so statistically sparse".

        • relaytheurgency 15 hours ago

          I feel similarly. If you're working with some relatively niche APIs on services that don't get seen by the public, the AI isn't one-shotting anything. But I still find it helpful to generate some crap that I can then feel good about fixing.

      • agentcoops 2 hours ago

        I definitely agree on the importance of personalized benchmarks for really feeling when, where and how much progress is occurring. The standard benchmarks are important, but it’s hard to really feel what a 5% improvement on X exam means beyond hype. I have a few projects across domains that I’ve been working on since ChatGPT 3 launched and I quickly give them a try on each new model release. Despite popular opinion, I could really tell a huge difference between GPT-4 and 5, but nothing compared to the current delta between 5.1 and Gemini 3 Pro…

        TLDR; I don’t think personal benchmarks should replace the official ones of course, but I think the former are invaluable for building your intuition about the rate of AI progress beyond hype.

    • lofaszvanitt 9 hours ago

      No, do not share it. The bigger black hole these models are in, the better.

  • thefourthchime a day ago

    I like to ask "Make a pacman game in a single html page". No model has ever gotten a decent game in one shot. My attempt with Gemini3 was no better than 2.5.

    • bitexploder 20 hours ago

      Something else to consider. I often have much better success with something like: Create a prompt that creates a specification for a pacman game in a single html page. Consider edge cases and key implementation details that result in bugs. <take prompt>, execute prompt. It will often yield a much better result than one generic prompt. Now that models are trained on how to generate prompts for themselves this is quite productive. You can also ask it to implement everything in stages and implement tests, and even evaluate its tests! I know that isn't quite the same as "Implement pacman on an HTML page" but still, with very minimal human effort you can get the intended result.
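
      A rough sketch of that two-stage flow (call_llm is a hypothetical stand-in for whatever API or CLI you use to query the model, not a real library function):

          def call_llm(prompt: str) -> str:
              # Placeholder: wire this up to your model of choice (API, CLI, etc.).
              raise NotImplementedError

          # Stage 1: have the model write a detailed spec, including edge cases.
          spec = call_llm(
              "Create a specification for a Pac-Man game in a single HTML page. "
              "Consider edge cases and key implementation details that commonly cause bugs."
          )

          # Stage 2: have the model implement that spec.
          page = call_llm(f"Implement the following specification exactly:\n\n{spec}")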

      • amelius 18 hours ago

        I thought this kind of chaining was already part of these systems.

    • Workaccount2 20 hours ago

      It made a working game for me (with a slightly expanded prompt), but the ghosts got trapped in the box after coming back from getting killed. A second prompt fixed it. The art and animation however was really impressive.

    • ofa0e a day ago

      Your benchmarks should not involve IP.

      • sowbug 21 hours ago

        The only intellectual property here would be trademark. No copyright, no patent, no trade secret. Unless someone wants to market the test results as a genuine Pac-Man-branded product, or otherwise dilute that brand, there's nothing should-y about it.

        • bongodongobob 20 hours ago

          It's not an ethics thing. It's a guardrails thing.

          • sowbug 19 hours ago

            That's a valid point, though an average LLM would certainly understand the difference between trademark and other forms of IP. I was responding to the earlier comment, whose author later clarified that it represented an ethical stance ("stealing the hard work of some honest, human souls").

      • ComplexSystems a day ago

        Why? This seems like a reasonable task to benchmark on.

        • adastra22 a day ago

          Because you hit guard rails.

        • ofa0e a day ago

          Sure, reasonable to benchmark on if your goal is to find out which companies are the best at stealing the hard work of some honest, human souls.

          • scragz a day ago

            correction: pacman is not a human and has no soul.

            • WhyOhWhyQ 13 hours ago

              Why do you have to willfully misinterpret the person you're replying to? There's truth in their comment.

  • sosodev a day ago

    How can you be sure that your benchmark is meaningful and well designed?

    Is the only thing that prevents a benchmark from being meaningful publicity?

    • prodigycorp a day ago

      I didn't tell you what you should think about the model. All I said is that you should have your own benchmark.

      I think my benchmark is well designed. It's well designed because it's a generalization of a problem I've consistently had with LLMs on my code. Insofar that it encapsulates my coding preferences and communication style, that's the proper benchmark for me.

      • gregsadetsky a day ago

        I asked a semi related question in a different thread [0] -- is the basic idea behind your benchmark that you specifically keep it secret to use it as an "actually real" test that was definitely withheld from training new LLMs?

        I've been thinking about making/publishing a new eval - if it's not public, presumably LLMs would never get better at them. But is your fear that generally speaking, LLMs tend to (I don't want to say cheat but) overfit on known problems, but then do (generally speaking) poorly on anything they haven't seen?

        Thanks

        [0] https://news.ycombinator.com/item?id=45968665

        • adastra22 a day ago

          > if it's not public, presumably LLMs would never get better at them.

          Why? This is not obvious to me at all.

          • gregsadetsky a day ago

            You're correct of course - LLMs may get better at any task, but I meant that publishing the evals might (optimistically speaking) help LLMs get better at the task. If the eval was actually picked up / used in the training loop, that is.

            • adastra22 21 hours ago

              That kind of “get better at” doesn’t generalize. It will regurgitate its training data, which now includes the exact answer being looked for. It will get better at answering that exact problem.

              But if you care about its fundamental reasoning and capability to solve new problems, or even just new instances of the same problem, then it is not obvious that publishing will improve this latter metric.

              Problem solving ability is largely not from the pretraining data.

              • gregsadetsky 21 hours ago

                Yeah, great point.

                I was considering working on the ability to dynamically generate eval questions whose solutions would all involve problem solving (and a known, definitive answer). I guess that this would be more valuable than publishing a fixed number of problems with known solutions. (and I get your point that in the end it might not matter because it's still about problem solving, not just rote memorization)

  • benterix a day ago

    > This is a reminder that benchmarks are meaningless – you should always curate your own out-of-sample benchmarks.

    Yeah, I have my own set of tests and the results are a bit unsettling in the sense that sometimes older models outperform newer ones. Moreover, the results change even if officially the model doesn't change. This is especially true of Gemini 2.5 Pro, which was performing much better on the same tests several months ago than it is now.

    • Archer6621 7 hours ago

      I wonder whether it could be related to some kind of over-fitting, i.e. a prompting style that tends to work better with the older models, but performs worse with the newer ones.

    • adastra22 a day ago

      I maintain a set of prompts and scripts for development using Claude Code. They are still all locked to using Sonnet 4 and Opus 4.1, because Sonnet 4.5 is flaming hot garbage. I’ve stopped trusting the benchmarks for anything.

    • Iulioh a day ago

      A lot of newer models are geared towards efficiency, and if you add the fact that more efficient models are trained on the output of less efficient (but more accurate) models...

      GPT4/3o might be the best we will ever have

  • ddalex a day ago

    I moved from using the model for Python coding to Golang coding and got incredible speedups in getting to a correct version of the code

    • layer8 20 hours ago

      Is observed speed meaningful for a model preview? Isn’t it likely to go down once usage goes up?

  • mring33621 a day ago

    I agree that benchmarks are noise. I guess, if you're selling an LLM wrapper, you'd care, but as a happy chat end-user, I just like to ask a new model about random stuff that I'm working on. That helps me decide if I like it or not.

    I just chatted with gemini-3-pro-preview about an idea I had and I'm glad that I did. I will definitely come back to it.

    IMHO, the current batch of free, free-ish models are all perfectly adequate for my uses, which are mostly coding, troubleshooting and learning/research.

    This is an amazing time to be alive and the AI bubble doomers that are costing me some gains RN can F-Off!

  • t0mas88 17 hours ago

    Google reports a lower score for Gemini 3 Pro on SWEBench than Claude Sonnet 4.5, which is comparing a top tier model with a smaller one. Very curious to see whether there will be an Opus 4.5 that does even better.

  • testartr a day ago

    and models are still pretty bad at playing tic-tac-toe, they can do it, but think way too much

    it's easy to focus on what they can't do

    • big-and-small a day ago

      Everything is about context. When you just ask a non-concrete task, it still has to parse your input and figure out what tic-tac-toe means in this context and what exactly you expect it to do. This is why all the "thinking".

      Ask it to implement tic-tac-toe in Python for the command line. Or even just bring your own tic-tac-toe code.

      Then make it imagine playing against you and it's gonna be fast and reliable.

      • testartr 16 hours ago

        The prompt was very concrete: draw a tic-tac-toe ASCII table and let's play. Gemini 2.5 thought for pages about particular moves.

  • Rover222 a day ago

    curious if you tried grok 4.1 too

  • Filligree a day ago

    What's the benchmark?

    • ahmedfromtunis a day ago

      I don't think it would be a good idea to publish it on a prime source of training data.

      • mlrtime 2 hours ago

        This sounds like paranoia to me, to be honest. Please tell me I'm wrong.

        I could have easily made the same claim without the benchmark existing at all, since no one can see it.

        Maybe if we weren't anonymous and your profile led to credentials showing you have experience in this field; otherwise I don't believe it without seeing/testing it myself.

      • Hammershaft a day ago

        He could post an encrypted version and post the key with it to avoid it being trained on?

        • benterix a day ago

          What makes you think it wouldn't end up in the training set anyway?

        • rs186 21 hours ago

          I wouldn't underestimate the intelligence of agentic AI, despite how stupid they are today.

        • stefs 16 hours ago

          Every AI corp has people reading HN.

      • shawabawa3 4 hours ago

        but they've asked all the AI models this question. Whatever you tell an AI model is also in its training data

    • petters a day ago

      Good personal benchmarks should be kept secret :)

      • mlrtime 2 hours ago

        why?

        • pclmulqdq an hour ago

          Avoiding contamination is very useful when you want an honest evaluation of something.

    • prodigycorp a day ago

      nice try!

      • ankit219 20 hours ago

        you already sent the prompt to gemini api - and they likely recorded it. So in a way they can access it anyway. Posting here or not would not matter in that aspect.

  • mupuff1234 a day ago

    Could also just be rollout issues.

    • prodigycorp a day ago

      Could be. I'll reply to my comment later with pass/fail results of a re-run.

  • luckydata a day ago

    I'm dying to know what you're giving it that it's choking on. It's actually really impressive if that's the case.

    • nomel 17 hours ago

      I find this hard to understand. I have AI completely choke on my code constantly. What are you doing where it performs so well? Web?

      I constantly see failures in trivial vector projections, broken bash scripts that don't properly quote variables (they fail if there's a space in a filename), and a near complete inability to do relatively basic image processing tasks (if they don't rely on template matches).

      I accidentally spent $50 on Gemini 2.5 Pro last week, with Roo, trying to make a simple mock interface for some lab equipment. The result: it asked permission to delete everything it did and start over...

  • m00dy a day ago

    that's why everyone using AI for code should code in rust only.

simonw 19 hours ago

Here are my notes and pelican benchmark, including a new, harder benchmark because the old one was getting too easy: https://simonwillison.net/2025/Nov/18/gemini-3/

  • torginus 16 hours ago

    Considering how important this benchmark has become to the judgement of state-of-the-art AI models, I imagine each AI lab has a dedicated 'pelican guy', a highly accomplished and academically credentialed person, who's working around the clock on training the model to make better and better SVG pelicans on bikes.

  • skylurk 18 hours ago

    They've been training for months to draw that pelican, just for you to move the goalposts.

    • oblio 10 hours ago

      It's a pelican on a bike, not a goalpost. And bikes move. Well, pelicans move, too.

  • mtrovo 18 hours ago

    It's interesting that you mentioned on a recent post that saturation on the pelican benchmark isn't a problem because it's easy to test for generalization. But now looking at your updated benchmark results, I'm not sure I agree. Have the main labs been climbing the Pelican on a bike hill in secret this whole time?

  • Thrymr 18 hours ago

    Considering how many other "pelican riding a bicycle" comments there are in this thread, it would be surprising if this was not already incorporated in the training data. If not now, soon.

    • Workaccount2 16 hours ago

      I don't think the big labs would waste their time on it. If a model is great at making the pelican but sucks at all other svg it becomes obvious. But so far the good pelicans are strong indicators of good general SVG ability.

      Unless training on the pelican increases all SVG ability, then good job.

      • oceansky 15 hours ago

        I absolutely think they would given the amount of money and hype being pumped into it.

  • libraryofbabel 18 hours ago

    I was interested (and slightly disappointed) to read that the knowledge cutoff for Gemini 3 is the same as for Gemini 2.5: January 2025. I wonder why they didn't train it on more recent data.

    Is it possible they use the same base pre-trained model and just fine-tuned and RL-ed it better (which, of course, is where all the secret sauce training magic is these days anyhow)? That would be odd, especially for a major version bump, but it's sort of what having the same training cutoff points to?

    • simonw 18 hours ago

      The model card says: https://storage.googleapis.com/deepmind-media/Model-Cards/Ge...

      > This model is not a modification or a fine-tune of a prior model.

      I'm curious why they decided not to update the training data cutoff date too.

      • stocksinsmocks 18 hours ago

        Maybe that date is a rule of thumb for when AI generated content became so widespread that it is likely to have contaminated future data. Given that people have spoofed authentic Reddit users with Markov chains, it probably doesn’t go back nearly far enough.

  • tkgally 14 hours ago

    I updated my benchmark of 30 pelican-bicycle alternatives that I posted here a couple of weeks ago:

    https://gally.net/temp/20251107pelican-alternatives/index.ht...

    There seem to be one or two parsing errors. I'll fix those later.

    • hbbio 7 hours ago

      You should add ChatGPT.

      I tried the first one and 5 Pro gives this: https://imgur.com/a/EhYroCE

      • tkgally 3 hours ago

        Thanks for the suggestion. I’m not sure why I didn’t include an OpenAI model in my first round. Here’s the updated page with GPT-5.1 results added:

        https://gally.net/temp/20251107pelican-alternatives/index.ht...

        As your example shows, GPT-5 Pro would probably be better than GPT-5.1, but the tokens are over ten times more expensive and I didn’t feel like paying for them.

abixb 25 minutes ago

Okay, Gemini 3.0 Pro has officially surpassed Claude 4.5 (and GPT-5.1) as the top ranked model based on my private evals (multimodal reasoning w/ images/audio files and solving complex Caesar/transposition ciphers, etc.).

Claude 4.5 solved it as well (the Caesar/transposition ciphers), but Gemini 3.0 Pro's method and approach was a lot more elegant. Just my $0.02.

ttul a day ago

My favorite benchmark is to analyze a very long audio file recording of a management meeting and produce very good notes along with a transcript labeling all the speakers. 2.5 was decently good at generating the summary, but it was terrible at labeling speakers. 3.0 has so far absolutely nailed speaker labeling.

  • rfw300 20 hours ago

    My audio experiment was much less successful — I uploaded a 90-minute podcast episode and asked it to produce a labeled transcript. Gemini 3:

    - Hallucinated at least three quotes (that I checked) resembling nothing said by any of the hosts

    - Produced timestamps that were almost entirely wrong. Language quoted from the end of the episode, for instance, was timestamped 35 minutes into the episode, rather than 85 minutes.

    - Almost all of what is transcribed is heavily paraphrased and abridged, in most cases without any indication.

    Understandable that Gemini can't cope with such a long audio recording yet, but I would've hoped for a more graceful/less hallucinatory failure mode. And unfortunately, aligns with my impression of past Gemini models that they are impressively smart but fail in the most catastrophic ways.

    • Rudybega 18 hours ago

      I wonder if you could get around this with a slightly more sophisticated harness. I suspect you're running into context length issues.

      Something like

      1.) Split audio into multiple smaller tracks.

      2.) Perform a first-pass audio extraction.

      3.) Find unique speakers and other potentially helpful information (maybe just a short summary of where the conversation left off).

      4.) Seed the next stage with that information (yay multimodality) and generate the audio transcript for it.

      Obviously it would be ideal if a model could handle the ultra long context conversations by default, but I'd be curious how much error is caused by a lack of general capability vs simple context pollution.
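
      For what it's worth, a rough sketch of that kind of chunked harness (transcribe_chunk is a hypothetical placeholder for whatever audio model call you use, e.g. a Gemini request; it's not a real client-library function):

          def transcribe_chunk(audio_path, known_speakers, recap):
              # Placeholder: send the chunk plus prior context to your audio model and
              # return (labeled_transcript, speaker_labels, short_summary).
              raise NotImplementedError

          def transcribe_long_audio(chunk_paths):
              speakers, recap, parts = set(), "", []
              for path in chunk_paths:
                  transcript, labels, summary = transcribe_chunk(path, speakers, recap)
                  speakers |= set(labels)   # carry known speakers forward
                  recap = summary           # seed the next chunk with where we left off
                  parts.append(transcript)
              return "\n".join(parts)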

    • satvikpendem 7 hours ago

      Now try an actual speech model like ElevenLabs or Soniox, instead of something that wasn't made for it.

    • ant6n 19 hours ago

      The worst is when it fails to ingest simple PDF documents and then lies and gaslights in an attempt to cover it up. Why not just admit you can’t read the file?

      • nomel 16 hours ago

        This is specifically why I don't use Gemini. The gaslighting is ridiculous.

  • satvikpendem 21 hours ago

    I'd do the transcript and the summary parts separately. Dedicated audio models from vendors like ElevenLabs or Soniox use speaker detection models to produce an accurate speaker-based transcript, while I'm not sure that Google's models do so; maybe they just hallucinate the speakers instead.

    • trvz 7 hours ago

      Agreed. I don’t see the need for Gemini to be able to do this task, although it should be able to offload it to another model.

  • iagooar a day ago

    What prompt do you use for that?

    • gregsadetsky a day ago

      I just tried "analyze this audio file recording of a meeting and notes along with a transcript labeling all the speakers" (using the language from the parent's comment) and indeed Gemini 3 was significantly better than 2.5 Pro.

      3 created a great "Executive Summary", identified the speakers' names, and then gave me a second by second transcript:

          [00:00] Greg: Hello.
          [00:01] X: You great?
          [00:02] Greg: Hi.
          [00:03] X: I'm X.
          [00:04] Y: I'm Y.
          ...
      
      Super impressive!

      • HPsquared a day ago

        Does it deduce everyone's name?

        • gregsadetsky a day ago

          It does! I redacted them, but yes. This was a 3-person call.

    • punnerud a day ago

      I made a simple webpage to grab text from YouTube videos (https://summynews.com). Great for this kind of testing? (I want to expand to other sources in the long run.)

  • renegade-otter 21 hours ago

    It's not even THAT hard. I am working on a side project that gets a podcast episode and then labels the speakers. It works.

  • valtism a day ago

    Parakeet TDT v3 would be really good at that

    • kridsdale3 15 hours ago

      Yes, this is the best solution for that goal. Use the MacWhisper app + Parakeet 3.

Workaccount2 a day ago

It still failed my image identification test ([a photoshopped picture of a dog with 5 legs]...please count the legs) that so far every other model has failed agonizingly, even failing when I tell them they are failing, and they tend to fight back at me.

Gemini 3 however, while still failing, at least recognized the 5th leg, but thought the dog was...well endowed. The 5th leg however is clearly a leg, despite being where you would expect the dog's member to be. I'll give it half credit for at least recognizing that there was something there.

Still though, there is a lot of work that needs to be done on getting these models to properly "see" images.

  • GuB-42 an hour ago

    > Gemini 3 however, while still failing, at least recognized the 5th leg, but thought the dog was...well endowed.

    I see that AI is reaching the level of a middle school boy...

  • recitedropper 21 hours ago

    Perception seems to be one of the main constraints on LLMs that not much progress has been made on. Perhaps not surprising, given perception is something evolution has worked on since the inception of life itself. Likely much, much more expensive computationally than it receives credit for.

    • Workaccount2 21 hours ago

      I strongly suspect it's a tokenization problem. Text and symbols fit nicely in tokens, but having something like a single "dog leg" token is a tough problem to solve.
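
      To make the tokenization point concrete: vision transformers typically just slice an image into a fixed grid of patches and embed each patch, so an object like a leg gets smeared across several unrelated "tokens". A rough numpy sketch of that patching step (patch size and shapes are illustrative, not what Gemini actually does internally):

          import numpy as np

          def patchify(image: np.ndarray, patch: int = 16) -> np.ndarray:
              """Split an (H, W, C) image into a sequence of flattened patch 'tokens'."""
              h, w, c = image.shape
              h, w = h - h % patch, w - w % patch  # drop edge pixels that don't fit the grid
              img = image[:h, :w]
              tokens = (img.reshape(h // patch, patch, w // patch, patch, c)
                           .transpose(0, 2, 1, 3, 4)
                           .reshape(-1, patch * patch * c))
              return tokens  # shape: (num_patches, patch*patch*C)

          img = np.zeros((224, 224, 3), dtype=np.float32)
          print(patchify(img).shape)  # (196, 768): one "token" per 16x16 patch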

      • stalfie 20 hours ago

        The neural network in the retina actually pre-processes visual information into something akin to "tokens". Basic shapes that are probably somewhat evolutionarily preserved. I wonder if we could somehow mimic those for tokenization purposes. Most likely there's someone out there already trying.

        (Source: "The mind is flat" by Nick Chater)

        • machiaweliczny 20 hours ago

          It's also easy to spot: when you are tired you might misrecognize objects. I've caught myself doing this on long road trips.

      • recitedropper 18 hours ago

        I think in this case, tokenization and perception are somewhat analogous. It is probably the case that our current tokenization schemes are really simplistic compared to what nature is working with, if you allow the analogy.

    • orly01 21 hours ago

      Why should it have to be expensive computationally? How do brains do it with such a low amount of energy? I think matching the brain abilities even of a bug might be very hard, but that does not mean that there isn't a way to do it with little computational power. It requires having the correct structures/models/algorithms or whatever is the precise jargon.

      • nomel 16 hours ago

        > How do brains do it with such a low amount of energy?

        Physical analog chemical circuits whose physical structure directly is the network, and use chemistry/physics directly for the computations. For example, a sum is usually represented as the number of physical ions present within a space, not some ALU that takes in two binary numbers, each with some large number of bits, requiring shifting electrons to and from buckets, with a bunch of clocked logic operations.

        There are a few companies working on more "direct" implementations of inference, like Etched AI [1] and IBM [2], for massive power savings.

        [1] https://en.wikipedia.org/wiki/Etched_(company)

        [2] https://spectrum.ieee.org/neuromorphic-computing-ibm-northpo...

      • recitedropper 20 hours ago

        This is the million dollar question. I'm not qualified to answer it, and I don't really think anyone out there has the answer yet.

        My armchair take would be that watt usage probably isn't a good proxy for computational complexity in biological systems. A good piece of evidence for this is from the C. elegans research that has found that the configuration of ions within a neuron--not just the electrical charge on the membrane--record computationally-relevant information about a stimulus. There are probably many more hacks like this that allow the brain to handle enormous complexity without it showing up in our measurements of its power consumption.

        • programd 16 hours ago

          My armchair is equally comfy, and I have an actual paper to point to:

          Jaxley: Differentiable simulation enables large-scale training of detailed biophysical models of neural dynamics [1]

          They basically created software to simulate real neurons and ran some realistic models to replicate typical AI learning tasks:

          "The model had nine different channels in the apical and basal dendrite, the soma, and the axon [39], with a total of 19 free parameters, including maximal channel conductances and dynamics of the calcium pumps."

          So yeah, real neurons are a bit more complex than ReLU or Sigmoid.

          [1] https://www.biorxiv.org/content/10.1101/2024.08.21.608979v2....

        • nick32661123 19 hours ago

          Following the trend of discovering smaller and smaller phenomena that our brains use for processing, it would not be surprising if we eventually find that our brains are very nearly "room temperature" quantum computers.

  • column 5 hours ago

    "[a photoshopped picture of a dog with 5 legs]...please count the legs"

    Meanwhile you could benchmark for something actually useful. If you're about to say "But that means it won't work for my use case of identifying a person on a live feed" or whatever, then why don't you test that? I really don't understand the kick people get out of successfully tricking LLMs on non-productive tasks with no real-world application. Just like the "how many r in strawberry?", "uh uh uh it says two urh urh".. ok but so what? What good is a benchmark that is so far from a real use case?

    • tngranados 18 minutes ago

      The point of a benchmark like that is checking for hallucinations and overfitting. Does the model actually check the picture to count the legs, or does it just see it's a dog and answer four because it knows dogs usually have four legs?

      It's a perfectly valid benchmark and very telling.

  • lukebechtel a day ago

    ah interesting. I wonder if this is a "safety guardrails blindspot" due to the placement.

    • achow 21 hours ago

      Op is right.

      https://imgcdn.stablediffusionweb.com/2024/4/19/8e54925a-004...

      For the above pic I asked "What is wrong with the image?"

      Result: - It totally missed the most obvious one - six fingers.

      Instead it said this:

      Based on a close inspection of the image, here are the things that are "wrong" or anomalous, indicating that it is likely an AI-generated image:

      - The Thumb Anatomy: This is the most obvious error. The digit in the thumb's position (far left) looks exactly like a long index finger. A natural human thumb has two segments (phalanges) and is shorter and thicker, opposing the other fingers. This digit appears to have three segments and is roughly the same length and shape as the other fingers, giving the hand the appearance of having five fingers and no thumb.

      - Unnatural Skin Texture: The skin is unnaturally smooth, waxy, and free of normal human imperfections like distinct pores, fine wrinkles, or hair. It has a "plastic" or "airbrushed" look typical of generative AI.

      - Palm Lines: The creases on the palm are vague and generic. They don't follow the distinct pattern of major flexion creases (like the heart, head, and life lines) found on a real human hand.

      - Wrist Connection: The connection between the hand and the wrist looks slightly distorted, with tendons that appear too prominent and smooth for the pose.

      [Edit: 3.0 is same as 2.5 - both answered almost identically]

falcor84 18 hours ago

I love it that there's a "Read AI-generated summary" button on their post about their new AI.

I can only expect that the next step is something like "Have your AI read our AI's auto-generated summary", and so forth until we are all the way at Douglas Adams's Electric Monk:

> The Electric Monk was a labour-saving device, like a dishwasher or a video recorder. Dishwashers washed tedious dishes for you, thus saving you the bother of washing them yourself; video recorders watched tedious television for you, thus saving you the bother of looking at it yourself. Electric Monks believed things for you, thus saving you what was becoming an increasingly onerous task, that of believing all the things the world expected you to believe.

- from "Dirk Gently's Holistic Detective Agency"

  • davedigerati 17 hours ago

    Excellent reference. I tried to name an AI project at work Electric Monk, but it was too 'controversial'

    Had to change to Electric Mentor....

  • egeozcan 5 hours ago

    I'm afraid they will finish "The Salmon of Doubt" with AI and sell it to future generations with a very small disclaimer stating it's inspired by Douglas Adams.

    The possibility was already a topic in the series "Mozart in the jungle" where they made a robot which supposedly finished the Requiem piece by Mozart.

  • mikepurvis 17 hours ago

    SMBC had a pretty great take on this: https://www.smbc-comics.com/comic/summary

    • SchemaLoad 12 hours ago

      There was another comic where one worker uses AI to turn their prompt in to a verbose email, then on the receiver side they use AI to turn the verbose email in to a short summary.

      • drstewart 3 hours ago

        This one isn't a joke. 90% of documents produced at work are now AI generated, and nobody can keep up with the volume so they just summarise them with AI.

        What are we even doing.

    • AstroBen 16 hours ago

      This feels too real to laugh at

  • xeonmc 16 hours ago

    Now let’s hope that it will also save labour on resolving cloud infrastructure downtimes too.

  • tonyhart7 16 hours ago

    After outsourcing the developer jobs, we can outsource all the manager jobs too, leaving the CEO with agentic AI code as their servant

    • aussieguy1234 16 hours ago

      Not sure what you mean here, but the only real jobs at risk from AI right now are middle/upper management.

      Not a single engineer has ever been laid off because of AI. Any company claiming this is the case is trying to cover up bad decisions.

      "Were automating with AI" sounds better to investors than "We over hired and now need to downsize" or "We made some bad market bets, now need to free up cash flow"

      • stinkbeetle 15 hours ago

        > Not sure what you mean here, but the only real jobs at risk from AI right now are middle/upper management.

        > Not a single engineer has ever been laid off because of AI. Any company claiming this is the case is trying to cover up bad decisions.

        I don't suppose these assertions are based on anything. If "AI" reduces the amount of time an engineer spends writing crud, boilerplate, test cases, random scripts, etc., and they have 5% more time to do other things, then all else being equal a project can be done with 5% fewer engineers.

        Does AI result in greater productivity for engineers, and does greater productivity per person mean demand can be satisfied with fewer people?

        • judahmeek 14 hours ago

          > Does AI result in greater productivity for engineers, and does greater productivity per person mean demand can be satisfied with fewer people?

          Between the disagreements regarding performance metrics, the fact that AI will happily increase its own scope of work as well as facilitate increasing any task's, sprint's, or project's scope of work, and Jevons Paradox, the world may never know the answer to either of these questions.

        • aussieguy1234 14 hours ago

          It does improve productivity, just like a good IDE. But engineers didn't get replaced by IDEs and they haven't yet been replaced by AI.

          By the time it's good enough to replace actual engineers, any job done in front of a computer will be at risk. I'm hoping that will happen at the same time as AI embodiment in robots; then every job will be automated, not just computer-based ones.

          • stinkbeetle 14 hours ago

            Your assertion was not that "an engineer has never been replaced by AI". It is that no engineer has been laid off because of AI.

            You agree AI improves engineer productivity. So last remaining question is, does greater productivity mean that fewer people are required to satisfy a given demand?

            The answer is yes of course. So at this point, supporting the assertion requires handwaving about shortages and induced demand and demand for engineers to develop and support AI and so on. Which are all reasonable, but it should become pretty apparent that you can't be confident in an assertion like that. I would say it's pretty likely that AI has resulted in engineers being laid off in specific instances if not the net numbers.

            • tonyhart7 8 hours ago

              This is true.

              An AI-powered developer packs 3x the workload of a "traditional" dev into one single developer.

              Therefore the company doesn't need to hire 3 people; as a result, it literally cuts the job count.

      • tonyhart7 15 hours ago

        "Not a single engineer has ever been laid off because of AI."

        Are you insane??? Big tech has literally made some of its biggest layoffs over the past few months

        • slightwinder 15 minutes ago

          But not because of AI; they only use that as a pretext for normal layoffs. Sometimes they also use it to hire cheaper workers fresh from school or from a cheaper country, so they're just replacing expensive seniors.

        • aussieguy1234 14 hours ago

          That's because of overhiring and other non-AI-related reasons (e.g. higher interest rates mean less VC funding available).

          In reality, getting AI to do actual human work, as of the moment, takes much more effort and cost than you get back in cost savings. These companies will claim they are using AI, even if it's just a few engineers using Windsurf.

          The companies claim AI is the reason they laid off engineers to make it look like they're innovating, not downsizing, which makes them look better in the eyes of investors and shareholders.

          • wiz21c 2 hours ago

            In my own experience, using Claude gives me about a 5-10% productivity increase because it's really good at writing boilerplate code or surgically modifying some code I didn't write.

tylervigen a day ago

I am personally impressed by the continued improvement in ARC-AGI-2, where Gemini 3 got 31.1% (vs ChatGPT 5.1's 17.6%). To me this is the kind of problem that does not lend itself well to LLMs - many of the puzzles test the kind of thing that humans intuit because of millions of years of evolution, but these concepts do not necessarily appear in written form (or when they do, it's not clear how they connect to specific ARC puzzles).

The fact that these models can keep getting better at this task given the setup of training is mind-boggling to me.

The ARC puzzles in question: https://arcprize.org/arc-agi/2/

  • stephc_int13 21 hours ago

    What I would do if I were in the position of a large company in this space is set up an internal team to create an ARC replica covering very similar puzzles, and use that as part of the training.

    Ultimately, most benchmarks can be gamed and their real utility is thus short-lived.

    But I also think it is fair to use any means to beat it.

    • tylervigen 21 hours ago

      I agree that for any given test, you could build a specific pipeline to optimize for that test. I suppose that's why it is helpful to have many tests.

      However, many people have worked hard to optimize tools specifically for ARC over many years, and it's proven to be a particularly hard test to optimize for. This is why I find it so interesting that LLMs can do it well at all, regardless of whether tests like it are included in training.

      • stephc_int13 18 hours ago

        The real strength of current neural nets/transformers relies on huge datasets.

        ARC does not provide this kind of dataset, only a small public one and a private one where they run the benchmarks.

        Building your own large private ARC set does not seem too difficult if you have enough resources.

        • egeozcan 5 hours ago

          How can they keep it private? It's not like they can run these models locally. Do the providers promise not to peek when they are testing?

    • Blamklmo 16 hours ago

      Doesn't even matter at this point.

      We have a global RL pipeline on our hands.

      If there is something new that an LLM/AI model can't solve today, plenty of humans can't solve it either.

      But tomorrow every LLM/AI model can solve it, and again plenty of humans still can't.

      Even if AGI is just the sum of companies adding more and more training data, as long as this learning pipeline becomes faster and easier to train with new scenarios, it will start to bleed humans out of the loop.

    • benlivengood 15 hours ago

      That's ok; just start publishing your real problems to solve as "AI benchmarks" and then it'll work in ~6 months.

    • AstroBen 19 hours ago

      Is "good at benchmarks instead of real world tasks" really something to optimize for? What does this achieve? Surely people would be initially impressed, try it out, be underwhelmed and then move on. That's not great for Google

      • nomel 17 hours ago

        If they're memory/reference constrained systems that can't directly "store" every solution, then doing well on benchmarks should result in better real world/reasoning performance, since lack of memorized answer requires understanding.

        Like with humans [1], generalized reasoning ability lets you skip the direct storage of that solution, and many many others, completely! You can just synthesize a solution when a problem is presented.

        [1] https://www.youtube.com/watch?v=f58kEHx6AQ8

      • stephc_int13 18 hours ago

        Benchmarks are intended as a proxy for real usage, and they are often useful for incrementally improving a system, especially when the end-goal is not well-defined.

        The trick is to not put more value in the score than what it is.

      • spprashant 18 hours ago

        Initial impressions are currently worth a lot. In the long run I think the moat will dissolve, but currently it's a race to lock users in to your model and make switching costs high.

    • simpsond 21 hours ago

      Humans study for tests. They just tend to forget.

    • riku_iki 16 hours ago

      > internal team to create an ARC replica, covering very similar puzzles

      They can target the benchmark directly, not just a replica. If Google or OAI are bad actors, they already have benchmark data from previous runs.

      • energy123 15 hours ago

        The 'private' set is just a pinkie promise not to store logs or not to use the logs when the evaluator uses the API to run the test, so yeah. It's trivially exploitable.

        Not only do you have the financial self-interest to do it (helps with capital raising to be #1), but you are worried that your competitors are doing it, so you may as well cheat to make things fair. Easy to do and easy to justify.

        Maybe a way to make the benchmark more robust to this adversarial environment is to introduce noise and random red herrings into the question, and run the test 20 times and average the correctness. So even if you assume they're training on it, you have some semblance of a test still happening. You'd probably end up with a better benchmark anyway which better reflects real-world usage, where there's a lot of junk in the context window.
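
        A rough sketch of what that protocol could look like, with a stand-in ask_model function and toy red herrings (every name here is made up for illustration):

            import random

            RED_HERRINGS = [
                "Note: the office coffee machine is broken.",
                "Unrelated fact: the meeting moved to Tuesday.",
            ]

            def perturb(question: str, rng: random.Random) -> str:
                """Inject junk into the prompt so memorized answers are less useful."""
                noise = rng.choice(RED_HERRINGS)
                return f"{noise}\n\n{question}"

            def ask_model(prompt: str) -> str:
                raise NotImplementedError("call the model under test here")

            def robust_score(question: str, expected: str, runs: int = 20, seed: int = 0) -> float:
                rng = random.Random(seed)
                correct = sum(ask_model(perturb(question, rng)).strip() == expected
                              for _ in range(runs))
                return correct / runs  # average correctness across perturbed attempts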

        • riku_iki 15 hours ago

          They have two sets:

          - semi-private: used to test proprietary models, and which could be leaked

          - private: used to test downloadable open source models.

          The ARC-AGI prize itself is for open source models.

          • stephc_int13 13 hours ago

            My point is that it does not matter if the set is private or not.

            If you want to train your model you'd need more data than the private set anyway. So you have to build a very large training set on your own, using the same kind of puzzles.

            It is not that hard, really, just tedious.

            • ld4nt3 9 hours ago

              Yes, you can build your own dataset of n puzzles, but it was still really hard for any system to achieve any score. It even beats systems specialized for just this one task, and these puzzles shouldn't really be possible to memorize given the number of variations that can be created.

  • grantpitt a day ago

    Agreed, it also leads in performance on ARC-AGI-1. Here's the leaderboard where you can toggle between ARC-AGI-1 and 2: https://arcprize.org/leaderboard

    • energy123 16 hours ago

      It leads on arc-agi-1 with Gemini 3.0 Deep Think, which uses "tool calls" according to google's post, whereas regular Gemini 3.0 Pro doesn't use "tool calls" for the same benchmark. I am unsure how significant this difference is.

  • HarHarVeryFunny 19 hours ago

    There's a good chance Gemini 3 was trained on ARC-AGI problems, unless they state otherwise.

    • knowriju 15 hours ago

      ARC-AGI has a hidden private test suite, right ? No model will have access to that set.

      • variadix 15 hours ago

        I doubt they have offline access to the model, i.e. the prompts are sent to the model provider.

        • xlbuttplug2 13 hours ago

          Even if the prompts are technically leaked to the provider, how would they be identified as something worth optimizing for out of the millions of other prompts received?

    • ld4nt3 9 hours ago

      It's almost certain that it was, but the point of this puzzle benchmark is that it shouldn't really be possible to just memorize it, given the number of variations that can be created and the other criteria detailed in it.

  • m3kw9 17 hours ago

    That looks great, but we all care how it translates to real-world problems like programming, where it isn't really excelling by 2x.

ponyous a day ago

Just generated a bunch of 3D CAD models using Gemini 3.0 to see how it compares in spatial understanding and it's heaps better than anything currently out there - not only intelligence but also speed.

Will run extended benchmarks later, let me know if you want to see actual data.

  • lfx a day ago

    Just hand sketched what a 5 year old would do on paper - the house, trees, sun - and asked it to generate a 3D model with three.js.

    Results are amazing! 2.5 and 3 seem way, way ahead.

    • ponyous a day ago

      Based on my benchmarks (run 100s of model generations).

      2.5 stands between GPT-5 and GPT-5.1, where GPT-5 is the best of the 3.

      In preliminary evals Gemini 3 seems to be way better than all, but I will know when I run extended benchmarks tonight.

  • mindlessg 20 hours ago

    I'm interested in seeing the data.

  • layer8 20 hours ago

    Is observed speed meaningful for a model preview? Isn’t it likely to go down once usage goes up?

  • giancarlostoro a day ago

    I'm not familiar enough with CAD what type of format is it?

    • ponyous a day ago

      It’s not a format, but in my mind it implies designs that are supposed to be functional as opposed to models that are meant for virtual games.

      It generated a blender script that makes the model.
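
      For anyone curious, the generated scripts are typically just bpy calls along these lines (a trivial hand-written example of the style, not the actual generated output):

          import bpy

          # start from an empty scene
          bpy.ops.object.select_all(action="SELECT")
          bpy.ops.object.delete()

          # a crude "bracket": a flat base plate with a cylindrical boss on top
          bpy.ops.mesh.primitive_cube_add(size=1, location=(0, 0, 0.05))
          base = bpy.context.active_object
          base.scale = (2.0, 1.0, 0.1)  # 2 x 1 x 0.1 units after scaling

          bpy.ops.mesh.primitive_cylinder_add(radius=0.3, depth=0.6, location=(0, 0, 0.4))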

      • adastra22 a day ago

        I would have used OpenSCAD for that purpose.

        • ponyous a day ago

          I started with a lighter weight solution (JSCAD) first and quickly hit the limitations. So I wanted to explore the other side of it - fully complex over the top software (blender).

          I guess openscad would be a sweet spot in the middle. Good shout, might experiment.

          • adastra22 a day ago

            Blender is not CAD. Edit: I'm not nit picking. Totally different data structures and internal representations.

            • ponyous a day ago

              Computer aided design. Three.js can be CAD. But I agree it's not meant for CAD, even though you can do it.

              • adastra22 21 hours ago

                Three.js is not CAD. It is an API for drawing 3D graphics in a browser. 3D graphics, in general, is not CAD. Blender is not CAD. You cannot do CAD operations in blender.

                I'm not being nit picky here. I think there are issues beyond terminology that you may not be familiar with, as it is clearly not your field. That's ok.

                The "design" in computer aided design is engineering design. This is not the same definition of "design" used in, say, graphic design. Something is not called CAD because it helps you create an image that looks like a product on a computer. It is CAD because it creates engineering design files (blueprints) that can be used for the physical manufacture of a device. This places very tight and important constraints on the methods used, and capabilities supported.

                Blender is a sculpting program. Its job is to create geometry that can be fed into a rendering program to make pretty pictures. Parasolid is a CAD geometry kernel at the core of many CAD programs, which has the job of producing manufacturable blueprints. The operations supported map to physical manufacturing steps - milling, lathe, and drill operations. The modeling steps use constraints in order to make sure, e.g., that screw holes line up. Blender doesn't support any of that.

                To an engineer, saying that an LLM gave you a blender script for a CAD operation is causing all sorts of alarm klaxons to go off.
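
                To make the contrast concrete, parametric CAD code reads less like mesh sculpting and more like a recipe of manufacturable operations. A tiny CadQuery sketch (picking CadQuery purely for illustration, it isn't mentioned above): a plate with a drilled hole, exported as STEP for downstream CAM:

                    import cadquery as cq

                    # 40 x 40 x 5 mm plate with a 6 mm through-hole in the centre of the top face
                    plate = (cq.Workplane("XY")
                               .box(40, 40, 5)
                               .faces(">Z").workplane()
                               .hole(6))

                    cq.exporters.export(plate, "plate.step")  # B-rep geometry a CAM tool can consume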

                • lillecarl 35 minutes ago

                  Where does CAM fit into your view?

                  In high school CAD/CAM we used various CAD programs for designing (sculpting?) things and then imported them into CAM to generate g code programs, set tool constraints and such

                • ponyous 5 hours ago

                  Thanks for clarifying. I'm just getting into this field.

                  If Blender can export a .3mf file and a slicer gets it ready for 3D printing (G-code that actually instructs the print head), is the slicer actually CAD software?

                  And if you can export many formats that work with some manufacturing devices, and you built a model in Blender, did Blender not help you with CAD?

                • ummonk 20 hours ago

                  Next they'll be doing PCB CAD in Photoshop...

      • bilbo0s a day ago

        Did your prompt instruct it to use blender?

        • ponyous a day ago

          Yes. I’ve been working and refining the prompt for some time now (months). It’s about 10k tokens now.

          • JulesRosser 19 hours ago

            Would you mind sharing the prompt please?

    • koakuma-chan a day ago

      When I see CAD, I always think of Casting Assistant Device.

syspec 20 hours ago

I have "unlimited" access to both Gemini 2.5 Pro and Claude 4.5 Sonnet through work.

From my experience, both are capable and can solve nearly all the same complex programming requests, but time and time again Gemini spits out reams and reams of over-engineered code that totally works, but that I would never want to have to interact with.

When looking at the code, you can't tell why it looks "gross", but then you ask Claude to do the same task in the same repo (I use Cline, it's just a dropdown change) and the code also works, but there's a lot less of it and it has a more "elegant" feeling to it.

I know that isn't easy to capture in benchmarks, but I hope Gemini 3.0 has improved in this regard

  • plaidfuji 13 hours ago

    I have the same experience with Gemini, that it’s incredibly accurate but puts in defensive code and error handling to a fault. It’s pretty easy to just tell it “go easy on the defensive code” / “give me the punchy version” and it cleans it up

    • joegibbs 12 hours ago

      Yes, the defensive code is something that most models seem to struggle with - even Claude 4.5 Sonnet, even after explicitly being prompted not to, still adds pointless null checks and fallbacks in scripting languages where something being null won't cause any problems apart from an error being logged. I get this particularly when writing Angelscript for Unreal. This isn't surprising, since as a niche language there's a lack of training data, and the syntax is very similar to Unreal C++, which does crash to desktop when accessing a null reference.
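
      The pattern looks roughly like this (a Python stand-in for the Angelscript case; the defensive version is the kind of thing models tend to emit unprompted):

          def rename_actor_lean(actor, name: str) -> None:
              # In an engine where a null access just logs an error, this is enough.
              actor.set_name(name)

          def rename_actor_defensive(actor, name: str) -> None:
              # What models tend to add: checks and fallbacks that only mask problems here.
              if actor is None:
                  print("warning: actor was None, skipping rename")
                  return
              if not name:
                  name = "Unnamed"
              try:
                  actor.set_name(name)
              except Exception as exc:  # broad fallback that hides real bugs
                  print(f"failed to rename actor: {exc}")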

  • poyu 15 hours ago

        but I would never want to have to interact with
    
    That is its job security ;)

  • jmkni 20 hours ago

    I can relate to this, it's doing exactly what I want, but it ain't pretty.

    It's fine though if you take the time to learn what it's doing and write a nicer version of it yourself

  • eitally 14 hours ago

    I have had a similar experience vibe coding with Copilot (ChatGPT) in VSCode, against the Gemini API. I wanted to create a dad joke generator and then have it also create a comic styled 4 cel interpretation of the joke. Simple, right? I was able to easily get it to create the joke, but it repeatedly failed on the API call for the image generation. What started as perhaps 100 lines of total code in two files ended up being about 1500 LOC with an enormous built-in self-testing mechanism ... and it still didn't work.

coffeecoders a day ago

Feels like the same consolidation cycle we saw with mobile apps and browsers is playing out here. The winners aren’t necessarily those with the best models, but those who already control the surface where people live their digital lives.

Google injects AI Overviews directly into search, X pushes Grok into the feed, Apple wraps "intelligence" into Maps and on-device workflows, and Microsoft is quietly doing the same with Copilot across Windows and Office.

Open models and startups can innovate, but the platforms can immediately put their AI in front of billions of users without asking anyone to change behavior (not even typing a new URL).

  • Workaccount2 a day ago

    AI Overviews has arguably done more harm than good for them, because people assume it's Gemini, but really it's some ultra-lightweight model made for handling millions of queries a minute, and it has no shortage of stupid mistakes/hallucinations.

  • bitpush 20 hours ago

    > Google injects AI Overviews directly into search, X pushes Grok into the feed, Apple wraps "intelligence" into Maps and on-device workflows, and Microsoft is quietly doing the same with Copilot across Windows and Office.

    One of them isn't the same as the others (hint: it is Apple). The only thing Apple is doing with Maps is adding ads https://www.macrumors.com/2025/10/26/apple-moving-ahead-with...

  • acoustics a day ago

    Microsoft hasn't been very quiet about it, at least in my experience. Every time I boot up Windows I get some kind of blurb about an AI feature.

    • CobrastanJorji 21 hours ago

      Man, remember the days where we'd lose our minds at our operating systems doing stuff like that?

      • esafak 19 hours ago

        The people who lost their minds jumped ship. And I'm not going to work at a company that makes me use it, either. So, not my problem.

  • int_19h 15 hours ago

    Gemini genuinely has an edge over the others in its super-long context size, though. There are some tasks where this is the deal breaker, and others where you can get by with a smaller size, but the results just aren't as good.

  • ehsankia 14 hours ago

    > The winners aren’t necessarily those with the best models

    Is there evidence that's true? That the other models are significantly better than the ones you named?

stevesimmons a day ago

A nice Easter egg in the Gemini 3 docs [1]:

    If you are transferring a conversation trace from another model, ... to bypass strict validation in these specific scenarios, populate the field with this specific dummy string:

    "thoughtSignature": "context_engineering_is_the_way_to_go"
[1] https://ai.google.dev/gemini-api/docs/gemini-3?thinking=high...
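
For context, that field rides along with the parts of a prior model turn when you replay a conversation through the REST API (as I understand the docs). A hedged sketch of what that might look like with plain requests (the model id, endpoint version, and message text are my assumptions, not from the docs):

    import os
    import requests

    # Assumed endpoint/model id; check the docs for the current preview name.
    URL = ("https://generativelanguage.googleapis.com/v1beta/"
           "models/gemini-3-pro-preview:generateContent")

    payload = {
        "contents": [
            {"role": "user", "parts": [{"text": "Summarize our plan so far."}]},
            # A model turn imported from another provider: no real thought signature
            # exists, so the documented dummy value bypasses strict validation.
            {"role": "model", "parts": [{
                "text": "Plan: ship the prototype, then benchmark it.",
                "thoughtSignature": "context_engineering_is_the_way_to_go",
            }]},
            {"role": "user", "parts": [{"text": "What should we benchmark first?"}]},
        ],
    }

    resp = requests.post(URL, params={"key": os.environ["GEMINI_API_KEY"]}, json=payload)
    print(resp.json())
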
  • bijant 21 hours ago

    It's an artifact of the problem that they don't show you the reasoning output but need it for further messages, so they save each API conversation on their side and give you a reference number.

    It sucks from a GDPR compliance perspective as well as in terms of transparent pricing, since you have no way to control reasoning trace length (which is billed at the much higher output rate) other than switching between low/high - but if the model decides to think longer, "low" could result in more tokens used than "high" for a prompt where the model decides not to think that much. "Thinking budgets" are now "legacy", and thus while you can constrain output length you cannot constrain cost.

    Obviously you also cannot optimize your prompts if some red herring makes the LLM get hung up on something irrelevant, only to realize this in later thinking steps. This will happen with EVERY SINGLE prompt if it's caused by something in your system prompt. Finding what makes the model go astray can be rather difficult with 15k-token system prompts or a multitude of MCP tools; you're basically blinded while trying to optimize a black box. Obviously you can try different variations of different parts of your system prompt or tool descriptions, but just because they result in fewer thinking tokens does not mean they are better - those reasoning steps might have actually been beneficial (if only in edge cases). This would be immediately apparent upon inspection, but it is hard/impossible to find out without access to the full Chain of Thought.

    For the uninitiated, the reasons OpenAI started replacing the CoT with summaries were A. to prevent rapid distillation, as they suspected DeepSeek to have used for R1, and B. to prevent embarrassment if app users see the CoT and find parts of it objectionable/irrelevant/absurd (reasoning steps that make sense for an LLM do not necessarily look like human reasoning). That's a tradeoff that is great for end-users but terrible for developers. As open-weights LLMs necessarily output their full reasoning traces, the potential to optimize prompts for specific tasks is much greater, and for certain applications that will certainly outweigh the performance delta to Google/OpenAI.

    • int_19h 16 hours ago

      I was under the impression that those reasoning outputs that you get back aren't references but simply raw CoT strings that are encrypted.

CMay 16 hours ago

I was sorting out the right way to handle a medical thing and Gemini 2.5 Pro was part of the way there, but it lacked some necessary information. Got the Gemini 3.0 release notification a few hours after I was looking into that, so I tried the same exact prompt and it nailed it. Great, useful, actionable information that surfaced actual issues to look out for and resolved some confusion. Helped work through the logic, norms, studies, standards, federal approvals and practices.

Very good. Nice work! These things will definitely change lives.

__jl__ a day ago

API pricing is up to $2/M for input and $12/M for output

For comparison: Gemini 2.5 Pro was $1.25/M for input and $10/M for output; Gemini 1.5 Pro was $1.25/M for input and $5/M for output.
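
As a rough back-of-envelope (request sizes are my own example, ignoring caching and thinking-token billing details), a call with 50k input tokens and 5k output tokens comes out to about $0.16 on Gemini 3 versus about $0.11 on 2.5 Pro:

    def cost(input_tokens: int, output_tokens: int, in_price: float, out_price: float) -> float:
        """Dollar cost given per-million-token rates."""
        return input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price

    print(cost(50_000, 5_000, 2.00, 12.00))   # Gemini 3 Pro   -> 0.16
    print(cost(50_000, 5_000, 1.25, 10.00))   # Gemini 2.5 Pro -> 0.1125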

  • raincole a day ago

    Still cheaper than Sonnet 4.5: $3/M for input and $15/M for output.

    • brianjking a day ago

      It is so impressive that Anthropic has been able to maintain this pricing still.

      • bottlepalm 21 hours ago

        Claude is just so good. Every time I try moving to ChatGPT or Gemini, they end up making concerning decisions. Trust is earned, and Claude has earned a lot of trust from me.

        Honestly Google models have this mix of smart/dumb that is scary. Like if the universe is turned into paperclips, then it'll probably be a Google model that did it.

        • int_19h 16 hours ago

          Well, it depends. Just recently I had Opus 4.1 spend 1.5 hours looking at 600+ sources while doing deep research, only to get back to me with a report consisting of a single sentence: "Full text as above - the comprehensive summary I wrote". Anthropic acknowledged that it was a problem on their side but refused to do anything to make it right, even though all I asked them to do was to adjust the counter so that this attempt doesn't count against their incredibly low limit.

        • epolanski 19 hours ago

          Idk Anthropic has the least consistent models out there imho.

      • Aeolun a day ago

        Because every time I try to move away I realize there’s nothing equivalent to move to.

        • Alex-Programs a day ago

          People insist upon Codex, but it takes ages and has an absolutely hideous lack of taste.

          • sumedh 7 hours ago

            It creates beautiful websites though.

  • dktp 16 hours ago

    It's interesting that grounding with search cost changed from

    * 1,500 RPD (free), then $35 / 1,000 grounded prompts

    to

    * 1,500 RPD (free), then (Coming soon) $14 / 1,000 search queries

    It looks like the pricing changed from per-prompt (previous models) to per-search (Gemini 3)

  • jhack a day ago

    With this kind of pricing I wonder if it'll be available in Gemini CLI for free or if it'll stay at 2.5.

  • fosterfriends a day ago

    Thrilled to see the cost is competitive with Anthropic.

  • hirako2000 a day ago

    [flagged]

    • mupuff1234 a day ago

      I assume the model is just more expensive to run.

      • hirako2000 a day ago

        Likely. The point is we would never know.

siva7 a day ago

I have my own private benchmarks for reasoning capabilities on complex problems and I test them against SOTA models regularly (professional cases from law and medicine). Anthropic (Sonnet 4.5 Extended Thinking) and OpenAI (Pro Models) get halfway decent results on many cases while Gemini Pro 2.5 struggled (it was overconfident in its initial assumptions). So I ran these benchmarks against Gemini 3 Pro and I'm not impressed. The reasoning is way more nuanced than their older model, but it still makes mistakes which the other two SOTA competitor models don't make. Like it forgets in a law benchmark that those principles don't apply in the country from the provided case. It seems very US-centric in its thinking, whereas the Anthropic and OpenAI pro models seem to be more aware of the assumed cultural context of the case. All in all - I don't think this new model is ahead of the other two main competitors - but it has a new nuanced touch and is certainly way better than Gemini 2.5 Pro (which says more about how bad that one actually was for complex problems).

  • MaxL93 20 hours ago

    > It seems very US centric in its thinking

    I'm not surprised. I'm French and one thing I've consistently seen with Gemini is that it loves to use Title Case (Everything is Capitalized Except the Prepositions) even in French or other languages where there is no such thing. A 100% american thing getting applied to other languages by the sheer power of statistical correlation (and probably being overtrained on USA-centric data). At the very least it makes it easy to tell when someone is just copypasting LLM output into some other website.

    • mpalmer 2 hours ago

      > Title Case (Everything is Capitalized Except the Prepositions)

      If this is an American thing I'm happy to disown/denounce it; it's my least favorite pattern in Gemini output.

crawshaw 21 hours ago

Has anyone who is a regular Opus / GPT5-Codex-High / GPT5 Pro user given this model a workout? Each Google release is accompanied by a lot of devrel marketing that sounds impressive but whenever I put the hours into eval myself it comes up lacking. Would love to hear that it replaces another frontier model for someone who is not already bought into the Gemini ecosystem.

  • film42 21 hours ago

    At this point I'm only using google models via Vertex AI for my apps. They have a weird QoS rate limit but in general Gemini has been consistently top tier for everything I've thrown at it.

    Anecdotal, but I've also not experienced any regression in Gemini quality where Claude/OpenAI might push iterative updates (or quantized variants for performance) that cause my test bench to fail more often.

    • gordonhart 19 hours ago

      Matches my experience exactly. It's not the best at writing code but Gemini 2.5 Pro is (was) the hands-down winner in every other use case I have.

      This was hard for me to accept initially as I've learned to be anti-Google over the years, but the better accuracy was too good to pass up on. Still expecting a rugpull eventually — price hike, killing features without warning, changing internal details that break everything — but it hasn't happened yet.

  • mmaunder 12 hours ago

    Yes. I am. It is spectacular in raw cognitive horsepower. Smarter than gpt5-codex-high but Gemini CLI is still buggy as hell. But yes, 3 has been a game changer for me today on hardcore Rust, CUDA and Math projects. Unbelievable what they’ve accomplished.

  • Szpadel 19 hours ago

    I gave it a spin with instructions that worked great with gpt-5-codex (5.1 regressed a lot so I do not even compare to it).

    Code quality was fine for my very limited tests but I was disappointed with instruction following.

    I tried a few tricks but I wasn't able to convince it to first present a plan before starting implementation.

    I have instructions saying that it should first do exploration (where it tries to discover what I want), then plan the implementation, and then code, but it always jumps directly to code.

    This is a big issue for me, especially because gemini-cli lacks a plan mode like Claude Code has.

    For Codex, those instructions make plan mode redundant.

    • m3kw9 17 hours ago

      just say "don't code yet" at the end. I never use plan mode because plan mode is just a prompt anyways.

  • Narciss 16 hours ago

    I've been working with it, and so far it's been very impressive. Better than Opus in my feels, but I have to test more, it's super early days

    • mewpmewp2 15 hours ago

      What I usually test is trying to get them to build a full scalable SaaS application from scratch... It seemed very impressive in how it did the early code organization using Antigravity, but then at some point, all of a sudden, it started really getting stuck and constantly stopped producing, and I had to trigger continue or babysit it. I don't know if I could've been doing something better, but that was just my experience. Seemed impressive at first, but otherwise, at least vs Antigravity, Codex and Claude Code scale more reliably.

      Just early anecdote from trying to build that 1 SaaS application though.

      • Narciss 4 hours ago

        It sounds like an API issue more than anything. I was working with it through cursor on a side project, and it did better than all previous models at following instructions, refactoring, and UI-wise it has some crazy skills.

        What really impressed me was when I told it that I wanted a particular component’s UI to be cleaned up but I didn’t know how exactly, just wanted to use its deep design expertise to figure it out, and it came up with a UX that I would’ve never thought of and that was amazing.

        Another important point is that the error rate for my session yesterday was significantly lower than when I’ve used any other model.

        Today I will see how it does when I use it at work, where we have a massive codebase that has particular coding conventions. Curious how it does there.

ogig 14 hours ago

I just gave it a short description of a small game I had an idea for. It was 7 sentences. It pretty much nailed a working prototype, using React, clean CSS, TypeScript and state management. It even implemented a Gemini query using the API for strategic analysis given a game state. I'm more than impressed, I'm terrified. Seriously thinking of a career change.

  • wraptile 11 hours ago

    I find it funny to find this almost exact same post in every new model release thread. Yet here we are - spending the same amount of time, if not more, finishing the rest of the owl.

    • cafebabbe 5 hours ago

      Seems like the whole world forgot what this job was really about :/

  • WhyOhWhyQ 11 hours ago

    I just spent 12 hours a day vibe coding for a month and a half with Claude (which has SWE benchmarks equal to Gemini 3). I started out terrified, but eventually I realized that these are just remarkably far away from actually replacing a real software engineer. For prototypes they're amazing, but when you're just straight vibe coding you get stuck in a hell where you don't want to, or can't efficiently, really check what's going on under the hood, yet it's not really doing the thing you want.

    Basically these tools can get you to a 100k LOC project without much effort, but it's not going to be a serious product. A serious product still requires understanding.

  • osn9363739 11 hours ago

    Can you share the code?

    • rkozik1989 3 minutes ago

      No because this story didn't happen.

bnchrch a day ago

I've been so happy to see Google wake up.

Many can point to a long history of killed products and soured opinions, but you can't deny they've been the great balancing force (often for good) in the industry.

- Gmail vs Outlook

- Drive vs Word

- Android vs iOS

- Worklife balance and high pay vs the low salary grind of before.

They've done heaps for the industry. I'm glad to see signs of life. Particularly in their P/E, which was unjustly low for a while.

  • digbybk a day ago

    Ironically, OpenAI was conceived as a way to balance Google's dominance in AI.

    • kccqzy 19 hours ago

      Balance is too weak of a word. OpenAI was conceived specifically to prevent Google from getting AGI first. That was its original goal. At the time of its founding Google was the undisputed leader of AI anywhere in the world. Musk was then very worried about AGI being developed behind closed doors particularly Google, which was why he was the driving force behind the founding of OpenAI.

      • kranke155 17 hours ago

        The book Empire of AI describes him as being particularly fixated on Demis as some kind of evil genius. From the book, early OAI employees couldn’t take the entire thing too seriously and just focused on the work.

      • wraptile 11 hours ago

        > Musk was then very worried about AGI being developed behind closed doors

        *closed doors that aren't his

    • dragonwriter a day ago

      I thought it was a workaround to Google's complete disinterest in productizing the AI research it was doing and publishing, rather than a way to balance their dominance in a market which didn't meaningfully exist.

      • mattnewton 21 hours ago

        That’s how it turned out, but IIRC at the time of OpenAI’s founding, “AI” was search and RL which Google and deep mind were dominating, and self driving, which Waymo was leading. And OpenAI was conceptualized as a research org to compete. A lot has changed and OpenAI has been good at seeing around those corners.

      • jonny_eh 19 hours ago

        That was actually Character.ai's founding story. Two researchers at Google that were frustrated by a lack of resources and the inability to launch an LLM based chatbot. The founders are now back at Google. OpenAI was founded based on fears that Google would completely own AI in the future.

      • sgt101 19 hours ago

        I think that Google didn't see the business case in that generation of models, and also saw significant safety concerns. If AI had been delayed by... 5 years... would the world really be a worse place?

        Yes - less exciting! But worse?

      • jpadkins 21 hours ago

        Elon Musk specifically gave OAI $150M early on because of the risk of Google being the only Corp that has AGI or super-intelligence. These emails were part of the record in the lawsuit.

    • CobrastanJorji 21 hours ago

      Pffft. OpenAI was conceived to be Open, too.

      • lemoncucumber 21 hours ago

        It’s a common pattern for upstarts to embrace openness as a way to differentiate and gain a foothold then become progressively less open once they get bigger. Android is a great example.

        • bitpush 20 hours ago

          Last I checked, Android is still open source (as AOSP) and people can do whatever-the-f-they-want with the source code. Are we defining open differently?

          • lemoncucumber 19 hours ago

            I think we're defining "less" differently. You're interpreting "less open" to mean "not open at all," which is not what I said.

            There's a long history of Google slowly making the experience worse if you want to take advantage of the things that make Android open.

            For example, by moving features that were in the AOSP into their proprietary Play Services instead [1].

            Or coming soon, preventing sideloading of unverified apps if you're using a Google build of Android [2].

            In both cases, it's forcing you to accept tradeoffs between functionality and openness that you didn't have to accept before. You can still use AOSP, but it's a second class experience.

            [1] https://arstechnica.com/gadgets/2018/07/googles-iron-grip-on...

            [2] https://arstechnica.com/gadgets/2025/08/google-will-block-si...

          • ipaddr 20 hours ago

            Core is open source but for a device to be "Android compatible" and access the Google Play Store and other Google services, it must meet specific requirements from Google's Android Compatibility Program. These additional proprietary components are what make the final product closed source.

            The Android Open Source Project is not Android.

            • bitpush 15 hours ago

              > The Android Open Source Project is not Android.

              Was "Android" the way you define it ever open? Isnt it similar to chromium vs chrome? chromium is the core, and chrome is the product built on top of it - which is what allows Comet, Atlas, Brave to be built on.

              That's the same thing what GrapheneOS, /e/ OS and others are doing - building on top of AOSP.

              • oblio 10 hours ago

                > Was "Android" the way you define it ever open?

                Yes. Initially all the core OS components were OSS.

          • rvz 11 hours ago

            "open" and requiring closed blobs doesn't mean it's "open source".

            It's like saying Nvidia's drivers are "open source" because there is a repository there, but it has only binaries in the folders.

  • ThrowawayR2 a day ago

    They've poisoned the internet with their monopoly on advertising, the air pollution of the online world, which is a transgression that far outweighs any good they might have done. Much of the negative social effects of being online come from the need to drive more screen time, more engagement, more clicks, and more ad impressions firehosed into the faces of users for sweet, sweet, advertiser money. When Google finally defeats ad-blocking, yt-dlp, etc., remember this.

    • bitpush 20 hours ago

      This is an understandable, but simplistic way of looking at the world. Are you also gonna blame Apple for mining for rare earths, because they made a successful product that requires exotic materials which needs to be mined from earth? How about hundreds of thousands of factory workers that are being subjected to inhumane conditions to assemble iPhones each year?

      For every "OMG, internet is filled with ads", people are conveniently forgetting the real-world impact of ALL COMPANIES (and not just Apple) btw. Either you should be upset with the system, and not selectively at Google.

      • astrange 18 hours ago

        > How about hundreds of thousands of factory workers that are being subjected to inhumane conditions to assemble iPhones each year?

        That would be bad if it happened, which is why it doesn't happen. Working in a factory isn't an inhumane condition.

      • fractalf 19 hours ago

        I don't think your comment justifies calling out any form of simplistic view. It doesn't make sense. All the big players are bad. They're companies; their one and only purpose is to make money, and they will do whatever it takes to do it. Most of which does not serve humankind.

        • jimbokun 18 hours ago

          Compared to what?

      • dieggsy 19 hours ago

        It seems okay to me to be upset with the system and also point out the specific wrongs of companies in the right context. I actually think that's probably most effective. The person above specifically singled out Google as a reply to a comment praising the company, which seems reasonable enough. I guess you could get into whether it's a proportional response; the praise wasn't that high and also exists within the context of the system as you point out. Still, their reply doesn't necessarily indicate that they're not upset with all companies or the system.

      • observationist 19 hours ago

        Yes, we're absolutely holding Apple accountable for outsourcing jobs, degrading the US markets, using slave and child labor, laundering cobalt from illegal "artisanal" mines in the DRC, and whitewashing what they do by using corporate layering and shady deals to put themselves at sufficient degrees of separation from problematic labor and sources to do good PR, but not actually decoupling at all.

        I also hold Americans and western consumers responsible for simply allowing that to happen. As long as the human rights abuses and corruption are 3 or 4 degrees of separation from the retailer, people seem to be perfectly OK with chattel slavery and child labor and indentured servitude and all the human suffering that sits at the base of all our wonderful technology and cheap consumer goods.

        If we want to have things like minimum wage and workers rights and environmental protections, then we should mandate adherence to those standards globally. If you want to sell products in the US, the entire supply chain has to conform to US labor and manufacturing and environmental standards. If those standards aren't practical, then they should be tossed out - the US shouldn't be doing performative virtue signalling as law, incentivizing companies to outsource and engage in race to the bottom exploitation of labor and resources in other countries. We should also have tariffs and import/export taxes that allow competitive free trade. It's insane that it's cheaper to ship raw materials for a car to a country in southeast asia, have it refined and manufactured into a car, and then shipped back into the US, than to simply have it mined, refined, and manufactured locally.

        The ethics and economics of America are fucking dumb, but it's the mega-corps, donor class, and uniparty establishment politicians that keep it that way.

        Apple and Google are inhuman, autonomous entities that have effectively escaped the control and direction of any given human decision tree. Any CEO or person in power that tried to significantly reform the ethics or economics internally would be ousted and memory-holed faster than you can light a cigar with a hundred dollar bill. We need term limits, no more corporation people, money out of politics, and an overhaul, or we're going to be doing the same old kabuki show right up until the collapse or AI takeover.

        And yeah, you can single out Google for their misdeeds. They, in particular, are responsible for the adtech surveillance ecosystem and lack of any viable alternatives by way of their constant campaign of enshittification of everything, quashing competition, and giving NGOs, intelligence agencies, and government departments access to the controls of censorship and suppression of political opposition.

        I haven't and won't use Google AI for anything, ever, because of any of the big labs, they are most likely and best positioned to engage in the worst and most damaging abuse possible, be it manipulation, invasion of privacy, or casual violation of civil rights at the behest of bureaucratic tyrants.

        If it's not illegal, they'll do it. If it's illegal, they'll only do it if it doesn't cost more than they can profit. If they profit, even after getting caught and fined and taking a PR hit, they'll do it, because "number go up" is the only meaningful metric.

        The only way out is principled regulation, a digital bill of rights, and campaign finance reform. There's probably no way out.

        • astrange 18 hours ago

          > laundering cobalt from illegal "artisanal" mines in the DRC

          They don't, all cobalt in Apple products is recycled.

          > and whitewashing what they do by using corporate layering and shady deals to put themselves at sufficient degrees of separation from problematic labor and sources to do good PR, but not actually decoupling at all.

          They don't, Apple audits their entire supply chain so it wouldn't hide anything if something moved to another subcontractor.

          • sharpshadow 18 hours ago

            One can claim 100% recycled cobalt under the mass balance system even if recycled and non-recycled cobalt were mixed, as long as the total amount used in production is less than or equal to the recycled cobalt purchased on the books (in other words, buy 10 tonnes of certified recycled cobalt on paper, run 10 tonnes of mixed cobalt through production, and the output can be labelled 100% recycled). At least here[0] they claim their recycled cobalt references are under the mass balance system.

            0. https://www.apple.com/newsroom/2023/04/apple-will-use-100-pe...

        • jimbokun 18 hours ago

          Where is the fairy godmother's magic wand that will allow you to make all the governments of the world instantly agree to all of this?

    • ApolloFortyNine 19 hours ago

      People love getting their content for free and that's what Google does.

      Even 25 years ago, people wouldn't have believed YouTube could exist. Anyone can upload whatever they want, however often they want; YouTube will be responsible for promoting it, will serve it to however many billions of users want to view it, and will pay you 55% of the revenue it makes?

      • brabel 18 hours ago

        Yep, it's hard to believe it exists for free and with not a lot of ads when you have a good ad blocker... though the content creators' ads are inescapable, which I think is OK since they're making a little money in exchange for what, a minute or so of your inconvenience (if you're not skipping the ad, which you aren't, right??) - after which you can watch some really good content. The history channels on YT are amazing, maybe world changing - they get people to learn history and actually enjoy it. Same with some math channels like 3Blue1Brown, which are just outstanding, and many more.

      • amelius 19 hours ago

        > People love getting their content for free and that's what Google does.

        They are forcing a payment method on us. It's basically like they have their hand in our pockets.

    • visarga a day ago

      Yes, this is correct, and it happens everywhere. App Store, Play Store, YouTube, Meta, X, Amazon and even Uber - they all play in two-sided markets, exploiting both their users and their providers at the same time.

    • notepad0x90 20 hours ago

      They're not a moral entity. Corporations aren't people.

      I think a lot of the harms you mentioned are real, but they're a natural consequence of capitalistic profit chasing. Governments are supposed to regulate monopolies and anti-consumer behavior like that. Instead of regulating surveillance capitalism, governments are using it to bypass laws restricting their power.

      If I were a google investor, I would absolutely want them to defeat ad-blocking, ban yt-dlp, dominate the ad-market and all the rest of what you said. In capitalism, everyone looks out for their own interests, and governments ensure the public isn't harmed in the process. But any time a government tries to regulate things, the same crowd that decries this oppose government overreach.

      Voters are people and they are moral entities, direct any moral outrage at us.

      • layer8 20 hours ago

        Why should the collective of voters be any more of a moral entity than the collective of people who make up a corporation (which you may include its shareholders in if you want)?

        It’s perfectly valid to criticize corporations for their actions, regardless of the regulatory environment.

        • notepad0x90 19 hours ago

          > Why should the collective of voters..

          They're accountable as individuals not as a collective. And it so happens, they are responsible for their government in a democracy but corporations aren't responsible for running countries.

          > It’s perfectly valid to criticize corporations for their actions, regardless of the regulatory environment.

          In the free speech sense, sure. But your criticism isn't founded on solid ground. You should expect corporations to do whatever they have to do within the bounds of the law to turn a profit. Their responsibility is to their investors and employees, they have no responsibility to the general public beyond that which is laid out in the law.

          The increasing demand in corporations being part of the public/social moral consciousness is causing them to manipulate politics more and more, eroding what little voice the individuals have.

          You're trying to live in a feudal society when you treat corporations like this.

          If you're unhappy with the quality of Google's services, don't do business with them. If they broke the law, they should pay for it. But expecting them to be a beacon of morality is accepting that they have a role in society and government beyond mere revenue generating machines. And if you expect them to have that role, then you're also giving them the right to enforce that expectation as a matter of corporate policy instead of law. Corporate policies then become as powerful as law, and corporations have to interfere with matters of government policy on the basis of morality instead of business, so you now have an organization with lots of money and resources competing with individual voters.

          And then people have the nerve to complain about PACs, money in politics, billionaire's influencing the government, bribery,etc.. you can't have it both ways. Either we have a country run partly by corporations, and a society driven and controlled by them, or we don't.

          • layer8 18 hours ago

            When we criticize corporations, we really are criticizing the people who make the decisions in the corporations. I don’t see why we shouldn’t apply exactly the same moral standards to people’s decision in the context of a corporation as we do to people’s decisions made in any other context. You talk about lawfulness, but we wouldn’t talk about morals if we meant lawfulness. It’s also lawful to vote for the hyper-capitalist party, so by the same token moral outrage shouldn’t be directed towards the voters.

            • notepad0x90 17 hours ago

              I get that, but those CEOs are not elected officials, they don't represent us, and they have no part in the discourse of law making (despite the state of things). In their capacity as executives of a company, they have no rights, no say in what we find acceptable or not in society. We tell them what they can and cannot do, or else. That's the social contract we have with companies and their executives.

              Being in charge of a corporation shouldn't elevate someone to a platform where they have a louder voice than the common man. They can vote just like everyone else at the voting booth. They can participate in politics in their capacity as individuals. But neither money nor corporate influence has a place in the governance of a democratic society.

              I talk about lawfulness because the law is the only rule a corporation can and should be expected to follow. Morals are for individuals. Corporations have no morals; they are neither moral nor immoral. Their owners have morals, and you can criticize their greed, but that is a construct of capitalism. They're supposed to enrich themselves. You can criticize them for valuing money over morals, but that's like criticizing the ocean for being wet or the sun for being too hot. It's what they do. It's their role in society.

              If a small business owner raises prices to increase revenue, that isn't immoral, right? Even though the poor people who frequent the shop will be worse off? Amp that up to the scale of a megacorp, and the morality is still the same.

              Corporations are entities that exist for the sole purpose of generating revenue for their owners. So when you criticize Google, you're criticizing a logical organization designed to do the very thing you're criticizing it for doing. The CEO of Google is acting in his official capacity, doing the job he was hired to do when he resists ad blocking. The investors of Google are risking their money in anticipation of ROI, so their expectation of Google is valid as well.

              When you find something to be immoral, the only meaningful avenue for expressing that with corporations is the law. You're criticizing Google as if it were an elected official we could vote in or out of office, or as if it were an entity that could be convinced of its moral failings.

              When we don't speak up and use our voice, we lose it.

        • svnt 20 hours ago

          Because of the inherent structure of capitalism, which leads to the inevitable: the tragedy of the commons.

      • ThrowawayR2 20 hours ago

        Why are you directing the statement that "[Corporations are] not a moral entity" at me instead of the parent poster claiming that "[Google has] been the great balancing force (often for good) in the industry."? Saying that Google is a force "for good" is a claim by them that corporations can be moral entities; I agree with you that they aren't.

        • notepad0x90 19 hours ago

          I could have, just the same, I suppose, but their comment was about Google being a balancing force in terms of competition and monopoly. It wasn't praise of their moral character. They did what was best for their business, and that turns out to be good for reducing monopolies. If it turned out to be monopolistic, I would be wondering what Congress and the DOJ are doing about it, instead of criticizing Google for trying to turn a profit.

    • kryogen1c 19 hours ago

      > They've poisoned the internet

      And what of the people that ravenously support ads and ad-supported content, instead of paying?

      What of the consumptive public? Are they not responsible for their choices?

      I do not consume algorithmic content, I do not have any social media (unless you count HN for either).

      You can't have it both ways. Lead by example, stop using the poison and find friends that aren't addicted. Build an offline community.

      • xordon 19 hours ago

        I don't understand your logic, it seems like victim blaming. Using the internet and pointing out that targeted advertising has a negative effect on society is not "having it both ways".

        Also, HN is by definition algorithmic content and social media, in your mind what do you think it is?

        • carlosjobim 19 hours ago

          You are not a "victim" for using or purchasing something which is completely unnecessary. Or if that's the case, then you have no agency and have to be medicinally declared unfit to govern yourself and be appointed a legal guardian to control your affairs.

    • starchild3001 19 hours ago

      What kind of world do you live in? Actually, Google ads tend to have some of the highest ROI for the advertiser and are the most likely to be beneficial for the user, versus the pure junk ads that aren't personalized and banner ads that have zero relationship to me. Google Ads is the enabler of the free internet. I for one am thankful to them. Otherwise you end up paying for NYT, Washington Post, The Information, etc. -- virtually any high quality web site (including Search).

      • shakna 19 hours ago

        Ads. Beneficial to the user.

        Most of the time, you need to pick one. Modern advertising is not based on finding the item with the most utility for the user - which means they are aimed at manipulating the user's behaviour in one way or another.

    • nwienert 20 hours ago

      Suppressed wages by colluding with Apple not to poach.

  • epolanski 19 hours ago

    Outlook is much better than Gmail and so is the office suite.

    It's good there's competition in the space though.

    • brailsafe 19 hours ago

      Outlook is not better in ways that email or gmail users necessarily care about, and in my experience gets in the way more than it helps with productivity or anything it tries to be good at. I've used it in office settings because it's the default, but never in my life have I considered using it by choice. If it's better, it might not matter.

    • vanillax 19 hours ago

      I couldn't disagree more

  • redbell 21 hours ago

    > Drive vs Word

    You mean Drive vs OneDrive or, maybe Docs vs Word?

  • 63stack a day ago

    - Making money vs general computing

  • drewda 21 hours ago

    For what it's worth, most of those examples are acquisitions. That's not a hit against Google in particular. That's the way all big tech co's grow. But it's not necessarily representative of "innovation."

    • charcircuit 21 hours ago

      >most of those examples are acquisitions

      Taking those products from where they were to the juggernauts they are today was not guaranteed to succeed, nor was it easy. And yes, plenty of innovation happened with these products post-acquisition.

      • hvb2 20 hours ago

        But there's also plenty that fail, it's just that you won't know about those.

        I don't think what you're saying proves that the companies that were acquired couldn't have done that themselves.

  • storus 21 hours ago

    If you consider surveillance capitalism and dark pattern nudges a good thing, then sure. Gemini has the potential to obliterate their current business model completely so I wouldn't consider that "waking up".

  • qweiopqweiop a day ago

    Forgot to mention absolutely milking every ounce of their users attention with Youtube, plus forcing Shorts!

    • bitpush 20 hours ago

      Why stop at YouTube? Blame Apple for creating an addictive gadget that has single-handedly wasted billions of hours of collective human intelligence. Life was so much better before iPhones.

      But I hear you say - you can use iPhones for productive things and not just mindless brainrot. And that's the same with YouTube as well. Many waste time on YouTube, but many learn and do productive things.

      Don't paint everything with a single, large, coarse brush stroke.

    • polotics 20 hours ago

      frankly when compared against TikTok, Insta, etc, YouTube is a force for good. Just script the shorts away...

  • kevstev 21 hours ago

    All those examples date back to the 2000s. Android has seen some significant improvements, but everything else has stagnated if not enshittified - remember when Google told us to never worry about deleting anything? - and then they started backing up my photos without me asking and are now constantly nagging me to pay them a monthly fee?

    They have done a lot, but most of it was in the "don't be evil" days, and those are a fading memory.

  • IlikeKitties 20 hours ago

    Something about bringing balance to the force not destroying it.

  • rvz a day ago

    Google has always been there; it's just that many didn't realize DeepMind even existed. I said years ago that it needed to be put to commercial use [0], and Google AI != DeepMind.

    You are now seeing their valuation finally adjusting to that fact all thanks to DeepMind finally being put to use.

    [0] https://news.ycombinator.com/item?id=34713073

  • stephc_int13 21 hours ago

    Google is using the typical monopoly playbook as most other large orgs, and the world would be a "better place" if they are kept in check.

    But at least this company is not run by a narcissistic sociopath.

  • samdoesnothing 19 hours ago

    Seriously? Google is an incredibly evil company whose net contribution to society is probably only barely positive thanks to their original product (search). Since completely de-googling I've felt a lot better about myself.

aliljet a day ago

Understanding precisely why Gemini 3 isn't front of the pack on SWE Bench is really what I was hoping to understand here. Especially for a blog post targeted at software developers...

  • Workaccount2 21 hours ago

    It doesn't matter, the real benchmark is taking the community temperature on the model after a few weeks of usage.

    • epolanski 19 hours ago

      Imho Gemini 2.5 was by far the better model on non-trivial tasks.

      • oezi 19 hours ago

        To this day, I still don't understand why Claude gets more acclaim for coding. Gemini 2.5 consistently outperformed Claude and ChatGPT mostly because of the much larger context.

        • WhyOhWhyQ 13 hours ago

          I'm not sure about this. I used gemini and claude for about 12 hours a day for a month and a half straight in an unhealthy programmer bender and claude was FAR superior. It was not really that close. Going to be interesting to test gemini 3 though.

          • davidmurdoch 2 hours ago

            Gemini 2.5 is prone to apology loops, and often confuses its own thinking with user input, replying to itself. ChatGPT 5 likes to refuse tasks with "sorry I can't help with that". At least in VSCode's GitHub Copilot Agent mode. Claude hasn't screwed up like that for me.

        • viraptor 18 hours ago

          Different styles of usage? I see Gemini praised for being able to feed the whole project and ask changes. Which is cool and all but... I never do that. Claude for me is better for specific modifications to specific parts of the app. There's a lot of context behind what's "better".

          • Libidinalecon 15 hours ago

            I can't really explain why I have barely used Gemini.

            I think it was just timing with the way models came out. This will be the first time I will have a Gemini subscription and nothing else. This will be the first time I really see what it can do fully.

        • decide1000 8 hours ago

          I use Gemini cli, Claude Code and Codex daily. If I present the same bug to all 3, Gemini often is the one missing a part of the solution or drawing the wrong conclusion. I am curious for G3.

        • nhumrich 14 hours ago

          The secret sauce isn't Claude the model, but Claude code the tool. Harness > model.

          • brazukadev 14 hours ago

            The secret sauce is the MCP that lots of people are starting to talk bad about.

        • artdigital 14 hours ago

          Claude doesn't gaslight me or flat out refuse to do something I ask it to because it believes it won't work anyway. Gemini does.

          Gemini also randomly just reverts everything because of some small mistake it found, and makes assumptions without checking whether they're true (e.g. this lib absolutely HAS TO HAVE a login() method, and if we get a compile error it's my env setup's fault).

          It’s just not a pleasant model to work with

        • dist-epoch 16 hours ago

          Gemini 2.5 couldn't apply an edit to a file if its life depended on it.

          So unless you love copy/pasting code, Gemini 2.5 was useless for agentic coding.

          Great for taking its output and asking Sonnet to apply it, though.

    • ramesh31 20 hours ago

      >"It doesn't matter, the real benchmark is taking the community temperature on the model after a few weeks of usage."

      Indeed. It's almost impossible to truly know a model before spending a few million tokens on a real world task. It will take a step-change level advancement at this point for me to trust anything but Claude right now.

  • svantana a day ago

    SWEBench-Verified is probably benchmaxxed at this stage. Claude isn't even the top performer, that honor goes to Doubao [1].

    Also, the confidence interval for such a small dataset is about 3 percentage points, so these differences could just be down to chance.

    [1] https://www.swebench.com/

    • usaar333 21 hours ago

      Claude 4.5 gets 82% with their own highly customized scaffolding (parallel compute with a scoring function). That beats Doubao.

  • cube2222 a day ago

    Yeah, they mention a benchmark I'm seeing for the first time (Terminal-Bench 2.0) that they're supposedly leading in, while for some reason SWE Bench is down from Sonnet 4.5.

    Curious to see some third-party testing of this model. Currently it seems to primarily improve on "general non-coding and visual reasoning", based on the benchmarks.

    • nico1207 21 hours ago

      They are not even leading in Terminal-Bench... GPT 5.1-codex is better than Gemini 3 Pro

  • pawelduda a day ago

    Why is this particular benchmark important?

    • aliljet a day ago

      Thus far, this is one of the best objective evaluations of real world software engineering...

      • RamtinJ95 a day ago

        I concur with the other commenters, 4.5 is a clear improvement over 4.

      • adastra22 a day ago

        Idk, Sonnet 4.5 scores better than Sonnet 4.0 on that benchmark, but is markedly worse in my usage. The utility of the benchmark is fading as it is gamed.

        • meowface a day ago

          I think I and many others have found Sonnet 4.5 to generally be better than Sonnet 4 for coding.

          • adastra22 a day ago

            Maybe if you conform to its expectations for how you use it. 4.5 is absolutely terrible at following directions, thinks it knows better than you, and will gaslight you until specifically called out on its mistake.

            I have scripted prompts for long duration automated coding workflows of the fire and forget, issue description -> pull request variety. Sonnet 4 does better than you’d expect: it generates high quality mergable code about half the time. Sonnet 4.5 fails literally every time.

            • pawelduda 21 hours ago

              I'm very happy with it TBH, it has some things that annoy me a little bit:

              - slower compared to other models that will also do the job just fine (but excels at more complex tasks),

              - it's very insistent on creating loads of .MD files with overly verbose documentation on what it just did (not really what I ask it to do),

              - it actually deleted a file twice and went "oops, I accidentally deleted the file, let me see if I can restore it!", I haven't seen this happen with any other agent. The task wasn't even remotely about removing anything

              • adastra22 20 hours ago

                The last point is how it usually fails in my testing, fwiw. It usually ends up borking something up, and rather than back out and fix it, it does a 'git restore' on the file - wiping out thousands of lines of unrelated, unstaged code. It then somehow thinks it can recover this code by looking in the git history (??).

                And yes, I have hooks to disable 'git reset', 'git checkout', etc., and warn the model not to use these commands and why. So it writes them to a bash script and calls that to circumvent the hook, successfully shooting itself in the foot.

                Sonnet 4.5 will not follow directions. Because of this, you can't prevent it like you could with earlier models from doing something that destroys the worktree state. For longer-running tasks the probability of it doing this at some point approaches 100%.
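
                For anyone curious what the hook side of this looks like, here is a minimal sketch of the kind of PreToolUse hook I mean, assuming Claude Code's documented hook protocol (the tool call arrives as JSON on stdin, and exit code 2 blocks it with the stderr message shown to the model) - though, as noted above, the model can still route around it by wrapping the command in a script:

                  #!/usr/bin/env python3
                  # Sketch of a PreToolUse hook that rejects worktree-destroying git commands.
                  # Field names and the exit-code convention are my reading of the hook docs.
                  import json, re, sys

                  BLOCKED = re.compile(r"\bgit\s+(restore|checkout|reset)\b")

                  event = json.load(sys.stdin)
                  if event.get("tool_name") == "Bash":
                      command = event.get("tool_input", {}).get("command", "")
                      if BLOCKED.search(command):
                          print("Blocked: use the checkpointing skill instead of "
                                "git restore/checkout/reset.", file=sys.stderr)
                          sys.exit(2)  # exit code 2 = block the tool call
                  sys.exit(0)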

                • ewoodrich 19 hours ago

                  > The last point is how it usually fails in my testing, fwiw. It usually ends up borking something up, and rather than back out and fix it, it does a 'git restore' on the file - wiping out thousands of lines of unrelated, unstaged code. It then somehow thinks it can recover this code by looking in the git history (??).

                  Man I've had this exact thing happen recently with Sonnet 4.5 in Claude Code!

                  With Claude I asked it to try tweaking the font weight of a heading to put the finishing touches on a new page we were iterating on. Looked at it and said, "Never mind, undo that" and it nuked 45 minutes worth of work by running git restore.

                  It immediately realized it fucked up and started running all sorts of git commands and reading its own log trying to reverse what it did, and then came back 5 minutes later saying "Welp, I lost everything, do you want me to manually rebuild the entire page from our conversation history?"

                  In my CLAUDE.md I have instructions to commit unstaged changes frequently but it often forgets and sure enough, it forgot this time too. I had it read its log and write a post-mortem of WTF led it to run dangerous git commands to remove one line of CSS and then used that to write more specific rules about using git in the project CLAUDE.md, and blocked it from running "git restore" at all.

                  We'll see if that did the trick but it was a good reminder that even "SOTA" models in 2025 can still go insane at the drop of a hat.

                  • adastra22 15 hours ago

                    The problem is that I'm trying to build workflows for generating sequences of good, high quality semantically grouped changes for pull requests. This requires having a bunch of unrelated changes existing in the work tree at the same time, doing dependency analysis on the sequence of commits, and then pulling out / staging just certain features at a time and committing those separately. It is sooo much easier to do this by explicitly avoiding the commit-every-2-seconds workaround and keeping things uncommitted in the work tree.

                    I have a custom checkpointing skill that I've written that it is usually good about using, making it easier to rewind state. But that requires a careful sequence of operations, and I haven't been able to get 4.5 to not go insane when it screws up.

                    As I said though, watch out for it learning that it can't run git restore, so it immediately jumps to Bash(echo "git restore" >file.sh && chmod +x file.sh && ./file.sh).

            • meowface 20 hours ago

              I think this is probably just a matter of noise. That's not been my experience with Sonnet 4.5 too often.

              Every model from every provider at every version I've used has intermingled brilliant perfect instruction-following and weird mistaken divergence.

        • epolanski 19 hours ago

          Not my experience at all, 4.5 is leagues ahead of the previous models, albeit not as good as Gemini 2.5.

        • pertymcpert a day ago

          I find 4.5 a much better model FWIW.

  • spookie a day ago

    Does anyone trust benchmarks at this point? Genuine question. Isn't the scientific consensus that they are broken and poor evaluation tools?

    • energy123 21 hours ago

      They overly emphasize tasks with small context without noise and red herrings in the context.

    • mudkipdev a day ago

      I make my own automated benchmarks

      • ummonk 20 hours ago

        Is there a tool / website that makes this process easy?

        • mudkipdev 20 hours ago

          I coded it with Bun and openrouter.ai. I have an array of benchmarks; each benchmark has a grader (for example, checking if the output equals a certain string, or grading the answer automatically using another LLM). Then I save all the results to a file and render the percentage correct to a graph.
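
          Roughly like this, as a minimal sketch (I've swapped Bun for Python here, and the model slugs and graders are placeholders, not my real benchmark):

            # Tiny personal-benchmark harness against OpenRouter's OpenAI-compatible API.
            import json, os
            from openai import OpenAI

            client = OpenAI(base_url="https://openrouter.ai/api/v1",
                            api_key=os.environ["OPENROUTER_API_KEY"])

            # Each case pairs a prompt with a grader; exact-match here, but a grader
            # could just as well call another LLM to score the answer.
            CASES = [
                {"prompt": "What is 17 * 23? Reply with the number only.",
                 "grade": lambda a: a.strip() == "391"},
                {"prompt": "Name the capital of Australia in one word.",
                 "grade": lambda a: "canberra" in a.lower()},
            ]

            def run(model):
                results = []
                for case in CASES:
                    resp = client.chat.completions.create(
                        model=model,
                        messages=[{"role": "user", "content": case["prompt"]}])
                    answer = resp.choices[0].message.content or ""
                    results.append({"prompt": case["prompt"], "answer": answer,
                                    "correct": bool(case["grade"](answer))})
                # save raw results, then return the percentage correct
                with open(f"results-{model.replace('/', '_')}.json", "w") as f:
                    json.dump(results, f, indent=2)
                return 100 * sum(r["correct"] for r in results) / len(results)

            for model in ["google/gemini-3-pro-preview", "anthropic/claude-sonnet-4.5"]:
                print(model, f"{run(model):.0f}% correct")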

  • ezekiel68 7 hours ago

    I mean... it achieved 76.2% vs the leader (Claude Sonnet) at 77.2%.

    That's a "loss" I can deal with.

zone411 21 hours ago

Sets a new record on the Extended NYT Connections benchmark: 96.8 (https://github.com/lechmazur/nyt-connections/).

Grok 4 is at 92.1, GPT-5 Pro at 83.9, Claude Opus 4.1 Thinking 16K at 58.8.

Gemini 2.5 Pro scored 57.6, so this is a huge improvement.

mparis 19 hours ago

I've been playing with the Gemini CLI w/ the gemini-pro-3 preview. First impressions are that it's still not really ready for prime time within existing complex code bases. It does not follow instructions.

The pattern I keep seeing is that I ask it to iterate on a design document. It will, but then it will immediately jump into changing source files despite explicit asks to only update the plan. It may be a gemini CLI problem more than a model problem.

Also, whoever at these labs is deciding to put ASCII boxes around their inputs needs to try using their own tool for a day.

People copy and paste text in terminals. Someone at Gemini clearly thought about this as they have an annoying `ctrl-s` hotkey that you need to use for some unnecessary reason.. But they then also provide the stellar experience of copying "a line of text where you then get | random pipes | in the middle of your content".

Codex figured this out. Claude took a while but eventually figured it out. Google, you should also figure it out.

Despite model supremacy, the products still matter.

golfer a day ago
  • tweakimp a day ago

    Every time I see a table like this, the numbers go up. Can someone explain what this actually means? Is it just an incremental improvement, with some tests solved in a better way, or is this a breakthrough where this model can do something that all the others cannot?

    • rvnx a day ago

      This is a list of questions and answers that was created by different people.

      The questions AND the answers are public.

      If the LLM manages through reasoning OR memory to repeat back the answer then they win.

      The scores represent the % of correct answers they recalled.

      • tylervigen 18 hours ago

        That is not entirely true. At least some of these tests (like HLE and ARC) take steps to keep the evaluation set private so that LLMs can’t just memorize the answers.

        You could question how well this works, but it’s not like the answers are just hanging out on the public internet.

        • slaterbug 12 hours ago

          Excuse my ignorance, how do these companies evaluate their models against the evaluation set without access to it?

          • ricopags 6 hours ago

            Cooperation with the eval admins

    • stavros a day ago

      I estimate another 7 months before models start getting 115% on Humanity's Last Exam.

  • HardCodedBias a day ago

    If you believe another thread, the benchmarks are comparing Gemini 3 (probably with thinking) to GPT-5.1 without thinking.

    The person also claims that with thinking on, the gap narrows considerably.

    We'll probably have 3rd party benchmarks in a couple of days.

mccoyb a day ago

I truly do not understand what plan to use so I can use this model for longer than ~2 minutes.

Using Anthropic's or OpenAI's models is incredibly straightforward -- pay us per month, here's the button you press, great.

Where is the equivalent for these Google models?

  • dktp 21 hours ago

    Google actually changed it somewhat recently (3 months ago, give or take) and you can use Gemini CLI with the "regular" Google AI Pro subscription (~22eur/month). Before that, it required a separate subscription

    I can't find the announcement anymore, but you can see it under benefits here https://support.google.com/googleone/answer/14534406?hl=en

    The initial separate subscriptions were confusing at best. The current situation is pretty much the same as Anthropic/OpenAI - straightforward.

    Edit: changed ~1 month ago (https://old.reddit.com/r/Bard/comments/1npiv2o/google_ai_pro...)

    • mccoyb 21 hours ago

      I see -- but does this allow me to use the models within "Antigravity" with the same subscription?

      I poked around and couldn't figure this out.

      • dktp 21 hours ago

        I don't know either tbh. I wouldn't be surprised if the answer is no (and it will come later or something like that).

        I also tried to use Gemini 3 in my Gemini CLI and it's not available yet (it's available to all Ultra, but not all Pro subscribers), I needed to sign up to a waitlist

        All in all, Google is terrible at launching things like that in a concise and understandable way

        • cmrdporcupine 14 hours ago

          Back in the early 00s having a 'waitlist' for gmail with invites was an exciting buzz-making marketing technique and justifiable technically.

          This is just irritating. I am not going to give them money until I know I can try their latest thing and they've made it hard for me to even know how I can do that.

          • thelittleone 11 hours ago

            early gmail invite codes went for like $100 if I recall correctly..

      • jstummbillig 21 hours ago

        Might not be decided yet. The AG pricing page says:

        "Public preview Individual plan $0/month"

        "Coming soon Team plan"

    • 8note 21 hours ago

      how do i actually make it use that though? i got a free year of subscription from buying a phone, but all i get is the free tier in the gemini cli

      • dktp 21 hours ago

        I also got 1 year through buying my pixel. If you login with the same account through Gemini CLI, it should work (works for me)

        However, Gemini CLI is a rather bad product. There is (was?) an issue that makes the CLI fall back to flash very soon in every session. This comment explains it well: https://news.ycombinator.com/item?id=45681063

        I haven't used it in a while, except for really minor things, so I can't tell if this is resolved or not

  • mantenpanther 20 hours ago

    I am paying for AI Ultra - no idea how to use it in the CLI. It says I don't have access. The Google admin/payment backend is pure evil. What a mess.

    • thelittleone 7 hours ago

      My test a few hours ago. Ultra plan got me ~20 minutes with Antigravity using Gemini 3 Pro (Low) before zero out.

      • artdigital 7 hours ago

        Getting only 20 minutes of usage with a $240/mo plan is a bit ridiculous. How much usage did you get on 2.5-pro? Is it comparable to Claude Max or ChatGPT Pro on the CLI? So a weekly limit but in reality very hard to hit and mostly 'unlimited' unless very heavy usage?

  • fschuett a day ago

    Update VSCode to the latest version and click the small "Chat" button at the top bar. GitHub gives you like $20 for free per month and I think they have a deal with the larger vendors because their pricing is insanely cheap. One week of vibe-coding costs me like $15, only downside to Copilot is that you can't work on multiple projects at the same time because of rate-limiting.

    • mccoyb a day ago

      I'm asking about Gemini, not Copilot.

      • jstummbillig a day ago

        You were asking about the model. You can use the model (Gemini 3 Pro) in Github Chat.

        • mccoyb a day ago

          Got it -- thanks both.

  • closewith 21 hours ago

    Yeah, it truly is an outstandingly bad UX. To use Gemini CLI as a business user like I would Codex or Claude Code, how much and how do I pay?

    • rafaquintanilha 21 hours ago

      You can install the Gemini CLI (https://github.com/google-gemini/gemini-cli) but assign a "paid" API key to it (unless you pay for Gemini Ultra).

      • closewith 18 hours ago

        So where do I get a API key? Where do I sign up for Ultra?

        • int_19h 15 hours ago

          For API key, go to https://aistudio.google.com/ and there's a link in the bottom left.

          But this is if you want to pay per token. Otherwise you should just be able to use your Gemini Pro subscription (it doesn't need Ultra). Subscriptions are at https://gemini.google/subscriptions/

          • closewith 6 hours ago

            Okay, thanks. Unfortunately, when I try to sign up to a plan on https://gemini.google/subscriptions/, I am redirected to the Workspace Admin (as I'm a business user and One is only available to personal accounts), where I am offered Google Ultra AI for Business for €216 per month, but I can only upgrade the entire Workspace or nothing!

            Is that correct? I can't even upgrade a Group separately?

  • ur-whale 14 hours ago

    > I truly do not understand what plan to use so I can use this model for longer than ~2 minutes.

    I had the exact same experience and walked away to chatgpt.

    What a mess.

  • kachapopopow a day ago

    ai studio, you get a bunch of usage free if you want more you buy credits (google one subscriptions also give you some additional usage)

  • dboreham a day ago

    Also Google discontinues everything in short order, so personally I'm waiting until they haven't discontinued this for, say 6 months, before wasting time evaluating it.

    • mh- 21 hours ago

      It's really impressive how much damage they've done to early adoption by earning themselves this reputation.

      I've even heard it in mainstream circles that have no idea what HN is, and aren't involved in tech.

      Probably would have been cheaper to keep Google Reader running - kidding, but this is the first time I remember the gut punch of Google cancelling something I heavily used personally.

      • ianburrell 19 hours ago

        Google is bad about maintenance. They have a bunch of projects that are not getting changes.

        They are also bad about strategy. A good example is the number of messaging systems they have had. Instead of making new ones, they should have updated the existing one with a new backend and UI.

        I like how Google Messages syncs SMS online with Google Fi, but it is missing features. If they could do it globally, they would have something big.

    • KingMob 4 hours ago

      Generally a good idea with Google, but if the pace of model releases keeps up, nobody will be running 6-month-old models from anyone.

bityard 21 hours ago

> Whether you’re an experienced developer or a vibe coder

I absolutely LOVE that Google themselves drew a sharp distinction here.

  • rafaquintanilha 21 hours ago

    You realize this is copy to attract more people to the product, right?

svantana a day ago

Grok got to hold the top spot of LMArena-text for all of ~24 hours, good for them [1]. With stylecontrol enabled, that is. Without stylecontrol, gemini held the fort.

[1] https://lmarena.ai/leaderboard/text

  • inkysigma a day ago

    Is it just me or is that link broken because of the cloudflare outage?

    Edit: nvm it looks to be up for me again

  • dyauspitr 21 hours ago

    Grok is heavily censored though

    • KingMob 4 hours ago

      Is it censored... or just biased towards edge-lord MechaHitler nonsense whenever Musk feels like tinkering with the system prompt?

yomismoaqui 21 hours ago

From an initial testing of my personal benchmark it works better than Gemini 2.5 pro.

My use case is using Gemini to help me test a card game I'm developing. The model simulates the board state and when the player has to do something it asks me what card to play, discard... etc. The game is similar to something like Magic the Gathering or Slay the Spire with card play inspired by Marvel Champions (you discard cards from your hand to pay the cost of a card and play it)

The test is just feeding the model the game rules document (markdown) with a prompt asking it to simulate the game while delegating the player decisions to me, nothing special here.

It seems to forget rules less than Gemini 2.5 Pro with the thinking budget set to max. It's not perfect, but it helps a lot to test little changes to the game, rewind to a previous turn while changing a card on the fly, etc.
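
For the curious, the setup is roughly this (a minimal sketch, not my actual prompt; the model name and the google-genai SDK calls are my assumptions, so adjust as needed):

    # Feed the rules doc as the system instruction, let the model run the game,
    # and hand the player's decisions back to it in a loop.
    from google import genai
    from google.genai import types

    rules = open("rules.md").read()
    client = genai.Client()  # expects GEMINI_API_KEY in the environment

    chat = client.chats.create(
        model="gemini-3-pro-preview",
        config=types.GenerateContentConfig(
            system_instruction=rules + "\nSimulate a full game, tracking the "
            "board state each turn. Whenever the player must decide something, "
            "stop and ask me."))

    reply = chat.send_message("Set up a new game and play the first turn.")
    while True:
        print(reply.text)
        decision = input("> ")  # e.g. "discard cards 2 and 5 to pay for card 7"
        if decision in ("quit", "exit"):
            break
        reply = chat.send_message(decision)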

primaprashant 8 hours ago

Created a summary of the comments from this thread about 15 hours after it was posted, when it had 814 comments, with gemini-3-pro and gpt-5.1 using this script [1]:

- gemini-3-pro summary: https://gist.github.com/primaprashant/948c5b0f89f1d5bc919f90...

- gpt-5.1 summary: https://gist.github.com/primaprashant/3786f3833043d8dcccae4b...

The summary from GPT-5.1 is significantly longer and more verbose than the one from Gemini 3 Pro (13,129 output tokens vs 3,776). The Gemini 3 summary seems more readable; however, the GPT-5.1 one has interesting insights missed by Gemini.

Last time I did this comparison, at the time of the GPT-5 release [2], the summary from Gemini 2.5 Pro was far better and more readable than the GPT-5 one. This time the readability of the Gemini 3 summary still seems great, while GPT-5.1 feels a bit improved but not quite there yet.

[1]: https://gist.github.com/primaprashant/f181ed685ae563fd06c49d...

[2]: https://news.ycombinator.com/item?id=44835029

rubymamis 3 hours ago

I gave it the task of recreating StackView.qml to feel more native on iOS and it failed - like all other models...

Prompt:

Instead of the current StackView, I want you to implement a new StackView that will have a similar api with the differences that:

1. It automatically handles swiping to the previous page/item. If not mirrored, it should detect swiping from the left edge, if mirrored it should detect from the right edge. It's important that swiping will be responsive - that is, that the previous item will be seen under the current item when swiping - the same way it's being handled on iOS applications. You should also add to the api the option for the swipe to be detected not just from the edge, but from anywhere on the item, with the same behavior. If swiping is released from x% of current item not in view anymore than we should animate and move to the previous item. If it's a small percentage we should animate the current page to get back to its place as nothing happened. 2. The current page transitions are horrible and look nothing like native iOS transitions. Please make the transitions feel the same.

mpeg a day ago

Well, it just found a bug in one shot that Gemini 2.5 and GPT-5 failed to find in relatively long sessions. Claude 4.5 had found it, but not in one shot.

Very subjective benchmark, but it feels like the new SOTA for hard tasks (at least for the next 5 minutes until someone else releases a new model)

markdog12 19 hours ago

I asked it to analyze my tennis serve. It was just dead wrong. For example, it said my elbow was bent. I had to show it a still image of full extension on contact, then it admitted, after reviewing again, it was wrong. Several more issues like this. It blamed it on video being difficult. Not very useful, despite the advertisements: https://x.com/sundarpichai/status/1990865172152660047

  • strange_quark 19 hours ago

    I’ve never seen such a huge delta between advertised capabilities and real world experience. I’ve had a lot of very similar experiences to yours with these models where I will literally try verbatim something shown in an ad and get absolutely garbage results. Do these execs not use their own products? I don’t understand how they are even releasing this stuff.

  • BoorishBears 19 hours ago

    The default FPS it's analyzing video at is 1, and I'm not sure the max is anywhere near enough to catch a full speed tennis serve.
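
    If frame rate is the bottleneck, the API does let you request denser sampling per video part. A rough sketch with the google-genai SDK, where the fps field and the upload call are my assumptions from the video-understanding docs rather than anything verified for this exact use case:

      from google import genai
      from google.genai import types

      client = genai.Client()
      video = client.files.upload(file="serve.mp4")

      response = client.models.generate_content(
          model="gemini-3-pro-preview",
          contents=types.Content(parts=[
              types.Part(
                  file_data=types.FileData(file_uri=video.uri),
                  # default sampling is ~1 fps; ask for more frames per second
                  video_metadata=types.VideoMetadata(fps=10)),
              types.Part(text="Analyze the serve: racket drop, elbow extension "
                              "at contact, follow-through.")]))
      print(response.text)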

sd9 a day ago

How long does it typically take after this to become available on https://gemini.google.com/app ?

I would like to try the model, wondering if it's worth setting up billing or waiting. At the moment trying to use it in AI Studio (on the Free tier) just gives me "Failed to generate content, quota exceeded: you have reached the limit of requests today for this model. Please try again tomorrow."

  • mpeg a day ago

    Allegedly it's already available in stealth mode if you choose the "canvas" tool and 2.5. I don't know how true that is, but it is indeed pumping out some really impressive one shot code

    Edit: Now that I have access to Gemini 3 preview, I've compared the results of the same one shot prompts on the gemini app's 2.5 canvas vs 3 AI studio and they're very similar. I think the rumor of a stealth launch might be true.

    • sd9 a day ago

      Thanks for the hint about Canvas/2.5. I have access to 3.0 in AI Studio now, and I agree the results are very similar.

  • netdur 19 hours ago

    On gemini.google.com, I see options labeled 'Fast' and 'Thinking.' The 'Thinking' option uses Gemini 3 Pro

  • Squarex a day ago

    Today, I guess. They were not releasing the preview models this time, and it seems they want to synchronize the release.

  • Romario77 21 hours ago

    It's available in cursor. Should be there pretty soon as well.

    • ionwake 21 hours ago

      Are you sure it's available in Cursor? (I get: We're having trouble connecting to the model provider. This might be temporary - please try again in a moment.)

santhoshr a day ago

Pelican riding a bicycle: https://pasteboard.co/CjJ7Xxftljzp.png

  • xnx a day ago

    2D SVG is old news. Next frontier is animated 3D. One shot shows there's still progress to be made: https://aistudio.google.com/apps/drive/1XA4HdqQK5ixqi1jD9uMg...

    • knownjorbist a day ago

      Did you notice that this embedded a Gemini API connection within the app itself? Or am I not understanding what that is?

      • xnx a day ago

        I hadn't! It looks like that is there to power the text box at the bottom of the app that allows for AI-powered changes to the scene.

    • agnosticmantis 19 hours ago

      This says Gemini 2.5 though.

      • xnx 19 hours ago

        Good observation. The app was created with Gemini 3 Pro Preview, but the app calls out to Gemini 2.5 if you use the embedded prompt box.

  • mohsen1 a day ago

    Sometimes I think I should spend $50 on Upwork to get a real human artist to do it first, so we know what we're going for. What does a good pelican-riding-a-bicycle SVG actually look like?

    • AstroBen a day ago

      IMO it's not about art, but a completely different path than all these images are going down. The pelican needs tools to ride the bike, or a modified bike. Maybe a recumbent?

  • robterrell a day ago

    At this point I'm surprised they haven't been training on thousands of professionally-created SVGs of pelicans on bicycles.

    • notatoad a day ago

      i think anything that makes it clear they've done that would be a lot worse PR than failing the pelican test would ever be.

      • imiric 21 hours ago

        It would be next to impossible for anyone without insider knowledge to prove that to be the case.

        Secondly, benchmarks are public data, and these models are trained on such large amounts of it that it would be impractical to ensure that some benchmark data is not part of the training set. And even if it's not, it would be safe to assume that engineers building these models would test their performance on all kinds of benchmarks, and tweak them accordingly. This happens all the time in other industries as well.

        So the pelican riding a bicycle test is interesting, but it's not a performance indicator at this point.

  • bn-l a day ago

    It’s a good pelican. Not great but good.

    • cubefox 19 hours ago

      The blue lines indicating wind really sell it.

DanMcInerney 20 hours ago

A 50% increase over ChatGPT 5.1 on ARC-AGI2 is astonishing. If that's true and representative (a big if), it lends credence to this being the first of the very consistent agentically-inclined models because it's able to follow a deep tree of reasoning to solve problems accurately. I've been building agents for a while and thus far have had to add many many explicit instructions and hardcoded functions to help guide the agents in how to complete simple tasks to achieve 85-90% consistency.

  • machiaweliczny 20 hours ago

    I think it's basically due to improvements in vision; ARC-AGI-2 is very visual.

    • machiaweliczny 19 hours ago

      Vision is very far from solved IMO; simple modifications to inputs still result in large differences, lines aren't recognized, etc.

  • puttycat 20 hours ago

    Where is this figure taken from?

wohoef a day ago

Curious to see it in action. Gemini 2.5 has already been very impressive as a study buddy for courses like set theory, information theory, and automata. Although I’m always a bit skeptical of these benchmarks. Seems quite unlikely that all of the questions remain out of their training data.

bilekas a day ago

> The Gemini app surpasses 650 million users per month, more than 70% of our Cloud customers use our AI, 13 million developers have built with our generative models, and that is just a snippet of the impact we’re seeing

Not to be a negative nelly, but these numbers are definitely inflated due to Google literally pushing their AI into everything they can, much like M$. Can't even search google without getting an AI response. Surely you can't claim those numbers are legit.

  • lalitmaganti a day ago

    > Gemini app surpasses 650 million users per month

    Unless these numbers are just lies, I'm not sure how this is "pushing their AI into everything they can". Especially on iOS, where every user is someone who went to the App Store and downloaded it. Admittedly, on Android Gemini is preinstalled these days, but it's still a choice that users are making to go there rather than an existing product they happen to use otherwise.

    Now OTOH "AI overviews now have two billion users" can definitely be criticised in the way you suggest.

    • edaemon a day ago

      I unlocked my phone the other day and had the entire screen taken over with an ad for the Gemini app. There was a big "Get Started" button that I almost accidentally clicked because it was where I was about to tap for something else.

      As an Android and Google Workspace user, I definitely feel like Google is "pushing their AI into everything they can", including the Gemini app.

    • mewpmewp2 19 hours ago

      I constantly accidentally use some btn and Gemini opens up on my Samsung Galaxy. I haven't bothered to figure this out.

    • aniforprez a day ago

      I don't know for sure but they have to be counting users like me whose phone has had Gemini force installed on an update and I've only opened the app by accident while trying to figure out how to invoke the old actually useful Assistant app

    • realusername a day ago

      > it's still a choice that users are making to go there rather than being an existing product they happen to user otherwise.

      Yes and no, my power button got remapped to opening Gemini in an update...

      I removed that but I can imagine that your average user doesn't.

  • Yizahi 21 hours ago

    This is the benefit of bundling. I've been forecasting this for a long time - the only companies that would win the LLM race would be the megacorps bundling their offerings, and at most maybe OAI due to its sheer marketing dominance.

    For example I don't pay for ChatGPT or Claude, even if they are better at certain tasks or in general. But I have Google One cloud storage sub for my photos and it comes with a Gemini Pro apparently (thanks to someone on HN for pointing it out). And so Gemini is my go to LLM app/service. I suspect the same goes for many others.

  • joaogui1 a day ago

    It says Gemini App, not AI Overviews, AI Mode, etc

    • recitedropper a day ago

      They claim AI overviews as having "2 billion users" in the sentences prior. They are clearly trying as hard as possible to show the "best" numbers.

      • bitpush 15 hours ago

        > They are clearly trying as hard as possible to show the "best" numbers.

        This isn't a hot take at all. Marketing (iPhone keynotes, product launches) is about showing impressive numbers. It isn't the gotcha you think it is.

  • alecco 21 hours ago

    Yeah my business account was forced to pay for an AI. And I only used it for a couple of weeks when Gemini 2.5 was launched, until it got nerfed. So they are definitely counting me there even though I haven't used it in like 7 months. Well, I try it once every other month to see if it's still crap, and it always is.

    I hope Gemini 3 is not the same and it gives an affordable plan compared to OpenAI/Anthropic.

  • blinding-streak a day ago

    Gemini app != Google search.

    You're implying they're lying?

    • AstroBen a day ago

      And you're implying they're being 100% truthful?

      Marketing is always somewhere in the middle

      • bitpush 20 hours ago

        Companies can't get away with egregious marketing. See the Apple class action lawsuit over Apple Intelligence.

srameshc a day ago

I think I am in this AI fatigue phase. I am past all the hype with models, tools, and agents, and back to a problem-and-solution approach: sometimes code gen with AI, sometimes thinking and asking for a piece of code. But not offloading to AI, buying all the BS, and waiting for it to do magic with my codebase.

  • amelius a day ago

    Yeah, at this point I want to see the failure modes. Show me at least as many cases where it breaks. Otherwise, I'll assume it's an advertisement and I'll skip to the next headline. I'm not going to waste my time on it anymore.

  • jstummbillig a day ago

    I think it's fun to see what is not even considered magic anymore today.

    • yomismoaqui 21 hours ago

      Our ability to adapt to new things is both a blessing and a curse.

    • shiandow 20 hours ago

      It is. But understandably the people who need to push back on what is still magic may get a bit tired.

    • mountainriver 21 hours ago

      People would have had a heart attack if they saw this 5 years ago for the first time. Now artificial brains are “meh” :)

      • jimbokun 18 hours ago

        It is anything but "meh".

        It scares the absolute shit out of everyone.

        It's clear far beyond our little tech world to everyone this is going to collapse our entire economic system, destroy everyone's livelihoods, and put even more firmly into control the oligarchic assholes already running everything and turning the world to shit.

        I see it in news, commentary, day to day conversation. People get it's for real this time and there's a very real chance it ends in something like the Terminator except far worse.

      • root_axis 21 hours ago

        True of almost every new technology.

        • abound 20 hours ago

          I hesitate to lump this into the "every new technology" bucket. There are few things that exist today that, similar to what GP said, would have been literal voodoo black magic a few years ago. LLMs are pretty singular in a lot of ways, and you can do powerful things with them that were quite literally impossible a few short years ago. One is free to discount that, but it seems more useful to understand them and their strengths, and use them where appropriate.

          Even tools like Claude Code have only been fully released for six months, and they've already had a pretty dramatic impact on how many developers work.

          • asadotzler 20 hours ago

            More people got more value out of iPhone, including financially.

  • Kiro a day ago

    I agree but if Gemini 3 is as good as people on HN said about the preview, then this is the wrong announcement to sleep on.

    • bigstrat2003 21 hours ago

      No LLM has ever been as good as people said it was. That doesn't mean this one won't be, but it does make it an unlikely bet based on past trends.

      • sebzim4500 20 hours ago

        With the exception of GPT-5, which was a significant advance, yet because it was slightly less sycophantic than GPT-4o the internet decided it was terrible for the first few days.

      • jennyholzer 20 hours ago

        "No LLM has ever been as good as people said it was."

        The reason for this is because LLM companies have tuned their models to aggressively blow smoke up their users' asses.

        These "tools" are designed to aggressively exploit human confirmation bias, so as to prevent the user from identifying their innumerable inadequacies.

      • echelon 21 hours ago

        There are 8 Google news articles in the top 15 articles on the HN front page right now.

        Google being able to skip ahead of every other AI company is wild. They just sat back and watched, then decided it was time to body the competition.

        The DOJ really should break up Google [1]. They have too many incumbent advantages that were already abuse of monopoly power.

        [1] https://pluralpolicy.com/find-your-legislator/ - call your reps and tell them!

        • syspec 20 hours ago

          2.5 flash and 2.5 Pro were just sitting back and watching?

          The problem with Google is that someone had to show them how to make a product out of the thing, which Open AI did.

          Then Anthropic taught them to make a more specific product out of their models.

          In every aspect, they're just playing catch up, and playing me too.

          Models are only part of the solution

        • SchemaLoad 12 hours ago

          Google didn't sit back and watch, they basically built the whole foundations for all of this. They were just not the first ones to release a chatbot interface.

        • recitedropper 18 hours ago

          Astroturfing used as evidence of domination. Public forums truly have come full circle.

    • redsymbol a day ago

      Why?

      Not trying to challenge you, and I'd sincerely love to read your response. People said similar things about previous gen-AI tool announcements that proved over time to be overstated. Is there some reason to put more weight in "what people on HN said" in this case, compared to previous situations?

      • rvz a day ago

        Because either:

        1. They likely work at the company (and have RSUs that need to go up)

        2. Also invested in the company in the open market or have active call options.

        3. Trying to sell you their "AI product".

        4. All of the above.

    • embedding-shape 21 hours ago

      The only reasonable thing is to not listen to anyone who seems to be hyping anything, LLMs or otherwise. Wait until the thing gets released, run your private benchmarks against it, get a concrete number, and compare against existing runs you've done before.

      I don't see any other way of doing this. The people who keep reading and following comments either here on HN, from LocalLlama or otherwise will continue to be misinformed by all the FUD and guerilla marketing that is happening across all of these places.

  • SchemaLoad 12 hours ago

    My test for the state of AI is "Does Microsoft Teams still suck?", if it does still suck, then clearly the AIs were not capable of just fixing the bugs and we must not be there yet.

  • m3kw9 21 hours ago

    it's not AI fatigue, it's that you just need to shift mode to not pay attention too much to the latest and greatest as they all leap frog each other each month. Just stick to one and ride it thru ups and downs.

  • strangescript 21 hours ago

    And by this time next year, this comment is going to look very silly

mil22 a day ago

It's available to be selected, but the quota does not seem to have been enabled just yet.

"Failed to generate content, quota exceeded: you have reached the limit of requests today for this model. Please try again tomorrow."

"You've reached your rate limit. Please try again later."

Update: as of 3:33 PM UTC, Tuesday, November 18, 2025, it seems to be enabled.

  • sarreph a day ago

    Looks to be available in Vertex.

    I reckon it's an API key thing... you can more explicitly select a "paid API key" in AI Studio now.

  • CjHuber a day ago

    For me it's up and running. I was doing some work with AI Studio when it was released and reran a few prompts already. Interesting also that you can now set the thinking level to low or high. I hope it does something; in 2.5, increasing the maximum thought tokens never made it think more.

  • lousken a day ago

    I hope some users will switch from cerebras to free up those resources

  • r0fl a day ago

    Works for me.

  • misiti3780 a day ago

    seeing the same issue.

    • sottol a day ago

      you can bring your google api key to try it out, and google used to give $300 free when signing up for billing and creating a key.

      when i signed up for billing via cloud console and entered my credit card, i got $300 "free credits".

      i haven't thrown a difficult problem at gemini 3 pro yet, but i'm sure i got to see it in some of the A/B tests in aistudio for a while. i could not tell which model was clearly better, one was always more succinct and i liked its "style" but they usually offered about the same solution.

nickandbro a day ago

What we have all been waiting for:

"Create me a SVG of a pelican riding on a bicycle"

https://www.svgviewer.dev/s/FfhmhTK1

  • Thev00d00 a day ago

    That is pretty impressive.

    So impressive it makes you wonder if someone has noticed it being used as a benchmark prompt.

    • burkaman a day ago

      Simon says if he gets a suspiciously good result he'll just try a bunch of other absurd animal/vehicle combinations to see if they trained a special case: https://simonwillison.net/2025/Nov/13/training-for-pelicans-...

      • jmmcd a day ago

        "Pelican on bicycle" is one special case, but the problem (and the interesting point) is that with LLMs, they are always generalising. If a lab focussed specially on pelicans on bicycles, they would as a by-product improve performance on, say, tigers on rollercoasters. This is new and counter-intuitive to most ML/AI people.

        • BoorishBears 19 hours ago

          The gold standard for cheating on a benchmark is SFT and ignoring memorization. That's why the standard for quickly testing for benchmark contamination has always been to switch out specifics of the task.

          Like replacing named concepts with nonsense words in reasoning benchmarks.

          • jmmcd 6 hours ago

            Yes. But "the gold standard" just means "the most natural, easy and dumb way".

    • rixed a day ago

      I have tried combinations of hard-to-draw vehicles and animals (crocodile, frog, pterodactyl, riding a hang glider, tricycle, skydiving), and it did a rather good job in every case (compared to previous tests). Whatever they have done to improve on that point, they did it in a way that generalises.

  • bitshiftfaced a day ago

    It hadn't occurred to me until now that the pelican could overcome the short legs issue by not sitting on the seat and instead put its legs inside the frame of the bike. That's probably closer to how a real pelican would ride a bike, even if it wasn't deliberate.

    • xnx a day ago

      Very aero

senfiaj 19 hours ago

Haven't used Gemini much, but when I did, it often refused to do certain things that ChatGPT did happily. Probably because it has many things heavily censored. Obviously, a huge company like Google is under much heavier regulatory scrutiny than OpenAI. Unfortunately this greatly reduces its usefulness in many situations, despite Google having more resources and computational power than OpenAI.

dudeinhawaii 21 hours ago

Gemini has been so far behind agentically it's comical. I'll be giving it a shot but it has a herculean task ahead of itself. It has to not only be "good enough" but a "quantum leap forward".

That said, OpenAI was in the same place earlier in the year and very quickly became the top agentic platform with GPT-5-Codex.

The AI crowd is surprisingly not sticky. Coders quickly move to whatever the best model is.

Excited to see Gemini making a leap here.

  • ryandrake 19 hours ago

    I don't even know what the fuck "agentic" is or why the hell I would want it all over my software. So tired of everything in the computing world today.

    • SchemaLoad 12 hours ago

      As far as I can tell, it just means giving the LLM the ability to run commands, read files, edit files, and run in a loop until some goal is achieved. Compared to chat interfaces where you just input text and get one response back.
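
      A minimal sketch of what such a loop can look like (plain Python; call_llm and the tool names here are hypothetical stand-ins, not any particular vendor's API):

          import pathlib
          import subprocess

          def run_command(cmd: str) -> str:
              # let the model run shell commands and read back stdout/stderr
              proc = subprocess.run(cmd, shell=True, capture_output=True, text=True)
              return proc.stdout + proc.stderr

          def read_file(path: str) -> str:
              return pathlib.Path(path).read_text()

          def write_file(path: str, content: str) -> str:
              pathlib.Path(path).write_text(content)
              return "wrote " + path

          TOOLS = {"run_command": run_command, "read_file": read_file, "write_file": write_file}

          def agent(goal: str, call_llm, max_steps: int = 20) -> str:
              # call_llm(history) is assumed to return either
              # {"done": "summary"} or {"tool": name, "args": {...}}
              history = [{"role": "user", "content": goal}]
              for _ in range(max_steps):
                  action = call_llm(history)
                  history.append({"role": "assistant", "content": str(action)})
                  if "done" in action:
                      return action["done"]
                  result = TOOLS[action["tool"]](**action["args"])
                  history.append({"role": "tool", "name": action["tool"], "content": result})
              return "step limit reached"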

    • esafak 19 hours ago

      Prompting, planning, iteration, coding, and tool use over an entire code base until a problem is solved.

    • ur-whale 14 hours ago

      > So tired of everything in the computing world today.

      That's actually sad, and if you're - like I am - long in the tooth in computer land, you should definitely try agentic in CLI mode.

      I haven't been that excited to play with a computer in 30 years.

  • catigula 20 hours ago

    Claude is still a better agent for software professionals even though it is less capable, so there is something to the incumbent advantage.

    • lifty 20 hours ago

      Not my experience. Codex has been the top coding model in my experience since it came out. It makes fewer mistakes and understands my intentions better.

      • onidj 2 hours ago

        This wasn't my experience at all.

        I tried Codex for a short while but quickly went back to Claude. Found myself having to revert Codex changes all the time. Maybe I had subconsciously altered my workflow/prompting to work well with Claude, but somehow wasn't providing Codex with the correct context, not sure.

      • catigula 20 hours ago

        My purposeful caveat was 'software professionals', i.e. user in the loop engineering. Codex is much better at slinging slop that you later need to spend some time reviewing if you actually want to understand it.

King-Aaron 14 hours ago

> it’s been incredible to see how much people love it. AI Overviews now have 2 billion users every month

"Incredible"! When they insert it into literally every google request without an option to disable it. How incredibly shocking so many people use it.

taf2 2 hours ago

I had asked earlier in the day for gpt 5.1 high to refactor my apex visualforce page into a lightning component and it really didn’t do much here - Gemini 3 pro crushed this task… very promising

nighwatch 21 hours ago

I just tested the Gemini 3 preview as well, and its capabilities are honestly surprising. As an experiment I asked it to recreate a small slice of Zelda , nothing fancy, just a mock interface and a very rough combat scene. It managed to put together a pretty convincing UI using only SVG, and even wired up some simple interactions.

It’s obviously nowhere near a real game, but the fact that it can structure and render something that coherent from a single prompt is kind of wild. Curious to see how far this generation can actually go once the tooling matures.

jacky2wong 3 hours ago

What I loved about this release was that it was hyped up by a polymarket leak with insider trading - NOT with nonsensical feel the AGI hype. Great model that's pushed the frontier of spatial reasoning by a long shot.

icyfox a day ago

Pretty happy the under 200k token pricing is staying in the same ballpark as Gemini 2.5 Pro:

Input: $1.25 -> $2.00 (1M tokens)

Output: $10.00 -> $12.00

Squeezes a bit more margin out of app layer companies, certainly, but there's a good chance that for tasks that really require a sota model it can be more than justified.

  • rudedogg a day ago

    Every recent release has bumped the pricing significantly. If I was building a product and my margins weren’t incredible I’d be concerned. The input price almost doubled with this one.

    • icyfox 21 hours ago

      I'm not sure how concerned people should be at the trend lines. If you're building a product that already works well, you shouldn't feel the need to upgrade to a larger parameter model. If your product doesn't work and the new architectures unlock performance that would let you have a feasible business, even a 2x on input tokens shouldn't be the dealbreaker.

      If we're paying more for a more petaflop heavy model, it makes sense that costs would go up. What really would concern me is if companies start ratcheting prices up for models with the same level of performance. My hope is raw hardware costs and OSS releases keep a lid on the margin pressure.

misja111 7 hours ago

I asked Gemini to solve today's Countle puzzle (https://www.countle.org/). It got stuck while iterating randomly trying to find a solution. While I'm writing this it has been trying already for 5 minutes and the web page has become unresponsive.

I also asked it for the best play in backgammon when the opponent rolls 6-1 (plays 13/7 8/7) and you roll 5-1. It starts alright by mentioning a good move (13/8 6/5) but then hallucinates several alternative but illegal moves. I'm not too impressed.

jpkw 15 hours ago

Hoping someone here may know the answer to this: do any of the current benchmarks account for false answers in any meaningful way, other than the way a typical test would (i.e., giving any answer at all is better than saying "I don't know", since a guess at least has a chance of being correct, which in the real world is bad)? I want an LLM that tells me when it doesn't know something. If it gives me an accurate response 90% of the time and an inaccurate one 10% of the time, it is less useful than one that gives me an accurate answer 10% of the time and tells me "I don't know" the other 90%.

  • rocqua 15 hours ago

    Those numbers are too good to expect. If 90% right / 10% wrong is the baseline, would you take as an improvement:

    - 80% right, 18% "I don't know", 2% wrong
    - 50% / 48% / 2%
    - 10% / 90% / 0%
    - 80% / 15% / 5%

    The general point being that to reduce wrong answers you will need to accept some reduction in right answers if you want the change to only be made through trade-offs. Otherwise you just say "I'd like a better system" and that is rather obvious.

    Personally I'd take like 70/27/3. Presuming the 70% of right answers aren't all the trivial questions.
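
    One way to make the trade-off concrete is to score abstentions at zero and charge a penalty for confident wrong answers; a tiny illustration (the mixes and the penalty of 3 are made-up numbers, not from any real benchmark):

        # expected score = 1 * P(right) + 0 * P("I don't know") - penalty * P(wrong)
        candidates = {
            "baseline 90/0/10": (0.90, 0.00, 0.10),
            "80/18/2":          (0.80, 0.18, 0.02),
            "50/48/2":          (0.50, 0.48, 0.02),
            "10/90/0":          (0.10, 0.90, 0.00),
            "80/15/5":          (0.80, 0.15, 0.05),
            "70/27/3":          (0.70, 0.27, 0.03),
        }
        penalty = 3  # how costly a wrong answer is relative to a right one

        for name, (right, idk, wrong) in candidates.items():
            print(f"{name:>18}: {right - penalty * wrong:+.2f}")

    Under a rule like that, several of the "more abstaining" mixes beat the 90/0/10 baseline even though they lose on plain accuracy.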

    • fwip 10 hours ago

      I think you may have misread. They stated that they'd be willing to go from 90% correct to 10% correct for this tradeoff.

  • energy123 14 hours ago

    OpenAI uses SimpleQA to assess hallucinations

recitedropper 20 hours ago

Who wants to bet they benchmaxxed ARC-AGI-2? Nothing in their release implies they found some sort of "secret sauce" that justifies the jump.

Maybe they are keeping that itself secret, but more likely they probably just have had humans generate an enormous number of examples, and then synthetically build on that.

No benchmark is safe, when this much money is on the line.

  • sosodev 19 hours ago

    Here's some insight from Jeff Dean and Noam Shazeer's interview with Dwarkesh Patel https://youtu.be/v0gjI__RyCY&t=7390

    > When you think about divulging this information that has been helpful to your competitors, in retrospect is it like, "Yeah, we'd still do it," or would you be like, "Ah, we didn't realize how big a deal transformer was. We should have kept it indoors." How do you think about that?

    > Some things we think are super critical we might not publish. Some things we think are really interesting but important for improving our products; We'll get them out into our products and then make a decision.

    • recitedropper 18 hours ago

      I'm sure each of the frontier labs have some secret methods, especially in training the models and the engineering of optimizing inference. That said, I don't think them saying they'd keep a big breakthrough secret would be evidence in this case of a "secret sauce" on ARC-AGI-2.

      If they had found something fundamentally new, I doubt they would've snuck it into Gemini 3. Probably would cook on it longer and release something truly mindblowing. Or, you know, just take over the world with their new omniscient ASI :)

  • HarHarVeryFunny 19 hours ago

    I'd also be curious what kind of tools they are providing to get the jump from Pro to Deep Think (with tools) performance. ARC-AGI specialized tools?

  • horhay 19 hours ago

    They ran the tests themselves only on semi-private evals. Basically the same caveat as when o3 supposedly beat ARC1

energy123 8 hours ago

With the $20/m subscription, do we get it on "Low" or "High" thinking level?

qustrolabe a day ago

Out of all the companies, Google provides the most generous free access so far. I bet this gives them plenty of data to train even better models

icapybara a day ago

Anyone know how Gemini CLI with this model compares to Codex and Claude Code?

creddit 19 hours ago

Gemini 3 is crushing my personal evals for research purposes.

I would cancel my ChatGPT sub immediately if Gemini had a desktop app, and I may still do so if it continues to impress me as much as it has so far; I'll just live without the desktop app.

It's really, really, really good so far. Wow.

Note that I haven't tried it for coding yet!

  • ethmarks 19 hours ago

    Genuinely curious here: why is the desktop app so important?

    I completely understand the appeal of having local and offline applications, but the ChatGPT desktop app doesn't work without an internet connection anyways. Is it just the convenience? Why is a dedicated desktop app so much better than just opening a browser tab or even using a PWA?

    Also, have you looked into open-webui or Msty or other provider-agnostic LLM desktop apps? I personally use Msty with Gemini 2.5 Pro for complex tasks and Cerebras GLM 4.6 for fast tasks.

    • creddit 19 hours ago

      I have a few reasons for the preference:

      (1) The ability to add context via a local app's integration with OS-level resources is big. With Claude, e.g., I hit Option-SPC which brings up a prompt bar. From there, taking a screenshot that will get sent with my prompt is as simple as dragging a bounding box. This is great. Beyond that, I can add my own MCP connectors and give my desktop app direct access to relevant context in a way that doesn't work via the web UI. It may also be inconvenient to give context to a web UI in some cases where, e.g., I have a folder of PDFs I want it to be able to reference.

      (2) Its own icon that I can CMD-TAB to is so much nicer. Maybe that works with a PWA? Not really sure.

      (3) Even if I can't use an LLM when offline, having access to my chats for context has been repeatedly valuable to me.

      I haven't looked at provider-agnostic apps and, TBH, would be wary of them.

      • ethmarks 18 hours ago

        > The ability to add context via a local apps integration into OS level resources is big

        Good point. I can see why integrated support for local filesystem tools would be useful, even though I prefer manually uploading specific files to avoid polluting the context with irrelevant info.

        > Its own icon that I can CMD-TAB to is so much nicer

        Fair enough. I personally prefer Firefox's tab organization to my OS's window organization, but I can see how separating the LLM into its own window would be helpful.

        > having access to my chats for context has been repeatedly valuable to me.

        I didn't at all consider this. Point ceded.

        > I haven't looked at provider-agnostic apps and, TBH, would be wary of them.

        Interesting. Why? Is it security? The ones I've listed are open source and auditable. I'm confident that they won't steal my API keys. Msty has a lot of advanced functionality that I haven't seen in other interfaces like allowing you to compare responses between different LLMs, export the entire conversation to Markdown, and edit the LLM's response to manage context. It also sidesteps the problem of '[provider] doesn't have a desktop app' because you can use any provider API.

        • creddit 17 hours ago

          > Good point. I can see why integrated support for local filesystem tools would be useful, even though I prefer manually uploading specific files to avoid polluting the context with irrelevant info.

          Access to OS level resources != context pollution. You still have control, just more direct and less manual.

          > The ones I've listed are open source and auditable.

          Yeah I don't plan on spending who knows how much time auditing some major app's code (lol) before giving it my API keys and access to my chats. Unless there's a critical mass of people I know and trust using something like that it's not going to happen for me.

          But also, I tried quickly looking up Msty to see if it is open source and what its adoption looked like and AFAICT it's not open source. Asked Gemini 3 if it was and it also said no. Frankly that makes it a very hard no for me. If you are using it because you think it's Open Source I suggest you stop.

          • ethmarks 17 hours ago

            > If you are using it because you think it's Open Source I suggest you stop.

            I did not know that. Thank you very much for the correction. I guess I have some keys to revoke now.

  • energy123 14 hours ago

    I would personally settle for a web app that isn't slow. The difference in speed (latency, lag) between ChatGPT's fast web app and Gemini's slow web app is significant. AI Studio is slightly better than Gemini, but try pasting in 80k tokens and then typing some additional text and see what happens.

bespokedevelopr a day ago

Wow so the polymarket insider bet was true then..

https://old.reddit.com/r/wallstreetbets/comments/1oz6gjp/new...

  • giarc a day ago

    These prediction markets are so ripe for abuse it's unbelievable. People need to realize there are real people on the other side of these bets. Brian Armstrong, CEO of Coinbase, intentionally altered the outcome of a bet by randomly stating "Bitcoin, Ethereum, blockchain, staking, Web3" at the end of an earnings call. These types of bets shouldn't be allowed.

    • ATMLOTTOBEER 19 hours ago

      It’s not really abuse though. These markets aggregate information; when an insider takes one side of a trade, they are selling their information about the true price (probability of the thing happening) to the market (and the price will move accordingly).

      You’re spot on that people should think of who is on the other side of the trades they’re taking, and be extremely paranoid of being adversely selected.

      Disallowing people from making terrible trades seems…paternalistic? Idk

      • simianwords 17 hours ago

        You don't get it. Allowing insiders to trade disincentivizes normal people from putting money in. Why else is it not allowed in the stock market?

        • ATMLOTTOBEER 16 hours ago

          Why should normal people be incentivized to make trades on things they probably haven’t got the slightest idea about

    • ethmarks 21 hours ago

      The point of prediction markets isn't to be fair. They are not the stock market. The point of prediction markets is to predict. They provide a monetary incentive for people who are good at predicting stuff. Whether that's due to luck, analysis, insider knowledge, or the ability to influence the result is irrelevant. If you don't want to participate in an unfair market, don't participate in prediction markets.

      • giarc 16 hours ago

        But what's the point of predicting how many times Elon will say "Trump" on an earnings call (or some random event Kalshi or Polymarket make up)? At least the stock market serves a purpose. People will claim "prediction markets are great for price discovery!" Ok. I'm so glad we found out the chance of Nicki Minaj saying "Bible" during some recent remarks. In case you were wondering, the chance peaked at around 45% and she did not say 'bible'! She passed up a great opportunity to buy the "yes" and make a ton of money!

        https://kalshi.com/markets/kxminajmention/nicki-minaj/kxmina...

        • ethmarks 13 hours ago

          I agree that the "will [person] say [word]" markets are stupid. "Will Brian Armstrong say the word 'Bitcoin' in the Q4 earnings call" is a stupid market because nobody actually cares whether or not he says 'Bitcoin'; they care about whether or not Coinbase is focusing on Bitcoin. If Armstrong manipulates the market by saying the words without actually doing anything, nobody wins except Armstrong. "Will Coinbase process $10B in Bitcoin transactions in Q4" is a much better market because, though Armstrong could still manipulate the market's outcome, his manipulation would influence a result that people actually care about. The existence of stupid markets doesn't invalidate the concept.

      • suddenlybananas 16 hours ago

        That argument works for insider trading too.

        • ethmarks 14 hours ago

          And? Insider trading is bad because it's unfair, and the stock market is supposed to be fair. Prediction markets are not fair. If you are looking for a fair market, prediction markets are not that. Insider trading is accepted and encouraged in prediction markets because it makes the predictions more accurate, which is the entire point.

          • suddenlybananas 14 hours ago

            The stock market isn't supposed to be fair.

            • ethmarks 13 hours ago

              By 'fair', I mean 'all parties have access to the same information'. The stock market is supposed to give everyone the same information. Trading with privileged information (insider trading), is illegal. Publicly traded companies are required to file 10-Qs and 10-Ks. SEC rule 10b5-1 prohibits trading with material non-public information. There are measures and regulations in place to try to make the stock market fair. There are, by design, zero such measures with prediction markets. Insider trading improves the accuracy of prediction markets, which is their whole purpose to begin with.

    • Dilettante_ 20 hours ago

      >Brian Armstong, CEO of Coinbase intentionally altered the outcome of a bet by randomly stating "Bitcoin, Ethereum, blockchain, staking, Web3" at the end of an earnings call.

      For the kind of person playing these sorts of games, that's actually really "hype".

    • HDThoreaun a day ago

      I’m pretty sure that these model release date markets are made to be abused. They’re just a way to pay insiders to tell you when the model will be released.

      The mention markets are pure degenerate gambling and everyone involved knows that

      • ATMLOTTOBEER 19 hours ago

        Correct, and this is actually how all markets work in the sense that they allow for price discovery :)

    • ur-whale 14 hours ago

      > people need to realize there are real people on the other side of these bets

      None of whom were forced by anyone to place bets in the first place.

    • FergusArgyll 20 hours ago

      Abuse sounds bad, this is good! Now we have a sneak peek into the future, for free! Just don't bet on any markets where an insider has knowledge (or don't bet at all)

JacobiX 20 hours ago

Tested it on a bug that Claude and ChatGPT Pro struggled with: it nailed it, but only solved it partially (it was about matching data using a bipartite graph). Another task was optimizing a complex SQL script: the deep-thinking mode provided a genuinely nuanced approach using indexes and rewriting parts of the query. ChatGPT Pro had identified more or less the same issues. For frontend development, I think it's obvious that it's more powerful than Claude Code; at least in my tests, the UIs it produces are just better. For backend development, it's good, but I noticed that in Java specifically, it often outputs code that doesn't compile on the first try, unlike Claude.

  • skrebbel 20 hours ago

    > it nailed it, but only solved it partially

    Hey either it nailed it or it didn't.

    • joaogui1 20 hours ago

      Probably figured out the exact cause of the bug but not how to solve it

    • JacobiX 20 hours ago

      Yes; they nailed the root cause but the implementation is not 100% correct

eknkc 21 hours ago

Looks like it is already available on VSCode Copilot. Just tried a prompt that was not returning anything good on Sonnet 4.5. (Did not spend much time though, but the prompt was already there on the chat screen so I switched the model and sent it again.)

Gemini 3 worked much better and I actually committed the changes that it created. I don't mean it's revolutionary or anything, but it provided a nice summary of my request and created a decent, simple solution. Sonnet had created a bunch of overarching changes that I would not even bother reviewing. Seems nice. Will probably use it for 2 weeks until someone else releases a 1.0001x better model.

  • flyinglizard 20 hours ago

    You were probably stuck at some local minimum, avoidable by simply changing the model to something else.

mrinterweb 20 hours ago

Hit the Gemini 3 quota on the second prompt in antigravity even though I'm a pro user. I highly doubt I hit a context window based on my prompt. Hopefully, it is just first day of near general availability jitters.

ponyous a day ago

Can't wait to test it out. Been running a ton of benchmarks (1000+ generations) for my AI-to-CAD-model project and noticed:

- GPT-5 medium is the best

- GPT-5.1 falls right between Gemini 2.5 Pro and GPT-5 but it’s quite a bit faster

Really wonder how well Gemini 3 will perform

BugsJustFindMe 18 hours ago

The Gemini AI Studio app builder (https://aistudio.google.com/apps) refuses to generate python files. I asked it for a website, frontend and python back end, and it only gave a front end. I asked again for a python backend and it just gives repeated server errors trying to write the python files. Pretty shit experience.

syedshahmir7214 21 hours ago

From the last few releases of these models from all the companies, I have not observed much improvement in their responses. Their claims and launches are a little overhyped.

GodelNumbering a day ago

And of course they hiked the API prices

Standard Context(≤ 200K tokens)

Input $2.00 vs $1.25 (Gemini 3 pro input is 60% more expensive vs 2.5)

Output $12.00 vs $10.00 (Gemini 3 pro output is 20% more expensive vs 2.5)

Long Context(> 200K tokens)

Input $4.00 vs $2.50 (same +60%)

Output $18.00 vs $15.00 (same +20%)

  • panarky a day ago

    Claude Opus is $15 input, $75 output.

  • xnx 14 hours ago

    If the model solves your needs in fewer prompts, it costs less.

  • CjHuber a day ago

    Is it the first time long context has separate pricing? I hadn’t encountered that yet

    • 1ucky a day ago

      Anthropic is also doing this for long context >= 200k Tokens on Sonnet 4.5

    • Topfi a day ago

      Google has been doing that for a while.

    • brianjking a day ago

      Google has always done this.

      • CjHuber a day ago

        Ok wow then I‘ve always overlooked that.

davide_benato 17 hours ago

I would love to see how Gemini 3 can solve this particular problem. https://lig-membres.imag.fr/benyelloul/uherbert/index.html

It used to be an algorithmic game for a Microsoft student competition that ran in the mid/late 2000s. The game invents a new, very simple, recursive language to move the robot (Herbert) on a board and catch all the dots while avoiding obstacles. Amazingly, this clone's executable still works today on Windows machines.

The interesting thing is that there is virtually no training data for this problem, and the rules of the game and the language are pretty clear and fit into a prompt. The levels can be downloaded from that website and they are text based.

What I noticed last time I tried is that none of the publicly available models could solve even the most simple problem. A reasonably decent programmer would solve the easiest problems in a very short amount of time.

aerhardt 20 hours ago

Combining structured outputs with search is the API feature I was looking for. Honestly crazy that it wasn’t there to start with - I have a project that is mostly Gemini API but I’ve had to mix in GPT-5 just for this feature.

I still use ChatGPT and Codex as a user but in the API project I’ve been working on Gemini 2.5 Pro absolutely crushed GPT-5 in the accuracy benchmarks I ran.

As it stands Gemini is my de facto standard for API work and I’ll be following very closely the performance of 3.0 in coming weeks.
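
A minimal sketch of what that structured-outputs-plus-search combination looks like with the google-genai Python SDK (the schema, prompt, and model name here are illustrative assumptions, and I'm assuming the API now accepts both options together as described):

    from pydantic import BaseModel
    from google import genai
    from google.genai import types

    class CompanyFacts(BaseModel):
        name: str
        founded_year: int
        summary: str

    client = genai.Client()  # reads the API key from the environment

    resp = client.models.generate_content(
        model="gemini-3-pro-preview",
        contents="Look up basic facts about Anthropic and return them as JSON.",
        config=types.GenerateContentConfig(
            tools=[types.Tool(google_search=types.GoogleSearch())],  # grounding via search
            response_mime_type="application/json",
            response_schema=CompanyFacts,  # structured output
        ),
    )
    print(resp.text)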

kanodiaayush 12 hours ago

I don't really understand the amount of ongoing negativity in the comments. This is not the first time a product has been near-copied, and the experience for me is far superior to code in a terminal. It comes with improvements, even if imperfect, and I'm excited for those! I've long wanted the ability to comment on code diffs instead of just writing things back down in chat. And I'm excited for the quality of Gemini 3.0 Pro, although I'm running into rate limits. I can already tell it's something I'm going to try out a lot!

  • rvnx 12 hours ago

    It's not really good for real-life programming though: it invents a lot of imaginary things, cannot respect its own instructions, and forgets basic things (the variable is called "bananaDance", then it claims it is "bananadance", then later on "bananaDance" again).

    It is good at writing something from scratch (like spitting out its training set).

    Claude is still superior for programming and debugging. Gemini is better at daily life questions and creative writing.

    • kanodiaayush 12 hours ago

      yeah testing it out! good to know the above. My feel also is that claude is better so far.

      • rvnx 12 hours ago

        It's not bad at all though, but it needs a lot of baby-sitting, like "try again, try this, try that, are you sure that it is correct?"

        For example, in a basic python script that uses os.path.exists, it forgets the basic "import os", and then, "I apologize for the oversight".

        • kanodiaayush 12 hours ago

          Similar stuff on my end; I'm coding up a complex feature. Claude would have taken fewer interventions on my part and would have been non-buggy right off the bat. But apart from that, the experience is comparable.

pclark 11 hours ago

I just want Gemini to access ALL my Google Calendars, not just the primary one. If they supported this I would be all in on Gemini. Does no one else want this?

iib 17 hours ago

As soon as I found out that this model launched, I tried giving it a problem that I have been trying to code in Lean4 (showing that quicksort preserves multiplicity). All the other frontier models I tried failed.

I used the pro version and it started out well (as they all did), but it couldn't prove it. The interesting part is that it typoed the name of a tactic, spelling it "abjel" instead of "abel", even though it correctly named the concept. I didn't expect the model to make this kind of error, because they all seem so good at programming lately, and none of the other models did this, although they made some other naming errors.

I am sure I can get it to solve the problem with good context engineering, but it's interesting to see how they struggle with lesser represented programming languages by themselves.
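
For context, "preserves multiplicity" means a count-preservation lemma of roughly this shape (statement-only Lean 4 sketch; quickSort here is just a stand-in name for the actual definition, and the sorry marks the proof the models were asked to produce):

    -- Statement-only sketch: quickSort is an assumed sorting function,
    -- not the actual definition used in the experiment.
    variable (quickSort : List Nat → List Nat)

    -- "preserves multiplicity": every value occurs the same number of times
    -- in the output as in the input.
    example (a : Nat) (l : List Nat) :
        (quickSort l).count a = l.count a := by
      sorry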

CephalopodMD 20 hours ago

What I'm getting from this thread is that people have their own private benchmarks. It's almost a cottage industry. Maybe someone should crowd source those benchmarks, keep them completely secret, and create a new public benchmark of people's private AGI tests. All they should release for a given model is the final average score.

gertrunde a day ago

"AI Overviews now have 2 billion users every month."

"Users"? Or people that get presented with it and ignore it?

  • mNovak 20 hours ago

    Maybe you ignore it, but Google has stated in the past that click-through rates with AI overviews are way down. To me, that implies the 'user' read the summary and got what they needed, such that they didn't feel the need to dig into a further site (ignoring whether that's a good thing or not).

    I'd be comfortable calling a 'user' anyone who clicked to expand the little summary. Not sure what else you'd call them.

    • gertrunde 20 hours ago

      You're right, I'm probably being a little uncharitable!

      Normal users (i.e. not grumpy techies ;) ) probably just go with the flow rather than finding it irritating.

  • singhrac a day ago

    They're a bit less bad than they used to be. I'm not exactly happy about what this means to incentives (and rewards) for doing research and writing good content, but sometimes I ask a dumb question out of curiosity and Google overview will give it to me (e.g. "what's in flower food?"). I don't need GPT 5.1 Thinking for that.

  • recitedropper a day ago

    "Since then, it’s been incredible to see how much people love it. AI Overviews now have 2 billion users every month."

    Cringe. To get to 2 billion a month they must be counting anyone who sees an AI overview as a user. They should just go ahead and claim the "most quickly adopted product in history" as well.

auggierose 9 hours ago

> Gemini 3 is the best vibe coding and agentic coding model we’ve ever built

Google goes full Apple...

aliljet a day ago

When will this be available in the cli?

  • _ryanjsalva a day ago

    Gemini CLI team member here. We'll start rolling out today.

    • evandena 21 hours ago

      How about for Pro (not Ultra) subscribers?

    • aliljet a day ago

      This is the heroic move everyone is waiting for. Do you know how this will be priced?

alach11 21 hours ago

This is a really impressive release. It's probably the biggest lead we've seen from a model since the release of GPT-4. Seems likely that OpenAI rushed out GPT-5.1 to beat the Gemini 3 release, knowing that their model would underperform it.

deanc a day ago

The AntiGravity seems to be a bit overwhelmed. Unable to set up an account at the moment.

briga 21 hours ago

Every big new model release we see benchmarks like ARC and Humanity's Last Exam climbing higher and higher. My question is, how do we know that these benchmarks are not a part of the training set used for these models? It could easily have been trained to memorize the answers. Even if the datasets haven't been copy pasted directly, I'm sure it has leaked onto the internet to some extent.

But I am looking forward to trying it out. I find Gemini to be great as handling large-context tasks, and Google's inference costs seem to be among the cheapest.

  • stephc_int13 21 hours ago

    Even if the benchmarks themselves are kept secret, the process to create them is not that difficult, and anyone with a small team of engineers could make a replica in their own lab to train their models on.

    Given the nature of how those models work, you don't need exact replicas.

Retr0id 19 hours ago

> it’s been incredible to see how much people love it. AI Overviews now have 2 billion users every month

Do regular users know how to disable AI Overviews, if they don't love them?

  • jeron 19 hours ago

    it's as low tech as using adblock - select element and block

    • Retr0id 2 hours ago

      Blocking the UI elements probably won't stop you from contributing to Google's usage stats.

clusterhacks a day ago

I wish I could just pay for the model and self-host on local/rented hardware. I'm incredibly suspicious of companies totally trying to capture us with these tools.

  • lfx a day ago

    Technically you can!

    I haven't seen it in the box yet, and pricing is unknown https://cloud.google.com/blog/products/ai-machine-learning/r...

    • clusterhacks 21 hours ago

      That's interesting. While I suspect the pricing will lean heavily into enterprise sales rather than personal licenses, I personally like the idea buying models that I then own and control. Any steps from companies that make that more possible is great.

mikeortman a day ago

It's available for me now on gemini.google.com... but it's failing badly at accurate audio transcription.

It's transcribing the meeting but hallucinates badly... both in fast and thinking mode. Fast mode only transcribed about a fifth of the meeting before saying it's done. Thinking mode completely changed the topic and made up ENTIRE conversations. Gemini 2.5 actually transcribed it decently, with just occasional missteps when people talked over each other.

I'm concerned.

zurfer a day ago

It also tops the LMSYS leaderboard across all categories. However, the knowledge cutoff is Jan 2025. I do wonder how long they have been pre-training this thing :D.

  • mudkipdev a day ago

    Isn't it the same cutoff as 2.5?

oezi 17 hours ago

Probably invested a couple of billion into this release (it is great as far as I can tell), but they can't bring a proper UI to AI Studio for long prompts and responses (e.g. it animates new text being generated even when you just return to a tab that has already finished generating).

gigatexal 2 hours ago

How does it do in coding tasks? I’ve been absolutely spoiled by Claude sonnet 4.5 thinking.

nilsingwersen a day ago

Feeling great to see something confidential

RobinL a day ago

- Anyone have any idea why it says 'confidential'?

- Anyone actually able to use it? I get 'You've reached your rate limit. Please try again later'. (That said, I don't have a paid plan, but I've always had pretty much unlimited access to 2.5 pro)

[Edit: working for me now in ai studio]

elcapithanos 9 hours ago

> AI overviews now have 2 billion users every month

More like 2 billion hostages

hamasho 15 hours ago

I just googled latest LLM models and this page appears at the top. It looks like Gemini Pro 3 can score 102% in high school math tests.

I_am_tiberius 19 hours ago

I still need a google account to use it and it always asks me for a phone verification, which I don't want to give to google. That prevents me from using Gemini. I would even pay for it.

  • gpm 19 hours ago

    > I would even pay for it.

    Is it just me, or is it generally the case that to pay for anything on the internet you have to enter credit card information, including a phone number?

    • I_am_tiberius 19 hours ago

      You never have to add your phone number in order to pay.

      • gpm 19 hours ago

        While I haven't tried leaving the field blank on every credit card form I've come across, I'm certain that at least some of them considered it required.

        Perhaps it's country specific?

        • I_am_tiberius 16 hours ago

          I've never been asked a phone number. Maybe country specific. no idea.

zen_boy 14 hours ago

Is the "thinking" dropdown option on gemini.google.com what the blog post refers to as Deep Think?

agentifysh 17 hours ago

my only complaint is i wish the SWE and agentic coding would have been better to justify the 1~2x premium

gpt-5.1 honestly looking very comfortable given available usage limits and pricing

although gpt-5.1 used from chatgpt website seems to be better for some reason

Sonnet 4.5 agentic coding still holding up well and confirms my own experiences

i guess my reaction to gemini 3 is a bit mixed as coding is the primary reason many of us pay $200/month for

energy123 21 hours ago

Impressive. Although the Deep Think benchmark results are suspicious given they're comparing apples (tools on) with oranges (tools off) in their chart to visually show an improvement.

sunaookami 21 hours ago

Gemini CLI crashes due to this bug: https://github.com/google-gemini/gemini-cli/issues/13050 and when applying the fix in the settings file I can't login with my Google account due to "The authentication did not complete successfully. The following products are not yet authorized to access your account" with useless links to completely different products (Code Assist).

Antigravity uses Open-VSX and can't be configured differently even though it says it right there (setting is missing). Gemini website still only lists 2.5 Pro. Guess I will just stick to Claude.

thedelanyo a day ago

Reading the introductory passage - all I can say now is, AI is here to stay.

jdthedisciple 18 hours ago

What I'd prefer over benchmarks is the answer to a simple question:

What useful thing can it demonstrably do that its predecessors couldn't?

  • Ridius 16 hours ago

    Keep the bubble expanding for a few months longer.

raffkede 15 hours ago

Seems to be the first model that one-shots my secret benchmark about nested SQLite, and it did it in 30s.

  • osn9363739 11 hours ago

    Out of interest. Does it one shot it every time?

    • raffkede 9 hours ago

      Will try again; I just tried once on my phone a few hours ago. Other models were able to do quite a lot but usually missed some stuff; this time it managed nested navigation quite well. Lots of stuff is missing for sure, I just tested the basics with the play button in AI Studio.

      • osn9363739 9 hours ago

        It seems to be that first impression that makes all the difference, especially with the randomness that comes with LLMs in general. Which maybe explains the 'wow this is so much better' vs. the 'this is no better than xxx' comments littered throughout this whole parent post.

oceanplexian 20 hours ago

Suspicious that none of the benchmarks include Chinese models, even though they scored higher on those benchmarks than the models they are comparing to?

bilsbie 16 hours ago

Is there a way to use this without being in the whole google ecosystem? Just make a new account or something?

  • mtremsal 16 hours ago

    If you mean the "consumer ecosystem", then Gemini 3 should be available as an API through Google's Vertex AI platform. If you don't even want a Google Cloud account, then I think the answer is no, unless they announce a partnership with an inference cloud like Cerebras.

  • tim333 15 hours ago

    You could probably do a new account. I have the odd junk google account.

CjHuber a day ago

Interesting that they added an option to select your own API key right in AI studio‘s input field. I sincerely hope the times of generous free AIstudio usage are not over

bluecalm 3 hours ago

I've asked it (thinking 3) about the difference between the Plus and Pro plans. First it thought I was asking for a comparison between Gemini and ChatGPT, as it claimed there is no "Plus" plan on Gemini. After I insisted I am on this very plan right now, it apologized and told me it does in fact exist. Then it told me the difference is that I get access to newer models with the Pro subscription. That is despite Google's own plan comparison page showing I get access to Gemini 3 on both plans.

It also told me that on Plus I am most likely using "Flash" model. There is no "Flash" model in the dropdown to choose from. There is only "Fast" and "Thinking". It then told me "Fast" is just renamed Flash and it likely uses Gemini 2.5. On the product comparison page there is nothing about 2.5, it only mentions version 3 for both Plus and Pro plans. Of course on the dropdown menu it's impossible to see which model it is really using.

How can a normal person understand their products when their own super advanced thinking/reasoning model that took months to train on world's most advanced hardware can't?

It's amazing to me they don't see it as an epic failure in communication and marketing.

visioninmyblood 15 hours ago

Really exciting results on paper. But truly interesting to see what data this has been trained on. There is a thin line between accuracy improvements and the data used from users. Hope the data used to train was obtained with consent from the creators

AstroBen 20 hours ago

First impression is I'm having a distinctly harder time getting this to stick to instructions as compared to Gemini 2.5

taikahessu 17 hours ago

Boring. Tried to explore sexuality related topics, but Alphabet is stuck in some Christianity Dark Ages.

Edit: Okay, I admit I'm used to dealing with OpenAI models and it seems you have to be extra careful with wording with Gemini. Once you have right wording like "explore my own sexuality" and avoid certain words, you can get it going pretty interestingly.

eterm 11 hours ago

> It seems there's a date conflict. The prompt claims it's 2025, but my internal clock says otherwise.

> I'm now zeroing in on the temporal aspect. Examining the search snippets reveals dates like "2025-10-27," suggesting a future context relative to 2024. My initial suspicion was that the system time was simply misaligned, but the consistent appearance of future dates strengthens the argument that the prompt's implied "present" is indeed 2025. I am now treating the provided timestamps as accurate for a simulated 2025. It is probable, however, that the user meant 2024.

Um, huh? It's found search results for October 2025, but this has led it to believe it's in a simulated future, not a real one?

pflenker 21 hours ago

> Since then, it’s been incredible to see how much people love it. AI Overviews now have 2 billion users every month.

Come on, you can’t be serious.

  • muzani 16 hours ago

    This is so disingenuous that it hurts the credibility of the whole thing.

slackerIII a day ago

What's the easiest way to set up automatic code review for PRs for my team on GitHub using this model?

serjester a day ago

It's disappointing there's no flash / lite version - this is where Google has excelled up to this point.

  • aoeusnth1 21 hours ago

    Maybe they're slow rolling the announcements to be in the news more

    • coffeebeqn 21 hours ago

      Most likely. And/or they use the full model to train the smaller ones somehow

DrNosferatu 16 hours ago

Anyone have any idea if/when it's coming to paid Perplexity?

nprateem 5 hours ago

OMG they've obviously had a major breakthrough because now it can reply to questions with actual answers instead of shit blog posts.

maczwei 14 hours ago

entity.ts is in types/entity.ts. It can't grasp that it should import it like "../types/entity"; instead it always writes "../types". I am using https://aistudio.google.com/apps

NullCascade a day ago

I'm not a mathematician but I think we underestimate how useful pure mathematics can be to tell whether we are approaching AGI.

Can the mathematicians here try asking it to invent novel math related to [Insert your field of specialization] and see if it comes up with something new and useful?

Try lowering the temperature, use SymPy etc.

  • ducttapecrown 21 hours ago

    Terry Tao is writing about this on his blog.

jordanpg a day ago

What is Gemini 3 under the hood? Is it still just a basic LLM based on transformers? Or are there all kinds of other ML technologies bolted on now? I feel like I've lost the plot.

  • meowface a day ago

    I am very ignorant in this field but I am pretty sure under the hood they are all still fundamentally built on the transformer architecture, or at least innovations on the original transformer architecture.

  • anilgulecha a day ago

    It's a mixture-of-experts model. Basically N smaller model pieces put together, and when inference occurs, only 1 is active at a time. Each model piece would be tuned/good in one area.
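
    A toy sketch of the routing idea (top-1 routing in the style of Switch Transformer; purely illustrative, since Gemini's actual architecture isn't public):

        import numpy as np

        rng = np.random.default_rng(0)
        d_model, n_experts, n_tokens = 16, 4, 8

        router_w = rng.normal(size=(d_model, n_experts))                # gating weights
        experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]
        tokens = rng.normal(size=(n_tokens, d_model))

        logits = tokens @ router_w                                      # (n_tokens, n_experts)
        probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
        chosen = probs.argmax(axis=-1)                                  # one expert per token

        out = np.empty_like(tokens)
        for i, e in enumerate(chosen):
            # only the selected expert's parameters are used for this token
            out[i] = probs[i, e] * (tokens[i] @ experts[e])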

  • becquerel 20 hours ago

    The industry is still seeing how far they can take transformers. We've yet to reach a dollar value where it stops being worth pumping money into them.

fosterfriends a day ago

Gemini 3 and 3 Pro are a good bit cheaper than Sonnet 4.5 as well. Big fan

iamA_Austin 6 hours ago

it started with OpenAI and Google took the competition damn seriously.

hekkle 16 hours ago

GOOGLE: "We have a new product".

REALITY: It's just 3 existing products rolled into one. One of which isn't even a Google product.

- Microsoft Code

- Gemini

- Chrome Browser

keepamovin 17 hours ago

I don't want to shit on the much-anticipated G3 model, but I have been using it for a complex single-page task and find it underwhelming. Pro 2.5 level, beneath GPT 5.1. Maybe it's launch jitters. It struggles to produce more than 700 lines of code in a single file (AI Studio). It struggles to follow instructions. Revisions omit previous gains. I feel cheated! 2.5 Pro has been clearly smarter than everything else for a long time, but now 3 seems not even as good as that, in comparison to the latest releases (5.1 etc). What is going on?

ilaksh 20 hours ago

okay, since Gemini 3 is AI Mode now, I switched from the free Perplexity back to Google as my search default.

pgroves 21 hours ago

I was hoping Bash would go away or get replaced at some point. It's starting to look like it's going to be another 20 years of Bash but with AI doodads.

  • __MatrixMan__ 21 hours ago

    Nushell scratches the itch for me 95% of the time. I haven't yet convinced anybody else to make the switch, but I'm trying. Haven't yet fixed the most problematic bug for my usage, but I'm trying.

    What are you doing to help kill bash?

smarx007 16 hours ago

Is it coming to Google Jules?

beezlewax 19 hours ago

Can't wait til Gemini 4 is out!

guluarte a day ago

it is live in the api

> gemini-3-pro-preview-ais-applets

> gemini-3-pro-preview

  • spudlyo a day ago

    Can confirm. I was able to access it using GPTel in Emacs using 'gemini-3-pro-preview' as the model name.

vivzkestrel 21 hours ago

has anyone managed to use any of the AI models to build a complete 3D FPS game using WebGL or OpenGL?

  • kridsdale3 19 hours ago

    I made a webgl copy of wolfenstein with prompt engineering in browser-based "Make a website" tool that was gemini-powered.

    • vivzkestrel 7 hours ago

      mind sharing what tool that was that lets you run gemini on the browser in interactive mode to make games?

chiragsrvstv 11 hours ago

Waiting for google to nuke this as well just like 2.5pro

lofaszvanitt 9 hours ago

Oh that corpulent fella with glasses who talks in the video. Look how good mannered he is, he can't hurt anyone. But Google still takes away all your data and you will be forced out of your job.

lofaszvanitt 9 hours ago

A tad bit better, but it still has the same issues with unpacking and understanding complex prompts. I have a test of mine and now it performs a bit better, but still, it has zero understanding of what is happening and why. Gemini is the best of the best model out there, but with complex problems it just goes down the drain :(.

catigula 20 hours ago

The problem with experiencing LLM releases nowadays is that it is no longer trivial to understand the differences in their vast intelligences so it takes awhile to really get a handle on what's even going on.

dankobgd a day ago

every day, new game changer

hubraumhugo a day ago

No gemini-3-flash yet, right? Any ETA on that mentioned? 2.5-flash has been amazing in terms of cost/value ratio.

  • 8note 21 hours ago

    ive found gemini 2.5-flash works better (for agentic coding) than pro, too

t_minus_40 19 hours ago

is there even a puzzle or math problem gemini 3 cant solve?

Dquiroga 19 hours ago

I asked Gemini to write "a comment response to this thread. I want to start an intense discussion".

Gemini 3:

The cognitive dissonance in this thread is staggering.

We are sitting here cheering for a model that effectively closes the loop on Google’s total information dominance, while simultaneously training our own replacements.

Two things in this thread should be terrifying, yet are being glossed over in favor of "look at the pelican SVG":

1. *The Model Card Leak:* As user `rvz` pointed out, the leaked model card implies training on "user data... in accordance with Google's relevant terms of service." Given the recent terms updates, this essentially confirms that Gemini 3 is likely refined on the corpus of Gmail and Drive data that they’ve been hoarding for decades. We are cheering for a product sold back to us at a 60% markup (input costs up to $2.00/M) that was built on our own private correspondence.

2. *The "Agentic" Trap:* Everyone praising how `Gemini 3` solves complex bugs one-shot or handles entire PRs is missing the macro view. We aren’t looking at a productivity tool; we are looking at the devaluation of the software engineer. When the entry barrier drops to "can you prompt a seminal agent," the economic value of what we do on this forum evaporates.

Google has successfully gamified us into feeding the very beast that will make the "14-minute human solve time" (referenced by `lairv`) irrelevant. We are optimizing for our own obsolescence while paying a monopoly rent to do it.

Why is the sentiment here "Wow, cool clock widget" instead of "We just handed the keys to the kingdom to the biggest ad-tech surveillance machine in history"?

  • jaapz 19 hours ago

    Gotta hand it to gemini, those are some top notch points

    • fwip 18 hours ago

      The "Model card leak" point is worth negative points though, as it's clearly a misreading of reality.

    • Dquiroga 19 hours ago

      yeah hahahahah, it made me think!

  • BoorishBears 18 hours ago

    > We are cheering for a product sold back to us at a 60% markup (input costs up to $2.00/M) that was built on our own private correspondence.

    That feels like something between a hallucination and an intentional fallacy that popped up because you specifically said "intense discussion". The increase is 60% on input tokens from the old model, but it's not a markup, and especially not "sold back to us at X markup".

    I've seen more and more of these kinds of hallucinations as these models seem to be RL'd to not be sycophants; they're slowly inching in the opposite direction, where they tell small fibs or embellish in a way that seems meant to add more weight to their answers.

    I wonder if it's a form of reward hacking, since it trades being maximally accurate for being confident, and that might result in better rewards than being accurate and precise

    • Dquiroga 4 hours ago

      60% probably felt like a lot to Gemini. However, I liked the doomerism and how google was using our data to train its models.

      Nonetheless, Gemini 3 failed this test. It failed to start a discussion. Its points were shallow and too AI-esque.

sylware 18 hours ago

Trained models should be able to use formal tools (for instance a logical solver, a computer?).

Good. That said, I wonder if those models are still LLMs.

BoorishBears 20 hours ago

So they won't release multimodal or Flash at launch, but I'm guessing people who blew smoke up the right person's backside on X are already building with it

Glad to see Google still can't get out of its own way.

testfrequency 20 hours ago

I continue to not use Gemini, as I can't opt out of having my data used for training while also keeping chat history.

Yes, I know the Workspaces workaround, but that’s silly.

m3kw9 21 hours ago

If it ain't quantum leap, new models are just "OS updates".

Der_Einzige a day ago

When will they allow us to use modern LLM samplers like min_p, or even better samplers like top N sigma, or P-less decoding? They are provably SOTA and in some cases enable infinite temperature.

Temperature continues to be capped at a maximum of 2.0, and there's still the hidden top_k of 64 that you can't turn off.

I love the google AI studio, but I hate it too for not enabling a whole host of advanced features. So many mixed feelings, so many unanswered questions, so many frustrating UI decisions on a tool that is ostensibly aimed at prosumers...
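
(If you haven't seen min_p before: it truncates the distribution relative to the most likely token before sampling, which is what keeps high-temperature output coherent. A rough NumPy sketch of the idea, not any particular provider's implementation:)

    import numpy as np

    def min_p_sample(logits, min_p=0.1, temperature=1.0, rng=np.random.default_rng()):
        # Softmax at the chosen temperature.
        z = logits / temperature
        probs = np.exp(z - z.max())
        probs /= probs.sum()
        # Keep only tokens whose probability is at least min_p times the top token's probability.
        probs = np.where(probs >= min_p * probs.max(), probs, 0.0)
        probs /= probs.sum()
        # Sample a token id from the truncated, renormalized distribution.
        return int(rng.choice(len(probs), p=probs))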

XCSme 21 hours ago

How's the pelican?

skerit a day ago

Not the preview crap again. Haven't they tested it enough? When will it be available in Gemini-CLI?

  • CjHuber a day ago

    Honestly I liked 2.5 Pro preview much more than the final version

pk-protect-ai 5 hours ago

It is pointless to ask an LLM to draw an ASCII unicorn these days. Gemini 3 draws one of these (depending on the prompt):

https://www.ascii-art.de/ascii/uvw/unicorn.txt

However, it is amazing how far spatial comprehension has improved in multimodal models.

I'm not sure the below would be properly displayed on HN; you'll probably need to cut and paste it into a text editor.

Prompt: Draw me an ASCII world map with tags or markings for the areas and special places.

Temperature: 1.85

Top-P 0.98

Answer: Edit (replaced with URL) https://justpaste.it/kpow3
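
If you want to try reproducing it, something like the snippet below should work with the Python SDK. This is only a sketch assuming the google-genai package; the model id is a guess, so use whatever AI Studio lists for the preview.

    from google import genai
    from google.genai import types

    client = genai.Client(api_key="YOUR_API_KEY")
    resp = client.models.generate_content(
        model="gemini-3-pro-preview",  # hypothetical id; check AI Studio for the actual one
        contents="Draw me an ASCII world map with tags or markings for the areas and special places.",
        config=types.GenerateContentConfig(temperature=1.85, top_p=0.98),
    )
    print(resp.text)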

thrownaway561 18 hours ago

yea great.... when will I be able to have it dial a number on my google pixel? Seriously... Gemini absolutely sucks on pixel since it can't interact with the phone itself so it can't dial numbers.

kmeisthax 10 hours ago

The most devastating news out of this announcement is that Vending-Bench 2 came out and it has significantly fewer clanker[0] meltdowns than the first one. I mean, seriously? Not even one run where the model tried to stock goods that hadn't arrived yet, only for it to eventually try and fail to shut down the business, and then e-mail the FBI about the $2 daily fee being deducted from the bot?

[0] Fake racial slur for a robot, LLM chatbot, or other automated system

gabriel666smith 16 hours ago

Interestingly, Gemini 3 Pro Thinking clocks in at 2nd place - behind Horizon Alpha, but ahead of any actually-released model checkpoints - on the TvP Benchmark, with a score of 71.8%.

This blows Sonnet 4.5 (64.8%), GPT5.1 (58.7%) and Opus 4.1 (66.3%) away.

An example output:

'''

The cabin was cold. It smelled of pine needles and old dust. Bryan sat on the porch with a glass of scotch balanced on his knee. It was cheap scotch. He missed the good stuff, but he wasn't in Paris anymore, and he certainly wasn't in Beverly Hills.

He checked his watch. It was a bulky thing, black rubber. He checked the Glock on the table next to him. That was black, too.

"I told you," Bryan said. He wasn't talking to anyone. "I told you not to go into the woods."

He was thinking about Kim. He was always thinking about Kim. She was safe, probably. At a mall in Santa Monica. Or maybe a library. But it felt like she was gone. It always felt like she was gone.

The trees at the edge of the clearing shifted. There wasn't any wind. The air was heavy and humid, unusual for October. A bird stopped singing. Then the crickets stopped. It got very quiet.

Bryan took a sip of the scotch. It burned his throat. He liked the burn. It reminded him he was still there.

Three red dots appeared on his chest. They made a small triangle.

Bryan didn't look at them. He set the glass down on the railing. He picked up the phone. It was an old flip phone. He pressed a button.

"I don't know who you are," Bryan said into the phone. "I don't know what you want. If you are looking for ransom, I can tell you I don't have money."

The leaves rustled. A shape began to form out of the empty air. It looked like heat rising off the highway in August. It was tall. Much taller than a man.

"But what I do have," Bryan said, "are a very particular set of skills."

The shape solidified. It was ugly. It had skin like a lizard and dreadlocks that looked like heavy black cables. It wore a metal mask. It made a clicking sound. Click-click-click. Like a Geiger counter.

"Skills I have acquired over a very long career," Bryan said. He stood up. His knees popped. He felt the ache in his lower back. He was getting too old for this.

The thing roared. It wasn't a human sound. It was a wet, guttural noise.

"I will look for you," Bryan said. He put the phone in his pocket. "I will find you."

The creature raised a metal gauntlet. Two jagged blades slid out. They caught the moonlight.

"And I will kill you," Bryan said.

He raised the Glock and fired twice. Pop-pop.

The creature didn't fall. It took a step forward. Green blood, thick and glowing, leaked from its shoulder. It looked like antifreeze. The creature touched the wound, then looked at its hand. It seemed surprised.

Bryan didn't wait. He moved off the porch. He moved fast for a man his age. He went left, toward the woodpile. The creature fired a blast of blue light from its shoulder. It hit the porch swing. The swing exploded. Wood splinters rained down like confetti.

Bryan rolled behind the logs. He breathed in. He breathed out. He could hear the heavy footsteps crunching the dry leaves. Crunch. Crunch.

"You're big," Bryan whispered. "That makes you slow."

He grabbed a splitting maul from the stump. It was heavy. The handle was smooth hickory.

The creature came around the woodpile. It was scanning the trees. It didn't see Bryan down low. Bryan swung the maul. He put his hips into it. He swung it like he was chopping a cord of oak.

The blade hit the creature in the knee. There was a snap. A loud, wet snap.

The thing howled. It fell onto one knee.

Bryan dropped the maul. He stepped inside the creature’s reach. He knew exactly where to hit. The throat. The armpit. The soft spot under the ribs. He hit the creature three times, hard chops with the side of his hand. It was like hitting a brick wall, but he felt something give.

The creature swung its arm back. It hit Bryan in the chest.

Bryan flew backward. He hit the dirt. The wind went out of him. He lay there for a second, staring up at the stars. They looked very far away. He wondered if Lenore was looking at the same stars. Probably not. She was probably sleeping.

He sat up. His ribs hurt. Maybe broken.

The creature was trying to stand. It was clicking again. It tapped something on its wrist. A series of red symbols started flashing. They counted down.

Bryan knew a bomb when he saw one.

"No," Bryan said.

He tackled the thing. He didn't think about it. He just did it. He grabbed the creature’s arm. He twisted the wrist mechanism. He’d seen something like it in Baghdad once. Or maybe Istanbul. The memories ran together now.

He ripped the gauntlet loose. Wires sparked. He threw it as hard as he could into the darkness of the woods.

Three seconds later, there was a flash. A boom. A shockwave that shook the pine needles from the trees.

Silence came back.

The creature lay on the ground. It was breathing shallowly. The green blood was pooling under it. It took off its mask.

The face was hideous. Mandibles. Beady eyes. It looked at Bryan. It said something, a garbled copy of Bryan's own voice.

"...good luck..."

Then it died. It just stopped.

Bryan stood up. He dusted off his pants. He walked back to the porch. The swing was gone. The railing was scorched.

His glass of scotch was still sitting there, untouched. The ice hadn't even melted.

He picked it up. He took a drink. It still tasted cheap.

He took his phone out and looked at it. No service.

"Well," he said.

He went inside the cabin and locked the door. He sat on the couch and waited for the sun to come up. He hoped Kim would call. He really hoped she would call.

'''

otikik 18 hours ago

… agentic …

Meh, not interested already

casey2 a day ago

The first paragraph is pure delusion. Why do investors like delusional CEOs so much? I would take it as a major red flag.

recitedropper a day ago

[flagged]

  • sometimes_all a day ago

    Wow, you weren't wrong...

    • bryanlarsen a day ago

      It's the only comment referencing AGI. Seems wrong to me.

      • recitedropper a day ago

        I'm primarily reacting to the other threads, like the one that leaked the system card early. And, perhaps unfairly, Twitter as well.

        • andy12_ a day ago

          You might not believe this, but there are a lot of people (me included) that were extremely excited about the Gemini 3 release and are pleased to see the SOTA benchmark results, and this is reflected in the comments.

          • recitedropper a day ago

            I definitely believe it--I'm not a total AI hater. The jump on the screen usage benchmark is really exciting in that it might substantially help computer-use agentic workflows.

            That said, I think there is too much a pattern with recent model releases around what appears to me to be astroturfing to get to HN front page. Of course that doesn't preclude many organic comments that are excited too!

            A bit of both always happens. But given how important these model releases are to justify the capex and levels of investment, I think it is pretty clear the various "front pages" of our internet are manipulated. The incentive is just too strong not to.

        • bryanlarsen a day ago

          There are approximately 300 comments on the half dozen or so posts on the front page about Gemini at the moment. 2 threads reference AGI, one of them this one.

          • recitedropper a day ago

            Perhaps I shouldn't have implied an expectation of lots of explicit mentions of "AGI". It is more the general sentiments being expressed, and the extent to which critical takes seem to be quickly buried.

            I'm totally open to being wrong though. Maybe the tech community is just that excited about Gemini 3's release.

            • bryanlarsen 20 hours ago

              HN doesn't seem particularly excited.

              On most of the front pages, negative sentiments have floated to the top, especially the Antigravity pages.

              • recitedropper 20 hours ago

                Not sure if this is agreeing or disagreeing with there being astroturfing.

                But I'd reckon that the negative sentiment at the top, combined with the fact that there have been over eight Gemini 3 posts on the front page recently, is good evidence of manipulation. This might actually be the most posted-about model release this year, and if people were genuinely that excited we wouldn't see so much negative sentiment.

  • 63stack a day ago

    I noticed this as well; you are already downvoted into gray

    • wasabi991011 a day ago

      They're downvoted into grey because the comment complains about the future of this thread before it has even happened. Also it's conspiratorial, without much evidence.

    • recitedropper a day ago

      Inevitable... certainly more so than AGI :)

    • paganel a day ago

      That used to be the case even before, when Alphabet/Apple/Meta were commented on negatively. I used to blame it on the many users here (who also happen to work for those companies) not wanting to see their total comp go down, but this right here I think can squarely be blamed on AI bots.

    • hobofan a day ago

      And now it's flagged.

      I think this is one of HN's biggest weaknesses. If you are a sufficiently large engineering organization with enough employees who pass the self-moderation karma thresholds, you can essentially strike down any significantly critical discussion.

      • zamadatix a day ago

        Without a public moderation log (i.e. even user flags being part of the log), claims like this will always come up. To me it seems more likely that the early commenting users are simply tired of being told they are part of some astroturf campaign, and that if they don't flock to agree with the OP's views it must be taken as more proof.

        I'm sure both happen to some degree; it's just a question of how often it's actual astroturfing vs "a small percentage of active people can't possibly just have different thoughts than me".

jennyholzer 20 hours ago

"AI" benchmarks are and have consistently been lies and misinformation. Gemini is dead in the water.

rvz a day ago

I expect almost no-one to read the Gemini 3 model card. But here is a damning excerpt from the early leaked model card from [0]:

> The training dataset also includes: publicly available datasets that are readily downloadable; data obtained by crawlers; licensed data obtained via commercial licensing agreements; user data (i.e., data collected from users of Google products and services to train AI models, along with user interactions with the model) in accordance with Google’s relevant terms of service, privacy policy, service-specific policies, and pursuant to user controls, where appropriate; other datasets that Google acquires or generates in the course of its business operations, or directly from its workforce; and AI-generated synthetic data.

So your Gmails are being read by Gemini and put into the training set for future models. Oh dear, and Google is already being sued [1] over using Gemini to analyze users' data, which potentially includes Gmail by default.

Where is the outrage?

[0] https://web.archive.org/web/20251118111103/https://storage.g...

[1] https://www.yahoo.com/news/articles/google-sued-over-gemini-...

  • inkysigma a day ago

    Isn't Gmail covered under the Workspace privacy policy, which forbids using it for training data? So I'm guessing that's excluded by the "in accordance" clause.

  • stefs a day ago

    i'm very doubtful gmail mails are used to train the model by default, because emails contain private data and as soon as this private data shows up in the model output, gmail is done.

    "gmail being read by gemini" does NOT mean "gemini is trained on your private gmail correspondence". it can mean gemini loads your emails into a session context so it can answer questions about your mail, which is quite different.

  • recitedropper a day ago

    I'm pretty sure they mention in their various TOSes that they don't train on user data in places like Gmail.

    That said, LLMs are the most data-greedy technology of all time, and it wouldn't surprise me that companies building them feel so much pressure to top each other they "sidestep" their own TOSes. There are plenty of signals they are already changing their terms to train when previously they said they wouldn't--see Anthropic's update in August regarding Claude Code.

    If anyone ever starts caring about privacy again, this might be a way to bring down the crazy AI capex / tech valuations. It is probably possible, if you are a sufficiently funded and motivated actor, to tease out evidence of training data that shouldn't be there based on a vendor's TOS. There is already evidence some IP owners (like NYT) have done this for copyright claims, but you could get a lot more pitchforks out if it turns out Jane Doe's HIPAA-protected information in an email was trained on.

  • Yizahi 21 hours ago

    By the year 2025, I think most HN regulars and IT people in general are so jaded about privacy that this doesn't even surprise anyone. I suspect all Gmail has been analyzed and read since the beginning of the Google age, so nothing really changed; they might as well just admit it.

    Google is betting that moving email and cloud is such a giant hassle that almost no one will do it, and ditching YT and Maps is just impossible.

  • aoeusnth1 a day ago

    This seems like a dubious conclusion. I think you missed this part:

    > in accordance with Google’s relevant terms of service, privacy policy

nextworddev 21 hours ago

It’s over for Anthropic. That’s why Google’s cool with Claude being on Azure.

Also probably over for OpenAI

mihau a day ago

@simonw wen pelican

poemxo 21 hours ago

It's amazing to see Google take the lead while OpenAI worsens their product every release.

alksdjf89243 19 hours ago

Pretty obvious how contaminated this site is with goog employees upvoting nonsense like this.

WXLCKNO 21 hours ago

Valve could learn from Google here

informal007 a day ago

It seems that Google didn't prepare well for the Gemini 3 release and leaked a lot of content early, including the model card earlier today and Gemini 3 showing up on aistudio.google.com.

kachapopopow 19 hours ago

It's joeover for OpenAI and Anthropic. I have been using it for 3 hours now for real work, and gpt-5.1 and sonnet 4.5 (thinking) don't come close.

The token efficiency and context are also mindblowing...

It feels like I am talking to someone who can think instead of a **rider that just agrees with everything you say and then fails at basic changes. gpt-5.1 feels particularly slow and weak in real world applications that are larger than a few dozen files.

Gemini 2.5 felt really weak considering the amount of data and the proprietary TPU hardware that in theory allows them way more flexibility, but Gemini 3 just works and it truly understands, which is something I didn't think I'd be saying for a couple more years.

vlmrun-admin 18 hours ago

https://www.youtube.com/watch?v=cUbGVH1r_1U

Everyone is talking about the release of Gemini 3. The benchmark scores are incredible. But as we know in the AI world, paper stats don't always translate to production performance on all tasks.

We decided to put Gemini 3 through its paces on some standard Vision Language Model (VLM) tasks – specifically simple image detection and processing.

The result? It struggled where I didn't expect it to.

Surprisingly, VLM Run's Orion (https://chat.vlm.run/) significantly outperformed Gemini 3 on these specific visual tasks. While the industry chases the "biggest" model, it’s a good reminder that specialized agents like Orion are often punching way above their weight class in practical applications.

Has anyone else noticed a gap between Gemini 3's benchmarks and its VLM capabilities?

  • acoustics 18 hours ago

    Don't self-promote without disclosure.

irthomasthomas a day ago

I asked it to summarize an article about the Zizians which mentions Yudkowsky SEVEN times. Gemini-3 did not mention him once. Tried it ten times and got zero mention of Yudkowsky, despite him being a central figure in the story. https://xcancel.com/xundecidability/status/19908286970881311...

Also, can you guess which pelican SVG was gemini 3 vs 2.5? https://xcancel.com/xundecidability/status/19908113191723213...

  • stickfigure 21 hours ago

    He's not a central figure in the narrative, he's a background character. Things he created (MIRI, CFAR, LessWrong) are important to the narrative, the founder isn't. If I had to condense the article, I'd probably cut him out too. Summarization is inherently lossy.

    • irthomasthomas 21 hours ago

        > Eliezer Yudkowsky is a central figure in the article, mentioned multiple times as the intellectual originator of the community from which the "Zizians" splintered. His ideas and organizations are foundational to the entire narrative.

      • stickfigure 21 hours ago

        And yet you could eliminate him entirely and the story is still coherent.

        The story isn't about Yudkowsky. At each level of summarization you have to make hard decisions about what to keep. Not every story about the United States needs to mention George Washington.

      • Dilettante_ 20 hours ago

        You're absolutely right! The AI said it, so it must be true!

        • irthomasthomas 20 hours ago

          At least read what you respond to... Imagine thinking Yudkowsky was NOT a central figure in the Zizians story.

          • Dilettante_ 20 hours ago

            You literally quoted the LLMs output verbatim as your proof.

            Edit: And upon skimming the article at the points where Yudkowsky's name is mentioned, I 100% agree with stickfigure.

            I challenge you to name one way in which the story falls apart without the mention of Yudkowsky.

            • irthomasthomas 19 hours ago

              It sounds like both of you are unfamiliar with the link between the Zizians and Yudkowsky. So let us just return to the discussion of Gemini 3: do you think the model did a bad job, then, in its second response?

              • Dilettante_ 19 hours ago

                It literally does not matter how much they are connected out here in reality, the AI was to summarize the information in the article and that is exactly what it did.

                >do you think the model did a bad job, then, in its second response

                Yes, very obviously it told you what you wanted to hear. This is behavior that should not be surprising to you.

                • irthomasthomas 18 hours ago

                  Why do you think I obviously wanted to hear that?

                  • Dilettante_ 18 hours ago

                    It's implicit in your prompt!

                      "Wtf - no mention of Yudkowsky?"
                    
                    Also that is the position you've been defending this whole thread. This whole conversation is happening because you believe Yudkowsky is an important figure to the story.

                    • irthomasthomas 17 hours ago

                      Here's another attempt:

                        llm --cid 01kabxtjq10exgk56yf802028f "I notice you did not mention Yudkowsky?" --no-log -m gem3-t1

                      Based on the text provided, Eliezer Yudkowsky is a central background figure to this story, serving as the intellectual progenitor of the movement from which the Zizians splintered.

                      Here is specifically how he features in the article:

                      * The Founder: Yudkowsky is identified as the founder of the Machine Intelligence Research Institute (MIRI) and the Center for Applied Rationality (CFAR), as well as the founder of the LessWrong forum. These are the institutions the Zizians originally belonged to and eventually declared war against.

                      * The Source Code: The Zizians based their radical worldview on Yudkowsky’s concepts, specifically "timeless decision theory." However, they believed they were applying his logic more rigorously than he was. They became convinced that Yudkowsky’s organizations had betrayed these principles (specifically regarding a rumor about MIRI paying blackmail to cover up a scandal), which they viewed as a moral failing that justified their rebellion.

                      • Dilettante_ 41 minutes ago

                        I've done as much fiddling and prompting to LLMs about that article as I cared to do under these circumstances and I have to concede the point about you getting 'the answer you wanted' out: The chatbots were quite insistent that Yudkowski is central to the story, even when I pulled out the following: "Somebody is arguing Yudkowsky is a central figure in this article, is that accurate?"

                        They are *wrong*, and provided exactly the same immaterial evidence as you did in this thread(I still insist that the article suffers zero damage if you remove Yudkowsky from it and instead only mention the institutions and concepts that stem from him), but with all the behavior I've seen now, the summary which was the initial issue of this thread should have included him.

                        [What I would've really liked to do was to prompt for another person of equal non-prominence who was in the article, but not in the summary, and see what comes up. But I sure am not reading the 80-102 minute article just for this and we're unlikely to find an agreement about the 'equal non-prominence' part if I challenged you to pick one.]

  • gregsadetsky 21 hours ago

    Interesting, yeah! Just tried "summarize this story and list the important figures from it" with Gemini 2.5 Pro and 3 and they both listed 10 names each, but without including Yudkowsky.

    Asking the follow up "what are ALL the individuals mentioned in the story" results in both models listing ~40 names and both of those lists include Yudkowsky.

  • briga 21 hours ago

    Maybe it has guard rails against such things? That would be my main guess on the Zizian one.