There's someone with this comment in every thread. Meanwhile, nobody bothers to answer because they're busy getting value from it. Please take the time to learn; it will give you value.
I’m a consultant. Having looked at several enterprises, there's a lot of work going into building things that don't really work.
The bigger the ambition, the harder they're failing. Some well designed, isolated use cases are OK. Mostly things about listening to and summarizing text to aid humans.
I have yet to see a successful application that is generating good content. IMO, replacing the first draft of content creation and having experts review and fix it is, like, the stupidest strategy you can pick. The people you replace are the people at the bottom of the pyramid who are supposed to do this work to upskill and become domain experts so they can later review stuff. If they're no longer needed, you're going to one day lose your reviewer, and with it, the ability to assess your generated drafts. It's a footgun.
I mean, no, not generally. But the success rate of other tools is much higher.
A lot of companies are trying to build these general purpose bots that just magically know everything about the company and have these big knowledge bases, but they just don't work.
I'm someone who generally was a "doubter", but I've dramatically softened my stance on this topic.
Two things:
I was casually watching Andreas Kling's streams on Ladybird development (where he was developing a JIT compiler for JS) and was blown away at the accuracy of completions (and the frequency of those completions)
Prior to this, I'd only ever copypasta'd code from ChatGPT output on occasion.
I started adopting the IDE/Editor extensions and prototyping small projects.
There are now small tools and utilities I've written that I'd not have written otherwise, or that would have taken twice the time had I not used these tools.
With that said, they'd be of no use without oversight, but as a productivity enhancement, the benefits are enormous.
> Meanwhile, no one answers this because they are getting value.
You're literally doing the same thing you're accusing others of. Every HN thread is full of AI boosters claiming AI to be the future with no backing evidence.
Riddle me this. If all these people are "getting value", why are all these companies losing horrendous amounts of money? Why has nobody figured out how to be profitable?
> Please take the time to learn, it will give you value.
Yeah, yeah, just prompt engineer harder. That'll make the stochastic parrot useful. Anyone who has criticism just does so because they're dumb and you're smart. Same as it always was. Everyone opposed to the metaverse just didn't get it bro. You didn't get NFTs bro. You didn't get blockchain bro.
None of these previous bubbles had money in them (beyond scamming idiots). If AI wants to prove it's not another empty tech bubble, pay up. Show me the money. It should be easy, if it's automating so many expensive man-hours of labour. People would be lining up to pay OpenAI.
> Riddle me this. If all these people are "getting value", why are all these companies losing horrendous amounts of money? Why has nobody figured out how to be profitable?
While I agree that LLMs are not currently working great for most envisioned use cases, this premise is not a good argument. Large LLM providers are not trying to be profitable at the moment. They're trying to grow, and that's pretty sensible.
Uber was the poster child of this, and for all the mockery, Uber is now, without qualification, a profitable company.
> Why has nobody figured out how to be profitable?
From what I've seen claimed about OpenAI finances, this is easy: It's a Red Queen's race — "it takes all the running you can do, to keep in the same place".
If their financial position was as simple as "we run this API, we charge X, the running cost is Y", then they're already at X > Y.
But if that was all OpenAI were actually doing, they'd have stopped developing new versions or making the existing models more efficient some time back, while the rest of the industry kept improving their models and lowering their prices, and they'd be irrelevant.
> People would be lining up to pay OpenAI.
They are.
Not that this is either sufficient or necessary to actually guarantee anything about real value. For lack of sufficiency: people collectively paid a lot for cryptocurrencies and NFTs, too (and, before then and outside tech, homeopathic tinctures and sub-prime mortgages). For lack of necessity: there are plenty of free-to-download models.
I get a huge benefit even just from the free chat models. I could afford to pay for better models, but why bother when free is so good? Every time a new model comes out, the old paid option becomes the new free option.
That's what puzzles me now. Everyone with a semblance of engineering expertise knows that if you start with a tool and try to find a problem it could solve, you're doing it wrong. The right way is the opposite: you start with a problem and find the best tool to solve it, and if that's the new shiny tool, so be it, but most of the time it's not.
Except the whole tech world starting with the CEOs seems to do it the "wrong" way with LLMs. People and whole companies are encouraged to find what these things might be actually useful for.
• Build toys that would otherwise require me to learn new APIs (I can read python, but it's not my day job)
• Learn new things like OpenSCAD
• To improve my German
• Learn about the world by allowing me to take photos of things in this world that I don't understand and ask them a question about the content, e.g. why random trees have bands or rectangles of white paint on them
• Help with shopping, by taking a photo of the supermarket I happen to be in and asking them where I should look for some item I can't find
• Help with meal prep, by allowing me to get a recipe based on what food and constraints I've got at hand rather than the traditional method of "if you want x, buy y ingredients"
Even if they're just an offline version of Wikipedia or Google, they're already a more useful interface for the same actual content.
For a company that sees itself as the undisputed leader and that wants to raise $7 trillion to build fabs, they deserve some of the heaviest levels of scrutiny in the world.
If OpenAI's investment prospectus relies on them reaching AGI before the tech becomes commoditized, everyone is going to look for that weakness.
Everyone's comparing o1 and Claude, but in my experience neither really works well enough for coding to justify paying for them. What I really want is a mode where they ask clarifying questions, ideally many of them, before spitting out an answer. That would go a long way toward producing something with more value than an auto-complete.
Just tell it to do that and it will. Whenever I ask an AI for something and I'm pretty sure it doesn't have all the context I literally just say "ask me clarifying questions until you have enough information to do a great job on this."
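If you want to bake that in rather than typing it every time, here's a minimal sketch using the current openai Python SDK. The model name and the exact instruction wording are just placeholders, not a recommendation; it assumes OPENAI_API_KEY is set in the environment.

    # Sketch: force a clarifying-questions pass before any answer.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    SYSTEM = (
        "Before answering, ask me clarifying questions until you have enough "
        "information to do a great job. Only answer once I've replied to them."
    )

    def ask(question: str) -> str:
        resp = client.chat.completions.create(
            model="gpt-4o",  # placeholder model name
            messages=[
                {"role": "system", "content": SYSTEM},
                {"role": "user", "content": question},
            ],
        )
        return resp.choices[0].message.content

    print(ask("Refactor my auth module to support SSO."))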
And that chain of prompts, combined with the improved CoT reasoner, would produce much better results. More in line with what the coming agentic era promises.
Yes. You can only do so much with the information you get in. The ability to ask good questions, not just of itself in internal monologue style, but actually of the user, would fundamentally make it better since it can get more information in.
As it is now, it has a bad habit of, if it can't answer the question you asked, instead answering a similar-looking question which it thinks you may have meant. That is of course a great strategy for benchmarks, where you don't earn any points for saying you don't know. But it's extremely frustrating for real users, who didn't read their question from a test suite.
I know multiple people that carefully prompt to get that done. The model outputs tokens in order and can't turn around, so you need to make sure the clarifying questions strictly come before the answer; otherwise the system can and will come up with post-hoc "reasoning".
Just today I got Claude to convert a company’s PDF protocol specification into an actual working python implementation of that protocol. It would have been uncreative drudge work for a human, but I would have absolutely paid a week of junior dev time for it. Instead I wrote it alongside AI and it took me barely more than an hour.
The best part is, I’ve never written any (substantial) python code before.
I have to agree. It's still a bit hit or miss, but the hits are a huge time and money saver especially in refactoring. And unlike what most of the rather demeaning comments in those HN threads state, I am not some 'grunt' doing 'boilerplate work'. I mostly do geometry/math stuff, and the AIs really do know what they're talking about there sometimes. I don't have many peers I can talk to most of the time, and Claude is really helping me gather my thoughts.
That being said, I definitely believe it's only useful for isolated problems. Even with Copilot, I feel like the AIs just lack a bigger context of the projects.
Another thing that helped me was designing an initial prompt that really works for me. I think most people just expect to throw in their issue and get a tailored solution, but that's just not how it works in my experience.
It would seem you don't care too much about verifying its output or about its correctness. If you did, it wouldn't take you just an hour. I guess you'll let correctness be someone else's problem.
In my intuition it makes sense that there is going to be some significant friction in LLM development going forward. We're talking about models that will cost upwards of $1bn to train. Save for a technological breakthrough, GPT-6/7 will probably have to wait for hardware to catch up.
I think the main bottleneck right now is training data - they've basically exhausted all public sources of data, so they have to either pay humans to generate new data from scratch or pay for the reasoning models to generate (less useful) synthetic training data. The next bottleneck is hardware, and the least important bottleneck is money.
What I find odd is that o1 doesn't support attaching text documents to chats the way 4o does. For a model that specializes in reasoning, reading long documents seems like a natural feature to have.
If Sama ever reads this: I have no idea why users don't seem to focus on this, but it would be really good to prioritise being able to select which model you use with the custom myGPTs. I know this may be hard or not possible without recreating them, but as far as I can tell it still isn't possible.
I don't think most customers realise how much better the models work with custom GPTs.
"When using custom instructions or files, only GPT-4o is available". Straight out of the ChatGPT web interface when you try to select which model you want to use.
Train for what? For making videos? Train on people's comments? There's a lot of garbage and AI slop on YouTube; how would that be sifted out? I think there's more value here on HN in terms of training data, but even then, to what avail?
YouTube is such a great multimodal dataset—videos, auto-generated captions, and real engagement data all in one place. That’s a strong starting point for training, even before you filter for quality. Microsoft’s Phi-series models already show how focusing on smaller, high-quality datasets, like textbooks, can produce great results. You could totally imagine doing the same thing with YouTube by filtering for high-quality educational videos.
Down the line, I think models will start using video generation as part of how they “think.” Picture a version of GPT that works frame by frame—ask it to solve a geometry problem, and it generates a sequence of images to visualize the solution before responding. YouTube’s massive library of visual content could make something like that possible.
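Even a crude first pass over video metadata gets you surprisingly far before any expensive model-based scoring. A toy sketch of that kind of filter; the field names, keywords, and thresholds are made up for illustration, not from any real pipeline:

    from dataclasses import dataclass

    @dataclass
    class Video:
        title: str
        duration_s: int
        views: int
        likes: int
        has_captions: bool

    EDU_HINTS = ("lecture", "tutorial", "explained", "course", "proof")

    def looks_educational(v: Video) -> bool:
        engagement = v.likes / max(v.views, 1)
        return (
            v.has_captions
            and 120 <= v.duration_s <= 3600           # skip shorts and marathons
            and engagement > 0.01                     # crude usefulness signal
            and any(h in v.title.lower() for h in EDU_HINTS)
        )

    videos = [
        Video("Linear algebra explained", 900, 200_000, 9_000, True),
        Video("CRAZY prank compilation", 400, 5_000_000, 20_000, False),
    ]
    print([v.title for v in videos if looks_educational(v)])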
How about just an updated GPT-4o with all the newer data? It would go a long way. Currently it doesn't know about anything after Oct 2023 (without having to do a web search).
What we can reasonably assume from statements made by insiders:
They want a 10x improvement from scaling and a 10x improvement from data and algorithmic changes
The sources of public data are essentially tapped
Algorithmic changes will be an unknown to us until they release, but from published research this remains a steady source of improvement
Scaling seems to stall if data is limited
So with all of that taken together, the logical step is to figure out how to turn compute into better data to train on. Enter strawberry / o1, and now o3
They can throw money, time, and compute at thinking about and then generating better training data. If the belief is that N billion new tokens of high quality training data will unlock the leap in capabilities they’re looking for, then it makes sense to delay the training until that dataset is ready
With o3 now public knowledge, imagine how long it’s been churning out new thinking at expert level across every field. OpenAI’s next moat may be the best synthetic training set ever.
At this point I would guess we get 4.5 with a subset of this - some scale improvement, the algorithmic pickups since 4 was trained, and a cleaned and improved core data set but without risking leakage of the superior dataset
When 5 launches, we get to see what a fully scaled version looks like with training data that outstrips average humans in almost every problem space
Then the next o-model gets to start with that as a base and reason? It's likely to be remarkable
"With o3 now public knowledge, imagine how long it’s been churning out new thinking at expert level across every field."
I highly doubt that. o3 is many orders of magnitude more expensive than paying subject matter experts to create new data. It just doesn't make sense to pay six figures in compute to get o3 to make data a human could make for a few hundred dollars.
Yes, I think they had to push this reveal forward because their investors were getting antsy with the lack of visible progress to justify continuing rising valuations. There is no other reason a confident company making continuous rapid progress would feel the need to reveal a product that 99% of companies worldwide couldn't use at the time of the reveal.
That being said, if OpenAI is burning cash at lightspeed and doesn't have to publicly reveal the revenue they receive from certain government entities, it wouldn't come as a surprise if they let the government play with it early on in exchange for some much needed cash to set on fire.
EDIT: The fact that multiple sites seem to be publishing GPT-5 stories similar to this one leads one to conclude that the o3 benchmark story was meant to counter the negativity from this and other similar articles that are just coming out.
Someone needs to dress up Mechanical Turk and repackage it as an AI company…..
That’s basically every AI company that existed before GPT3
That’s an interesting idea. What if OpenAI funded medical research initiatives in exchange for exclusive training rights on the research?
It would be orders of magnitude cheaper to outsource to humans.
Not as sexy to investors though
> With o3 now public knowledge, imagine how long it’s been churning out new thinking at expert level across every field. OpenAI’s next moat may be the best synthetic training set ever.
Even taking OpenAI and the benchmark authors at their word, they said it consumes at least tens of dollars per task to hit peak performance. How much would it cost to have it produce a meaningfully large training set?
That's the public API price isn't it?
There is no public API for o3 yet; those are the numbers they revealed in the ARC-AGI announcement. Even if they were public API prices, we can't assume they're making a profit on them for as long as they're billions in the red overall every year; it's entirely possible that the public API prices are less than what OpenAI is actually paying.
> OpenAI’s next moat
I don't think oai has any moat at all. If you look around, QwQ from Alibaba is already pushing o1-preview performance. I think oai is only ahead by 3-6 months at most.
I’m curious how, if at all, they plan to get around compounding bias in synthetic data generated by models trained on synthetic data.
Everyone's obsessed with new training tokens... It doesn't need to be more knowledgeable, it just needs to practice more. Ask any student: practice is synthetic data.
That leads to overfitting in ML land, which hurts overall performance.
We know that unique data improves performance.
These LLM systems are not students…
Also, which students graduate and are immediately experts in their fields? Almost none.
It takes years of practice in unique, often one-off, situations after graduation for most people to develop the intuition needed for a given field.
It's overfitting when you train too large a model on too many details. Rote memorization isn't rewarding.
The more concepts the model manages to grok, the more nonlinear its capabilities will be: we don't have a data problem, we have an educational one.
Claude 3.5 was safety trained by Claude 3.0, and it's more coherent for it. https://www.anthropic.com/news/claudes-constitution
Overfitting can be caused by a lot of different things. Having an overabundance of one kind of data in a training set is one of those causes.
It’s why many pre-processing steps for image training pipelines will add copies of images with random rotations, varying amounts of blur, and different crops.
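For concreteness, here's roughly what that augmentation step looks like with torchvision, as a sketch assuming a standard PyTorch image pipeline; the parameters are arbitrary examples, not tuned values.

    # Each epoch the model sees randomly rotated, blurred, and re-cropped
    # variants of the same images, which reduces overfitting to any one
    # presentation of the data.
    from torchvision import transforms

    augment = transforms.Compose([
        transforms.RandomRotation(degrees=15),                    # "weird rotations"
        transforms.GaussianBlur(kernel_size=3, sigma=(0.1, 2.0)), # varying blur
        transforms.RandomResizedCrop(size=224, scale=(0.7, 1.0)), # different crops
        transforms.ToTensor(),
    ])
    # e.g. pass transform=augment to an ImageFolder dataset (path hypothetical)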
> The more concepts the model manages to grok, the more nonlinear its capabilities will be
These kinds of hand-wavy statements like “practice,” “grok,” and “nonlinear its capabilities will be” are not very constructive as they don’t have solid meaning wrt language models.
So earlier when I was referring to compounding bias in synthetic data I was referring to a bias that gets trained on over and over and over again.
That leads to overfitting.
> These kinds of hand-wavy statements like “practice,” “grok,” and “nonlinear its capabilities will be” are not very constructive as they don’t have solid meaning wrt language models.
So, here's my hypothesis, as someone who is adjacent to ML but hasn't trained DNNs directly:
We don't understand how they work, because we didn't build them. They built themselves.
At face value this can be seen as an almost spiritual position, but I am not a religious person and I don't think there's any magic involved. Unlike traditional models, the behavior of DNNs is based on random changes that failed up. We can reason about their structure, but only loosely about their functionality. When they get better at drawing, it isn't because we taught them to draw. When they get better at reasoning, it isn't because the engineers were better philosophers. Given this, there will not be a direct correlation between inputs and capabilities, but some arrangements do work better than others.
If this is the case, higher-order capabilities should continue to increase with training cycles, as long as those cycles are performed in ways that don't interfere with what has already been successfully learned. People lamented the loss of capability that GPT-4 suffered as they increased safety. I think Anthropic has avoided this by choosing a less damaging way to tune a well-performing model.
I think these ideas are supported by Wolfram's reduction of the problem at https://writings.stephenwolfram.com/2024/08/whats-really-goi...
Your whole argument falls apart at
> We don't understand how they work, because we didn't build them. They built themselves.
We do understand how they work; we did build them. The mathematical foundations of these models are sound. The statistics behind them are well understood.
What we don’t exactly know is which parameters correspond to which results, as that differs across models.
We work backwards to see which parts of the network seem to relate to what outcomes.
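To make the "well understood" part concrete: the pre-training objective itself is nothing exotic, just next-token cross-entropy over the corpus, where x_t are the tokens and theta the model parameters. What's opaque is the billions of fitted parameters, not the math of the objective:

    \mathcal{L}(\theta) = -\sum_{t} \log p_\theta\left(x_t \mid x_{<t}\right)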
> When they get better at drawing, it isn't because we taught them to draw. When they get better at reasoning, it isn't because the engineers were better philosophers.
Isn’t this the exact opposite of reality?
They get better at drawing because we improve their datasets, topologies, and training methods, and in doing so, teach them to draw.
They get better at reasoning because the engineers and data scientists building training sets do get better at philosophy.
They study what reasoning is and apply those learnings to the datasets and training methods.
That’s how CoT came about early on.
Synthetic data is fine if you can ground the model somehow. That's why o1/o3's improvements are mostly in reasoning, maths, etc.: you can easily tell whether the data is wrong or not.
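A minimal sketch of what that grounding can look like for math-style data: generate candidate question/answer pairs, then keep only the ones a cheap programmatic check can verify. The "generator" below is a toy stand-in, not how o1/o3 actually produce data.

    import random

    def generate_example():
        a, b = random.randint(1, 99), random.randint(1, 99)
        claimed = a * b + random.choice([0, 0, 0, 1])   # sometimes wrong on purpose
        return {"question": f"What is {a} * {b}?", "answer": claimed, "a": a, "b": b}

    def verified(ex) -> bool:
        # Cheap, exact ground truth the generator can't fake
        return ex["answer"] == ex["a"] * ex["b"]

    dataset = [ex for ex in (generate_example() for _ in range(1000)) if verified(ex)]
    print(len(dataset), "verified examples kept")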
I completely don't understand the use for synthetic data. What good is it to train a model basically on itself?
The value of synthetic data relies on having non-zero signal about which generated data is "better" or "worse". In a sense, this is what reinforcement learning is about: generate some data, have that data scored by some evaluator, and then feed the data back into the model with higher weight on the better stuff and lower weight on the worse stuff.
The basic loop is: (i) generate synthetic data, (ii) rate synthetic data, (iii) update model to put more probability on better data and less probability on worse data, then go back to (i).
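A toy version of that loop, just to make the generate, rate, reweight shape concrete: the "model" here is only a probability table over phrases and the "evaluator" a hand-written score, both stand-ins rather than anything a real lab would use.

    import random

    model = {"the sky is blue": 1.0, "the sky is green": 1.0, "2+2=4": 1.0, "2+2=5": 1.0}

    def evaluator(sample: str) -> float:
        # Stand-in for a learned reward model or a programmatic checker
        return 1.0 if sample in ("the sky is blue", "2+2=4") else 0.0

    def sample_from(model, k=50):
        phrases, weights = zip(*model.items())
        return random.choices(phrases, weights=weights, k=k)

    for step in range(20):
        batch = sample_from(model)                    # (i) generate synthetic data
        for s in batch:                               # (ii) rate it
            r = evaluator(s)
            model[s] *= 1.1 if r > 0 else 0.9         # (iii) reweight toward better data

    total = sum(model.values())
    print({k: round(v / total, 3) for k, v in model.items()})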
Thanks, that makes a lot more sense.
Counterpoint: o1-Pro is insanely good -- subjectively, it's as far above GPT4 as GPT4 was above 3. It's almost too good. Use it properly for an extended period of time, and one begins to worry about the future of one's children and the utility of their schooling.
o3, by all accounts, is better still.
Seems to me that things are progressing quickly enough.
Not sure what you are using it for, but it is terrible for me for coding; claude beats it always and hands down. o1 just thinks forever to come up with stuff it already tried the previous time.
People say that's just prompting without pointing to real million line+ repositories or realistic apps to show how that can be improved. So I say they are making todo and hello world apps and yes, there it works really well. Claude still beats it, every.. single.. time..
And yes, I use the Pro version of all of them, and yes, I do assume coding is done for most people. Become a plumber or electrician or carpenter.
That's so weird; it seems like everybody here prefers Claude.
I’ve been using Claude and OpenAI in Copilot, and I find even 4o seems to understand the problem better. o1 definitely seems to get it right more often for me.
I try to sprinkle 'for us/me' everywhere as much as I can; we work on LoB/ERP apps mostly. These are small frontends to massive multi-million-line backends. We carved out a niche by providing the frontends on these backends live at the client's office, via a business consultant of ours: they simply solve UX issues for the client on top of a large ERP by using our tool and prompting. Everything looks modern, fresh and nice, unlike basically all the competitors in this space. It's fast and no frontend people are needed for it; the backend is another system we built, which takes a lot longer of course, as those are complex business rules.
Both Claude and o1 turn up something that looks similar, but only the Claude version will work and be, after less prompting, correct. I don't have shares in either and I want open source to win; we have all-open (more open) solutions doing the same queries and we evaluate all of them, but Claude just wins. We did manage big wins even with OpenAI davinci in 2022 (or so; before ChatGPT), but this is a massive boost, allowing us to upgrade most people to business consultant and have them build with clients in real time, while the tech guys, including me, manually add tests and proofs (where needed) to know if we are actually fine.
It works so much better than the slog with clients before; people are so bad at explaining what they need, it was slowly driving me insane after doing it for 30+ years.
They're both okay for coding, though for my use cases (which are niche and involve quite a lot of mathematics and formal logic) o1/o1-Pro is better. It seems to have a better native grasp of mathematical concepts, and it can even answer very difficult questions from vague inputs, e.g.: https://chatgpt.com/share/676020cb-8574-8005-8b83-4bed5b13e1...
Claude also has a better workflow UI. It’ll maintain conversation context while opening up new windows to present code suggestions.
When I was still subscribing to OpenAI (about 4 months ago) this didn’t exist.
It exists as of last week with Canvas.
Different languages maybe? I find Sonnet v2 to be lacking in Rust knowledge compared to 4o 11-20, but excelling at Python and JS/TS. O1's strong side seems to be complex or quirky puzzle-like coding problems that can be answered in a short manner, it's meh at everything else, especially considering the price. Which is understandable given its purpose and training, but I have no use for it as that's exactly the sort of problem I wouldn't trust an LLM to solve.
Sonnet v2 in particular seems to be a bit broken with its reasoning (?) feature. The one where it detects it might be hallucinating (what's even the condition?) and reviews the reply, reflecting on it. It can make it stop halfway into the reply and decide it wrote enough, or invent some ridiculous excuse to output a worse answer. Annoying, although it doesn't trigger too often.
I find that o1 and Sonnet 3.5 are good and bad quite equally on different things. That's why I keep asking both the same coding questions.
We do the same (all requests go to o1, Sonnet and Gemini, and we store the results to compare later), automatically, for our research: Claude always wins. Even with specific prompting on both platforms. Especially for frontend, o1 really seems terrible.
Every time I try Gemini, it's really subpar. I found that qwen2.5-coder-32b-instruct can be better.
Also, for me it's 50/50 between Sonnet and o1, though I'm not 100% sure about it; I think o1 is better with longer and more complicated (C++) code and debugging. At least from my brief testing. Also, OpenAI models seem to be more verbose. Sometimes that's better, where I'd like additional explanation of chosen fields in a SQL schema; sometimes it's too much.
EDIT: Just asked both o1 and Sonnet 3.5 the same QML coding question, and Sonnet 3.5 succeeded, o1 failed.
Wins? What does this mean? Do you have any results? I see the claim that Claude is better for coding a lot, but I've been using it alongside Gemini 2.0 Flash and o1, and it sure doesn't seem like it.
Claude is trained on principles. GPT is trained on billions of edge cases. Which student do you prefer?
I keep reading this on HN so I believe it has to be true in some ways, but I don't really feel like there is any difference in my limited use (programming questions or explaining some concepts).
If anything I feel like it's all been worse compared to the first release of ChatGPT, but I might be wearing rose colored glasses.
It’s the same for me. I genuinely don’t understand how I can be having such a completely different experience from the people who rave about ChatGPT. Every time I’ve tried it’s been useless.
How can some people think it’s amazing and has completely changed how they work, while for me it makes mistakes that a static analyser would catch? It’s not like I’m doing anything remarkable, for the past couple of months I’ve been doing fairly standard web dev and it can’t even fix basic problems with HTML. It will suggest things that just don’t work at all and my IDE catches, it invents APIs for packages.
One guy I work with uses it extensively and what it produces is essentially black boxes. If I find a problem with something “he” (or rather ChatGPT) has produced it takes him ages to commune with the machine spirit again to figure out how to fix it, and then he still doesn’t understand it.
I can’t help but see this as a time-bomb, how much completely inscrutable shite are these tools producing? In five years are we going to end up with a bunch of “senior engineers” who don’t actually understand what they’re doing?
Before people cry “o tempora o mores” at me and make parallels with the introduction of high-level languages, at least in order to write in a high-level language you need some basic understanding of the logic that is being executed.
> How can some people think it’s amazing and has completely changed how they work, while for me it makes mistakes that a static analyser would catch?
There are a lot of code monkeys working on boilerplate code, these people used to rely on stack overflow and now that chatgpt is here it's a huge improvement for them
If you work on anything remotely complex or which hasn't been solved 10 times on stack overflow chatgpt isn't remotely as useful
I found it very useful for writing a lexer and parser for a search DSL and React component recently:
https://github.com/williamcotton/search-input-query
The first time I tried it, I asked it to find bugs in a piece of very well tested C code.
It introduced an off-by-one error by miscounting the number of arguments in an sprintf call, breaking the program. And then proceeded to fail to find that bug that it introduced.
> How can some people think it’s amazing and has completely changed how they work, while for me it makes mistakes that a static analyser would catch? It’s not like I’m doing anything remarkable, for the past couple of months I’ve been doing fairly standard web dev and it can’t even fix basic problems with HTML.
Part of this is, I think, anchoring and expectation management: you hear people say it's amazing and wonderful, and then you see it fall over and you're naturally disappointed.
My formative years started off with Commodore 64 basic going "?SYNTAX ERROR" from most typos plus a lot of "I don't know what that means" from the text adventures, then Metrowerks' C compiler telling me there were errors on every line *after but not including* the one where I forgot the semicolon, then surprises in VisualBasic and Java where I was getting integer division rather than floats, then the fantastic oddity where accidentally leaning on the option key on a mac keyboard while pressing minus turns the minus into an n-dash which looked completely identical to a minus on the Xcode default font at the time and thus produced a very confusing compiler error…
So my expectations have always been low for machine generated output. And it has wildly exceeded those low expectations.
But the expectation management goes both ways, especially when the comparison is "normal humans" rather than "best practices". I've seen things you wouldn't believe...
Entire files copy-pasted line for line, "TODO: deduplicate" and all,
20 minute app starts passed off as "optimized solutions."
FAQs filled with nothing but Bob Ross quotes,
a zen garden of "happy little accidents."
I watched iOS developers use UI tests
as a complete replacement for storyboards,
bi-weekly commits, each a sprawling novel of despair,
where every change log was a tragic odyssey.
Google Spreadsheets masquerading as bug trackers,
Swift juniors not knowing their ! from their ?,
All those hacks and horrors… lost in time,
Time to deploy.
(All true, and all pre-dating ChatGPT).
> It will suggest things that just don’t work at all and my IDE catches, it invents APIs for packages.
Aye. I've even had that with models forgetting the APIs they themselves have created, just outside the context window.
To me, these are tools. They're fantastic tools, but they're not something you can blindly fire-and-forget…
…fortunately for me, because my passive income is not quite high enough to cover mortgage payments, and I'm looking for work.
> In five years are we going to end up with a bunch of “senior engineers” who don’t actually understand what they’re doing?
Yes, if we're lucky.
If we're not, the models keep getting better and we don't have any "senior engineers" at all.
The ones who use it extensively are the same that used to hit up stackoverflow as the first port of call for every trivial problem that came their way. They're not really engineers, they just want to get stuff done.
Same here: on every release from OpenAI or Anthropic I keep reading how the new model is so much better (insert hyperbole here) than the previous one, yet when I use it, it feels mostly the same as last year.
I'd say the same. I've tried a bunch of different AI tools, and none of them really seem all that helpful.
One use-case: They help with learning things quickly by having a chat and asking questions. And they never get tired or emotional. Tutoring 24/7.
They also generate small code or scripts, as well as automate small things, when you're not sure how, but you know there's a way. You need to ensure you have a way to verify the results.
They do language tasks like grammar-fixing, perfect translation, etc.
They're 100 times easier and faster than search engines, if you limit your uses to that.
They can't help you learn what they don't know themselves.
I'm trying to use them to read historical handwritten documents in old Norwegian (Danish, pretty much). Not only do they not handle the German-style handwriting, but what they spit out looks like the sort of thing GPT-2 would produce if you asked it to write Norwegian (only slightly better than the Muppets' Swedish Chef's Swedish). It seems the experimental tuning has made it worse at the task I most desperately want to use it for.
And when you think about it, how could it not overfit in some sense, when trained on its own output? No new information is coming in, so it pretty much has to get worse at something to get better at all the benchmarks.
> perfect translation
Hah, no. They're good, but they definitely make stuff up when the context gets too long. Always check their output, just the same as you already note they need for small code and scripts.
If you've ever used any enterprise software for long enough, you know the exact same song and dance.
They release version Grand Banana. Purported to be approximately 30% faster with brand new features like Algorithmic Triple Layering and Enhanced Compulsory Alignment. You open the app. Everything is slower, things are harder to find and it breaks in new, fun ways. Your organization pays a couple hundred more per person for these benefits. Their stock soars, people celebrate the release and your management says they can't wait to see the improvement in workflows now that they've been able to lay off a quarter of your team.
Have there been improvements in LLMs over time? Somewhat, and most of them were concentrated at the beginning (because they siphoned up a bunch of data in a dubious manner). Now it's just part of their sales cycle: keep pumping up numbers while no one sees any meaningful improvement.
O1 is effective, but it’s slow. I would expect GPT-5 and a mini variant to work as quickly as the 4-series models.
I had a 30 min argument with o1-pro where it was convinced it had solved the halting problem. Tried to gaslight me into thinking I just didn’t understand the subtlety of the argument. But it’s susceptible to appeal to authority, and when I started quoting snippets of textbooks and MathOverflow it finally relented and claimed there had been a “misunderstanding”. It really does argue like a human though now...
I had a similar experience with regular o1 about an integral that was divergent. It was adamant that it wasn't, and would respond to any attempt at persuasion with variants of "it's a standard integral" with a "subtle cancellation". When I asked for any source for this standard integral, it produced references to support its argument that existed but didn't actually contain the integral. When I told it the references didn't have the result, it backpedalled (gaslighting!) to "I never told you they were in there". When I pointed out that in fact it had, it insisted this was just a "misunderstanding". It only relented when I told it Mathematica agreed the integral was divergent. It still insisted it had never said that the books it pointed to contained this (false, nonsensical) result.
This was new behaviour for me to see in an LLM. Usually the problem is these things would just fold when you pushed back. I don't know which is better, but being this confidently wrong (and "lying" when confronted with it) is troubling.
What do you use it for?
The world is figuring out how to make this technology fit and work and somehow this is "behind" schedule. It's almost comical.
Reminds me of this Louis CK joke:
"I was on an airplane and there was high-speed Internet on the airplane. That's the newest thing that I know exists. And I'm sitting on the plane and they go, open up your laptop, you can go on the Internet.
And it's fast, and I'm watching YouTube clips. It's amazing. I'm on an airplane! And then it breaks down. And they apologize, the Internet's not working. And the guy next to me goes, 'This is bullshit.' I mean, how quickly does the world owe him something that he knew existed only 10 seconds ago?"
https://www.youtube.com/watch?v=me4BZBsHwZs
The investors need their returns now!
Soon, all the middle class jobs will be converted to profits for the capital/data center owners, so they have to spend while they can before the economy crashes due to lack of spending.
People who say "it's bullshit" are the ones that push technological progress forward.
Not invariably. Some of those people are the ones who want to draw 7 red lines, all perpendicular, some with green ink, some with transparent ink, and one that looks like a kitten.
For anyone who hasn't seen what this comment is referencing: https://www.youtube.com/watch?v=BKorP55Aqvg
No, people who say "it's bullshit" and then do something to fix the bullshit are the ones that push technology forward. Most people who say "it's bullshit" instantly when something isn't perfect for exactly what they want right now are just whingers and will never contribute anything except unconstructive criticism.
Sounds like "yes, but" rather than "no"; otherwise you're responding to a self-created straw man.
Phrased another way, the world is still trying to figure out what this technology is actually good for besides generating spam. The "schedule" is finding an actual rationalization for the billions of dollars being pumped into it before the bubble pops.
There's someone with this comment in every thread. Meanwhile, no one answers this because they are getting value. Please take the time to learn, it will give you value.
I’m a consultant. Having looked at several enterprises, there’s a lot of work being done to make a lot of things that don’t really work.
The bigger the ambition, the harder they’re failing. Some well designed isolated use cases are ok. Mostly things about listening and summarizing text to aid humans.
I have yet to see a successful application that is generating good content. IMO, replacing the first draft of content creation and having experts review and fix it is, like, the stupidest strategy you can pursue. The people you replace are the people at the bottom of the pyramid who are supposed to do this work to upskill and become domain experts so they can later review stuff. If they're no longer needed, you're going to lose your reviewers one day, and with them, the ability to assess your generated drafts. It's a foot gun.
> Having looked at several enterprises, there’s a lot of work being done to make a lot of things that don’t really work.
Is this a new phenomenon that started post-LLM?
I mean, no, not generally. But the success rate of other tools is much higher.
A lot of companies are trying to build these general-purpose bots that just magically know everything about the company and have these big knowledge bases, but they just don't work.
I'm someone who generally was a "doubter", but I've dramatically softened my stance on this topic.
Two things: I was casually watching Andreas Kling's streams on Ladybird development (where he was developing a JIT compiler for JS) and was blown away by the accuracy of the completions (and the frequency of those completions).
Prior to this, I'd only ever copypasta'd code from ChatGPT output on occasion.
I started adopting the IDE/Editor extensions and prototyping small projects.
There are now small tools and utilities I've written that I'd not have written otherwise, or that would have taken twice the time had I not used these tools.
With that said, they'd be of no use without oversight, but as a productivity enhancement, the benefits are enormous.
It gives me value but I am not even sure it is $20 a month of value at this point.
It was in 2023, but I've since picked all the low-hanging fruit.
More importantly though, where is all the great output from the people who are getting so much value out of the models?
Is it all privately held? How can that be, with millions of people using these models?
> Meanwhile, no one answers this because they are getting value.
You're literally doing the same thing you're accusing others of. Every HN thread is full of AI boosters claiming AI to be the future with no backing evidence.
Riddle me this. If all these people are "getting value", why are all these companies losing horrendous amounts of money? Why has nobody figured out how to be profitable?
> Please take the time to learn, it will give you value.
Yeah, yeah, just prompt engineer harder. That'll make the stochastic parrot useful. Anyone who has criticism just does so because they're dumb and you're smart. Same as it always was. Everyone opposed to the metaverse just didn't get it bro. You didn't get NFTs bro. You didn't get blockchain bro.
None of these previous bubbles had money in them (beyond scamming idiots). If AI wants to prove it's not another empty tech bubble, pay up. Show me the money. It should be easy, if it's automating so many expensive man-hours of labour. People would be lining up to pay OpenAI.
There’s clearly some value. People are paying for something.
> AI start-ups generate money faster than past hyped tech companies
https://www.ft.com/content/a9a192e3-bfbc-461e-a4f3-112e63d0b...
> Riddle me this. If all these people are "getting value", why are all these companies losing horrendous amounts of money? Why has nobody figured out how to be profitable?
While I agree that LLMs are not currently working great for most envisioned use cases; this premise here is not a good argument. Large LLM providers are not trying to be profitable at the moment. They’re trying to grow and that’s pretty sensible.
Uber was the poster child of this, and for all the mockery it received, Uber is now an unambiguously profitable company.
> Why has nobody figured out how to be profitable?
From what I've seen claimed about OpenAI finances, this is easy: It's a Red Queen's race — "it takes all the running you can do, to keep in the same place".
If their financial position was as simple as "we run this API, we charge X, the running cost is Y", then they're already at X > Y.
But if that was all OpenAI were actually doing, they'd have stopped developing new versions or making the existing models more efficient some time back, while the rest of the industry kept improving their models and lowering their prices, and they'd be irrelevant.
> People would be lining up to pay OpenAI.
They are.
Not that this is either sufficient or necessary to actually guarantee anything about real value. For lack of sufficiency: people collectively paid a lot for cryptocurrencies and NFTs, too (and, before then and outside tech, homeopathic tinctures and sub-prime mortgages). For lack of necessity: there are plenty of free-to-download models.
I get a huge benefit even just from the free chat models. I could afford to pay for better models, but why bother when free is so good? Every time a new model comes out, the old paid option becomes the new free option.
That's what puzzles me now. Everyone with a semblance of engineering expertise knows that if you start with a tool and try to find a problem it could solve, you are doing it wrong. The right way is the opposite: you start with a problem and find the best tool to solve it, and if it's the new shiny tool, so be it, but most of the time it's not.
Except the whole tech world starting with the CEOs seems to do it the "wrong" way with LLMs. People and whole companies are encouraged to find what these things might be actually useful for.
I use them to:
• Build toys that would otherwise require me to learn new APIs (I can read python, but it's not my day job)
• Learn new things like OpenSCAD
• To improve my German
• Learn about the world by allowing me to take photos of things in this world that I don't understand and ask them a question about the content, e.g. why random trees have bands or rectangles of white paint on them
• Help me with shopping, by taking a photo of the supermarket I happen to be in at the time and asking them where I should look for some item I can't find
• Help with meal prep, by allowing me to get a recipe based on what food and constraints I've got at hand rather than the traditional method of "if you want x, buy y ingredients"
Even if they're just an offline version of Wikipedia or Google, they're already a more useful interface for the same actual content.
For a company that sees itself as the undisputed leader and that wants to raise $7 trillion to build fabs, they deserve some of the heaviest levels of scrutiny in the world.
If OpenAI's investment prospectus relies on them reaching AGI before the tech becomes commoditized, everyone is going to look for that weakness.
Everyone's comparing o1 and Claude, but in my experience neither really works well enough to justify paying for them for coding. What I really want is a mode where they ask clarifying questions, ideally many of them, before spitting out an answer. This would greatly improve their utility, producing something with more value than an auto-complete.
Just tell it to do that and it will. Whenever I ask an AI for something and I'm pretty sure it doesn't have all the context I literally just say "ask me clarifying questions until you have enough information to do a great job on this."
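For anyone who wants to script it rather than type it each time, here's a rough sketch of the same idea with the standard OpenAI Python client; the model name, prompts, and the "done" convention are just placeholders, not anything official:

    # Rough sketch: make the model ask clarifying questions before it answers.
    # Uses the official OpenAI Python client; model name and prompts are placeholders.
    from openai import OpenAI

    client = OpenAI()

    SYSTEM = (
        "Before writing any code, ask me clarifying questions until you have "
        "enough information to do a great job. Only produce the solution once "
        "I have answered them."
    )

    messages = [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": "Write a script that syncs two directories."},
    ]

    while True:
        reply = client.chat.completions.create(model="gpt-4o", messages=messages)
        text = reply.choices[0].message.content
        print(text)
        messages.append({"role": "assistant", "content": text})
        answer = input("> ")  # answer its questions; type "done" to stop
        if answer.strip().lower() == "done":
            break
        messages.append({"role": "user", "content": answer})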
And this chain of prompts, combined with the improved CoT reasoner, would produce much better results, more in line with what the coming agentic era promises.
Yes. You can only do so much with the information you get in. The ability to ask good questions, not just of itself in internal monologue style, but actually of the user, would fundamentally make it better since it can get more information in.
As it is now, it has a bad habit of, if it can't answer the question you asked, instead answering a similar-looking question which it thinks you may have meant. That is of course a great strategy for benchmarks, where you don't earn any points for saying you don't know. But it's extremely frustrating for real users, who didn't read their question from a test suite.
I know multiple people who carefully prompt to get that done. The model outputs in strict token order and can't turn around, so you need to make sure the clarifying questions actually come before the answer; otherwise the system can and will come up with post-hoc "reasoning".
Have you used them to build a system to ask you clarifying questions?
Or even instructed them to?
have you tested that this helps? seems pretty simple to script with an agent framework
Just today I got Claude to convert a company’s PDF protocol specification into an actual working python implementation of that protocol. It would have been uncreative drudge work for a human, but I would have absolutely paid a week of junior dev time for it. Instead I wrote it alongside AI and it took me barely more than an hour.
The best part is, I’ve never written any (substantial) python code before.
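Not my actual code, but the general shape of that workflow is something like the sketch below; it assumes the official anthropic Python client and pypdf, and the file name, model, and prompt are illustrative. The output still needs hand review and tests.

    # Sketch only: extract a PDF spec and ask Claude for a first-pass implementation.
    # Assumes the official anthropic client and pypdf; spec.pdf and the prompt are illustrative.
    from pypdf import PdfReader
    import anthropic

    spec_text = "\n".join(page.extract_text() for page in PdfReader("spec.pdf").pages)

    client = anthropic.Anthropic()
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=4096,
        messages=[{
            "role": "user",
            "content": "Implement this wire protocol as a Python module with "
                       "encode/decode functions and docstrings. Spec follows:\n\n" + spec_text,
        }],
    )

    print(response.content[0].text)  # review, test, and iterate on this by hand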
I have to agree. It's still a bit hit or miss, but the hits are a huge time and money saver especially in refactoring. And unlike what most of the rather demeaning comments in those HN threads state, I am not some 'grunt' doing 'boilerplate work'. I mostly do geometry/math stuff, and the AIs really do know what they're talking about there sometimes. I don't have many peers I can talk to most of the time, and Claude is really helping me gather my thoughts.
That being said, I definitely believe it's only useful for isolated problems. Even with Copilot, I feel like the AIs just lack a bigger context of the projects.
Another thing that helped me was designing an initial prompt that really works for me. I think most people just expect to throw in their issue and get a tailored solution, but that's just not how it works in my experience.
It would seem you don't care too much about verifying its output or about its correctness. If you did, it wouldn't take you just an hour. I guess you'll let correctness be someone else's problem.
Archive.is does not work for this article, does anyone have a workaround?
Right. "You have been blocked", is what I get.
But this works: https://www.msn.com/en-us/money/other/the-next-great-leap-in...
this one does https://archive.md/L7fOF (it is just the previous snapshot)
Intuitively, it makes sense that there is going to be some significant friction in LLM development going forward. We're talking about models that will cost upwards of $1bn to train. Save for a technological breakthrough, GPT-6/7 will probably have to wait for hardware to catch up.
I think the main bottleneck right now is training data - they've basically exhausted all public sources of data, so they have to either pay humans to generate new data from scratch or pay for the reasoning models to generate (less useful) synthetic training data. The next bottleneck is hardware, and the least important bottleneck is money.
What I find odd is that o1 doesn't support attaching text documents to chats the way 4o does. For a model that specializes in reasoning, reading long documents seems like a natural feature to have.
If Sama ever reads this: I have no idea why no users seem to focus on this, but it would be really good to prioritise being able to select which model you can use with the custom myGPTs. I know this may be hard or not possible without recreating them, but as far as I can tell it still isn't possible.
I don't think most customers realise how much better the models work with custom GPTs.
You can use the new project feature for that. That's a way of grouping conversations, adding files, etc. Should work with o1 pro as well apparently.
"When using custom instructions or files, only GPT-4o is available". Straight out of the ChatGPT web interface when you try to select which model you want to use.
It seems Google has a massive advantage here, since they can tap all of YouTube to train on. I wonder what OpenAI is using for its video data source.
Train for what? For making videos? Train from people’s comments? There’s a lot of garbage AI slop on YouTube; how would this be sifted out? I think there’s more value here on HN in terms of training data, but even that, to what avail?
YouTube is such a great multimodal dataset—videos, auto-generated captions, and real engagement data all in one place. That’s a strong starting point for training, even before you filter for quality. Microsoft’s Phi-series models already show how focusing on smaller, high-quality datasets, like textbooks, can produce great results. You could totally imagine doing the same thing with YouTube by filtering for high-quality educational videos.
Down the line, I think models will start using video generation as part of how they “think.” Picture a version of GPT that works frame by frame—ask it to solve a geometry problem, and it generates a sequence of images to visualize the solution before responding. YouTube’s massive library of visual content could make something like that possible.
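To make "filter for high-quality educational videos" concrete, here's a toy sketch over a metadata dump; every field name and threshold is hypothetical, just to illustrate the idea:

    # Toy sketch of a quality filter over video metadata; fields and thresholds are
    # entirely hypothetical, just to illustrate "filter for educational quality".
    from dataclasses import dataclass

    @dataclass
    class Video:
        title: str
        category: str
        has_captions: bool
        likes: int
        views: int

    def looks_educational(v: Video) -> bool:
        # Keep captioned videos in lecture-like categories with decent engagement.
        return (
            v.has_captions
            and v.category in {"Education", "Science & Technology"}
            and v.views > 10_000
            and v.likes / max(v.views, 1) > 0.02
        )

    catalog = [
        Video("Linear algebra lecture 3", "Education", True, 40_000, 900_000),
        Video("Reaction compilation #57", "Entertainment", False, 2_000, 1_000_000),
    ]
    keep = [v for v in catalog if looks_educational(v)]
    print([v.title for v in keep])  # -> ['Linear algebra lecture 3']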
How about just an updated GPT-4o with all the newer data? It would go a long way. Currently it doesn't know about anything since Oct 2023 (without having to do a web search).
Good that we already have AGI in o3.
probably because it isn't any better
Meta question: @dang, can we ban MSN links and instead link directly to the original source?
https://archive.md/jKbLs