Ask HN: Why is Ilya saying data is limited when the whole world is data?
In his recent talk, Ilya S. said that running out of data is a fundamental constraint on the scaling laws. He said "we have but one internet".
But I don't understand: there is so much data in the real world beyond the internet. Webcams. Microphones. Cars. Robots... Everything can collect multimodal data and more importantly (for robots) even get feedback loops from reality.
So isn't data functionally infinite? And isn't the only thing standing in the way the number of sensors, open datastreams, and datasets?
Please help me understand.
There's a ton of recent work on data curation / synthetic data generation showing that smaller, high-quality datasets go a lot further than scaling up on noisy web data.
The scaling law plots are on a log scale, so getting more juice out of naive scaling means investing exponentially more resources. We're at the point where the juice isn't worth the squeeze, so people will shift to moving the curve down with new architectures, better-curated datasets, and test-time compute / RL.
See:
- FineWeb: https://arxiv.org/abs/2406.17557
- Phi-4: https://arxiv.org/abs/2412.08905
- DataComp: https://arxiv.org/abs/2406.11794
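To put rough numbers on the log-scale point, here is a minimal sketch. It assumes the reducible loss follows a pure power law in dataset size, loss = a * D^(-alpha). The exponent 0.1 is a made-up illustrative value, not a figure taken from any of the papers above, and `data_multiplier` is a hypothetical helper name.

```python
def data_multiplier(loss_ratio: float, alpha: float) -> float:
    """Factor by which the dataset must grow to multiply the reducible
    loss by `loss_ratio`, assuming loss = a * D ** (-alpha).

    Solving a * (k * D) ** (-alpha) = loss_ratio * a * D ** (-alpha)
    for k gives k = loss_ratio ** (-1 / alpha).
    """
    if not 0.0 < loss_ratio <= 1.0:
        raise ValueError("loss_ratio must be in (0, 1]")
    if alpha <= 0.0:
        raise ValueError("alpha must be positive")
    return loss_ratio ** (-1.0 / alpha)


ALPHA = 0.1  # assumption: illustrative exponent only, not a measured value

for ratio in (0.9, 0.75, 0.5):
    factor = data_multiplier(ratio, ALPHA)
    print(f"loss x{ratio}: needs about {factor:,.0f}x more data")
```

Under these assumptions, halving the reducible loss takes 2^10 = 1024 times more data. That is the "exponentially more resources" problem, and it is why a better curve (a lower `a` or a larger `alpha`) looks more attractive than simply moving further along the same one.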
Feature selection is back!
Thank you, makes sense now
If you show an LLM one webcam feed to train on, that's useful. Two is even more useful. But there are diminishing returns. "Useful training data" is limited.
Humans get to PhD level with barely a drop of training data compared to what LLMs are trained on.
If there were infinite useful data, then scaling AI on data would make sense. Since there isn't, the way forward is getting more efficient at using the data we have.
Is it either/or? Obviously there is a need for more efficiency. But clearly the data in any webcam stream carries meaningful non-random information about the world. So the central dogma of deep learning should mean that if we throw enough compute at it, the underlying useful information / pattern, however minuscule, will be found and compressed.
What am I missing?
Much of the user-generated data stored by tech companies is proprietary, which limits access by external parties.
Think about it from his perspective.
Data from the internet can be chunked, sorted, easily processed, and has a relatively high signal-to-noise ratio. Data from a webcam or a microphone -- if even legal to access in the first place -- would be a mess. Imagine chunking and processing 5TB of that sort of data. Seems to me that the effort would far outweigh the reward.
Robots are a different problem entirely. It's darkly amusing that simple problems of motion through space are more complex to replicate than painting the simulacrum of a masterpiece, or acing the medical licensing exam. We'll probably have AGI before we can mimic the movement of the simple housefly.
Thank you. A few thoughts.
1. Legal clearly hasn't stopped anyone so far. And probably won't in the future if there is economic value. So I suggest we take this out of the equation for now. Obviously it will be a thing to take care of, but I'm asking theoretical questions.
2. Effort vs. reward. Is it just about this? Or is there something more? I.e. is there a clear plateau, or is it just diminishing returns (in which case a sufficiently low cost of energy solves it)?
3. Robots: yes, on a mechanical level. But robots don't need to be mechanically elegant and uber-precise in order to be useful.
WRT 2:
I do indeed think that it's mostly about effort vs. reward, and that a sufficiently low cost of energy, plus sufficient time, would solve it. But the cost of energy would have to be near zero, and the time allotted would have to be very generous. This is because most mic/webcam/etc. data is of very poor quality -- probably poor enough to actually poison datasets -- so it would need to be mercilessly cleaned up, and then laboriously chunked and sorted. When all's said and done, you're going to devote tremendous labor to cutting >99% of your data, and more labor to categorizing what's left.
It might be more fruitful to develop a model that creates new training data from internet-derived data. This is a lot more complicated than it may sound, and I don't think I ought to speculate here as to what might be involved, but it still seems more viable than sorting through a Library of Babel of low-quality material.
You’re thinking “any data”, he’s thinking “useful data for training an LLM”.
But isn't the underlying hypothesis that any data with sufficient underlying pattern is ultimately useful if you throw enough compute at it? His own argument is that even from the nonsense of the internet LLMs could extract general models of the world...
What about all the books written since antiquity?
There are lots of problems where someone has to run experiments to generate data. If the most optimized possible process to perform the experiment is expensive and takes time to generate 1 data point, then all you can do is wait till more data is produced before a solution is found. Think drug discovery.
Hmm... But you don't have to rely on experiments, no? Ilya's original argument was that language is a representation of reality, so even with noisy internet data and zero feedback loops or experiments, a sufficient amount of compute could allow LLMs to recover the underlying world model to some degree. Wouldn't the same hold true with cameras and robot interactions? Just predict the next frame of reality in the same way you predict the next token of language...
(Actions leading to reactions may or may not be part of the vector we are learning. I mean, they should be, but it's not strictly necessary.)
No? What am I missing? Just astronomical compute required? Or something more fundamental?
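To make the analogy concrete, here is a minimal sketch of how next-token and next-frame prediction build their training examples the same way: the target is simply the element that follows a window of context. The name `next_step_pairs` and the string stand-ins for frames are assumptions for illustration. Real frames would be image arrays, and this says nothing about how much compute the resulting model would need.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def next_step_pairs(
    sequence: Sequence[T], context: int
) -> List[Tuple[Tuple[T, ...], T]]:
    """Build (context window, next element) training pairs from one stream."""
    if context < 1:
        raise ValueError("context must be at least 1")
    return [
        (tuple(sequence[i - context:i]), sequence[i])
        for i in range(context, len(sequence))
    ]


# Text stream: predict the next token from the previous two.
tokens = ["the", "cat", "sat", "down"]
token_pairs = next_step_pairs(tokens, context=2)

# Video stream: same construction, with placeholder strings standing in
# for image arrays (an assumption made purely for illustration).
frames = ["frame0", "frame1", "frame2", "frame3"]
frame_pairs = next_step_pairs(frames, context=2)

for window, target in token_pairs + frame_pairs:
    print(window, "->", target)
```

The objective has the same shape in both cases. What differs is the size of each element and how much of it is predictable signal rather than noise, which is where the effort vs. reward question comes back in.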
Try translating differential equations or musical notation or chemical formulas into English. When we find one language to be useless or inefficient at representing reality, we create another language. Language is just a tool we use to think and transfer info from one chimp brain to another.
Complex systems studies is wisdom. We know how communication on the internet behaves. Conway's Law hits hard and the processes of life are not dumb.
Access to physical reality is important when negotiating with the beings that can form under this constraint. People have apparently known this instinctively for a very long time and they are not going to give in to the demands of the AI industry.
It's a great mistake to humanize everything in your consciousness.