152 points by knbknb 2 months ago
It's not just AI. Our trade's history is chock-full of valiant efforts to solve problems that were overrun by the exponential decline of computing costs by the time they really worked properly.
Remember DSEE / ClearCase? They had all sorts of complicated virtual file systems to deliver tagged and branched contents of source code repositories. But drive space expanded with a Moore's-law style curve and now we have "git pull". Far simpler. System administrators don't hate our guts for adopting git.
Remember PHIGS? We needed display lists for graphics engines because host machines were too slow. Silicon Graphics took the other approach, and now we have GL.
Remember terminal concentrators like Digital's LAT? You don't? Good. I wish I didn't. (Handling 9600 baud interrupts was too big a load for a host machine. Really.)
Remember optical typesetting machines? The digital outlines / images for creating nice-looking letters used to be too big and complex to use for creating the images for actual pages. You want to use Univers or Gill Sans to set a document? Fine. Buy a Selectric typeball. Or go pay Linotype or Monotype a bundle for a little optical thingy with images of all the letters on it. Take the lid off your typesetting machine and put that thingy into it. You want to set Japanese? Too bad for you. Apple, Adobe, Chuck Bigelow and Kris Holmes, and Matt Carter, and Donald Knuth, decided to ride the exponential rocket and the rest is history.
The bright side: Margaret Hamilton and her team on the Apollo moonshot project used simple, reliable, radiation-hardened, and redundant computers with bugfree software to get those guys to the moon and back.
Let's be careful: Generals always fight the last war. Before starting new valiant efforts, we should carefully assess whether the appropriate technology for the planned delivery date is JMOS -- Just a Matter of Software. Sometimes it might not be the case. But most often it will be.
Ironically everything old is new again; now we have https://vfsforgit.org/ because keeping everything on one disk takes too much space, and OpenGL ES gets rid of immediate mode because communication with the host CPU is too slow.
OpenGL's immediate mode has been discouraged for a very long time, though.
And raytracing is about to do the same thing to all the rasterization innovations of the past few decades. Funny how these kinds of things feel partly depressing.
This post oversimplifies the story by putting all the emphasis on compute power. Deep Blue using brute force to solve chess obviously fits this pattern, but the others?
Let's take computer vision. Alex Krizhevsky et al destroyed the ImageNet competition with a neural network in 2012, kicking off the current AI hype cycle. Essentially everything in their model had been known about since the late 80s. But we also didn't know how to train deep networks much before this (it turned out how you initialise the neural network was important), and we also didn't have a big enough dataset to train such a deep model on until ImageNet. Since then, we have built models that perform another order of magnitude better than the 2012 model, mainly because of improvements to the architectures (a combination of ingenuity and a lot of trial and error).
So compute is necessary, but it isn't enough. I don't buy that we've 'brute forced' image recognition in the same way as chess.
Likewise, the search in Go is a Monte Carlo search, very different from the kind of search used in chess. And the neural nets in AlphaGo guide where to run the search, which is very, very different from brute-force search.
Many of these things have required the giant leaps in compute, but still wouldn't work at all without the concurrent improvements in algorithms.
Along these lines, here's a classic blog post:
"Grötschel, an expert in optimization, observes that a benchmark production planning model solved using linear programming would have taken 82 years to solve in 1988, using the computers and the linear programming algorithms of the day. Fifteen years later — in 2003 — this same model could be solved in roughly 1 minute, an improvement by a factor of roughly 43 million. Of this, a factor of roughly 1,000 was due to increased processor speed, whereas a factor of roughly 43,000 was due to improvements in algorithms! Grötschel also cites an algorithmic improvement of roughly 30,000 for mixed integer programming between 1991 and 2008."
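For what it's worth, the quoted factors are easy to sanity-check. A rough sketch in Python, assuming a 365.25-day year and taking "roughly 1 minute" as exactly one minute (both are my assumptions, not figures from the post):

```python
# Sanity check of the figures quoted above (82 years down to about 1 minute).
MINUTES_PER_YEAR = 365.25 * 24 * 60   # assumption: 365.25-day year

total_speedup = 82 * MINUTES_PER_YEAR  # 82 years expressed in minutes
hardware_factor = 1_000                # quoted processor-speed factor
algorithm_factor = 43_000              # quoted algorithmic factor
combined = hardware_factor * algorithm_factor

print(f"82 years / 1 minute  = {total_speedup:,.0f}x")
print(f"hardware * algorithm = {combined:,}x")
print(f"ratio of the two     = {total_speedup / combined:.3f}")
```

The two numbers agree to well within one percent, so the "43 million" is simply 82 years expressed in minutes, split into a hardware part and an algorithm part.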
I think the author's point, though, is that all the algorithms we put our effort into were algorithms to do just one thing - search - and that we used them in conjunction with more compute power.
I'll agree that the author emphasizes compute power, but his real point still holds. Monte Carlo search may not be classic brute force, and neural networks guiding it may also not be standard, but the two just let you effectively search on a massive scale.
I don't think what you are saying contradicts the text. What he's saying is that we need to put our efforts into how to design and use the tools that tackle the problem space, rather than reasoning about the problem space itself, e.g. how to use neural nets, monte carlo search, etc. That doesn't mean we just throw a for-loop at the data.
But this doesn't work either - convolutional layers in neural networks have a very specific structure, which encodes strong prior knowledge that we have about the problem space (translation invariance). If we just had multilayer perceptrons, we wouldn't be talking about this right now.
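To make the weight-sharing point concrete, here is a minimal pure-Python sketch (the signal and the two-tap kernel are made-up values, and the helper names are mine). Strictly speaking, the convolution itself is translation-equivariant: shifting the input shifts the output by the same amount. Invariance only appears once you pool over positions.

```python
def conv1d(signal, kernel):
    """Valid-mode 1-D cross-correlation: the same weights are reused at every position."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def shift_right(values, n):
    """Shift a list n steps to the right, padding with zeros on the left."""
    return [0] * n + values[:len(values) - n]

signal = [0, 0, 1, 3, 1, 0, 0, 0, 0]   # hypothetical input with one "bump"
kernel = [1, -1]                        # hypothetical difference filter

out = conv1d(signal, kernel)
out_shifted = conv1d(shift_right(signal, 2), kernel)

# Equivariance: convolving the shifted input gives the shifted output.
print(out_shifted == shift_right(out, 2))
# Invariance after global max pooling: the strongest response is unchanged.
print(max(out) == max(out_shifted))
```

A multilayer perceptron has a separate weight for every input position, so nothing forces it to respond the same way when the bump moves; it would have to learn that from data.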
>convolutional layers in neural networks have a very specific structure, which encodes strong prior knowledge that we have about the problem space
Yes. The point of the author is that it doesn't do this symbolically.
Don't get confused with the terms "brute force", "neural net", etc.
The main idea of the author is that AI that uses brute force, simpler statistical methods, NN, etc, wins over AI that tries to implement some deeper reasoning about the problem domain the way humans do (when thinking about it consciously).
Hmm, I'm not sure I see the difference. Why is it not "symbolic"? The symbols that construct the neural network are what encodes translation invariance -- not some vector of reals.
Symbolic as in "symbolic algebra systems", "symbolic AI", etc. Not as in having some symbols in the code for a NN.
A NN doesn't work with the domain objects directly and abstractly (e.g. considering a face, facial features, smiles, etc as first class things and doing some kind of symbolic manipulation at that level).
It crunches numbers that encode patterns capturing those things, but its logic is all about numbers, links between one layer and another, and so on -- it's not a program dealing with high level abstract entities.
To put it in another way, it's the difference between teaching, say, Prolog to identify some concept and a NN to do the same.
E.g. from the link "The most successful form of symbolic AI is expert systems, which use a network of production rules. Production rules connect symbols in a relationship similar to an If-Then statement. The expert system processes the rules to make deductions and to determine what additional information it needs, i.e. what questions to ask, using human-readable symbols."
A NN does nothing like that (not in any immediate, first class, way, where the rules are expressed as plain rules given by the programmer, like "foo is X", "bar has the Y property", etc).
Here's another way to see it: how you'd solve a linear equation with regular algebra (the steps and transformation etc), and how a NN would encode the same.
A symbolic algebra system will let you express an equation in symbolic form (more or less like a mathematician would write it), and even show you all the intermediate steps you'd take until the solution.
A NN trained to solve the same type of equations doesn't do that (and can't). It just tells you the answer (or an approximation thereof).
Your observations don't disprove anything in the article. The author doesn't say that brute force is the way to go, or that one should stop trying to optimize and just wait for computing power to increase. He says that essentially all attempts to build AI by modeling human thinking lead nowhere, because they are inherently too complex, with simpler statistical or search-based methods constantly winning. What you wrote here is absolutely in line with his words. A "neural network", contrary to its name, doesn't work by emulating human thinking.
This really needs a better title, or at least a subtitle. (I clicked only because I recognized the domain name; the title itself is vague puffery which gives no promise of being interesting...)
'compute beats clever'? 'fast > fancy'? 'better big than bright'? 'in the end, brute force wins'?
Or at least call it "AI's Bitter Lesson" or something!
I agree with this fellow. The non-descriptive titles are a glaring issue that needs to be fixed.
I know ~nothing about AI. But to me, this seems a great summary. And as a one-time developmental biologist, I'm struck by these observations:
> One thing that should be learned from the bitter lesson is the great power of general purpose methods, of methods that continue to scale with increased computation even as the available computation becomes very great. The two methods that seem to scale arbitrarily in this way are search and learning.
> The second general point to be learned from the bitter lesson is that the actual contents of minds are tremendously, irredeemably complex; we should stop trying to find simple ways to think about the contents of minds, such as simple ways to think about space, objects, multiple agents, or symmetries.
From what I know about brain development, "search and learning" are key mechanisms. Plus massive overproduction and selection, which is basically learning. Maybe that's the main takeaway from biology.
I was thinking the same thing. When I "play" with a new language or tool or concept, I try lots of different scenarios (search), until I can reliably predict how the new thing will work (learning).
That's pretty much how our brains develop. Neurons are vastly overproduced, during fetal development through the first few years. Ones that make useful connections, and do useful stuff, survive. And the rest die.
Also, as in evolution, ~random variations occur during neuronal proliferation, so there's also selection on epigenetic differences. The same sort of process occurs in the immune system.
In this way, organisms can transcend limitations of their genetic sequences. There's learning at levels of both structure and function.
> Time spent on one is time not spent on the other.
But from the AI researcher's view, "the other" doesn't require time; someone else is advancing the hardware, which is outside of the AI researcher's area. The general method to-be-run on better hardware is known today; it doesn't have to be researched. So should the AI researcher just twiddle their thumbs, waiting for the hardware to improve?
In games like chess, it has long been known that if you have a big enough database, an optimal game can be played. For each board configuration, the entry in the database supplies the optimal move.
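Chess is far too large to tabulate like this in practice, but the idea is easy to show on a toy game. A sketch, assuming a hypothetical subtraction game (take 1 to 3 stones from a pile; whoever takes the last stone wins). The function name build_move_table is made up for illustration:

```python
def build_move_table(max_stones, max_take=3):
    """Map each pile size to (player_to_move_wins, stones_to_take)."""
    # With no stones left, the player to move has already lost.
    table = {0: (False, None)}
    for n in range(1, max_stones + 1):
        best = None
        for take in range(1, min(max_take, n) + 1):
            opponent_wins, _ = table[n - take]
            if not opponent_wins:      # leave the opponent in a losing position
                best = take
                break
        if best is None:
            # Every move loses against perfect play; take 1 as an arbitrary choice.
            table[n] = (False, 1)
        else:
            table[n] = (True, best)
    return table

table = build_move_table(12)
for pile in (5, 7, 8):
    wins, take = table[pile]
    print(f"pile={pile}: take {take}, {'winning' if wins else 'losing'} position")
```

Once the table exists, playing optimally is one dictionary lookup per move, with no search at play time. The catch is the size of the table, which is what rules this out for chess.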
I think the author would agree with you that improving chess-playing is not likely to be a productive domain for advancing AI in future, and for essentially the reasons you give here: it is too formal (this guess is based, in part, on some of the author's other writing).
You can only reliably say that something which has happened several times is something that will always happen if you know the reason why it happened several times. This article seems to think it's Moore's law, which has ended. I think history tends to go in cycles as people over-index on whatever worked well for the last n decades.
This seems to be a trait of humanity in general, not just IT. Look at the finance industry: banks loan based on historical track records, right up until the bubble pops. Every. Single. Time.
Momentum lends itself to easy statistical support, and paradigm shifts are notoriously difficult to predict with any degree of confidence.
No matter how long in the tooth a particular trend might be, no matter how certain you are that a reversal is imminent, it's hard to push against the weight of trend-line evidence.
This kind of over-indexing on statistical correlations in past data, without understanding the causal mechanism, is a common criticism of contemporary AI techniques, especially "deep" RL ;-)
Maybe as someone who hasn't been steeped in AI for the past several decades, I'm not able to appreciate the depth of emotion behind Sutton's statements. I find this kind of vague pontificating to be boring. It seems aimed more at convincing the author of a position, than doing a critical analysis and convincing the reader.
Will compute+data solve AI? Will structured algorithms solve AI? Will neuroscience provide key breakthroughs? Who knows? We'll find out when we find out. This article provides little value beyond historical reminiscing.
In the meanwhile, there are many interesting problems begging for attention where data or compute is limited.
I'll believe it when these techniques can solve real problems in a robust way. More importantly, when they can solve real problems in a robust way, opinions don't matter! Proponents won't need to go around trying to convince people almost in the manner of superstitious belief.
I'm suspicious of hindsight bias.
I'm not sure that, had this been written 10-20 years ago, "learning" would have figured so prominently. Who's to say there isn't a third such big method?
Also, while the lesson fits the facts (easy in hindsight), it will hold... until it doesn't anymore. The end of Moore's law has long been heralded, and we're starting to enter this era. Progress can be made, probably, but transistors can't get any tinier, and you can only put so many cores on one chip. Hardware may continue to provide "free gains" but those will likely be an order of magnitude (or more?) smaller than before.
In fact I stopped my research in supervised learning and switched to collaborative agents ~97 because I saw ML as deadended. Agents would be the thing! (hint: not so much, so far)
I think Moore's law is interesting. Technically Moore's law is about transistor density/integration; in effect it became about CPU performance, and similar phenomena were seen in disk and network performance. Just now we are seeing a move in general architecture away from spinning rust and towards chip-based storage - SSDs and Optane (or just huge DRAM) - which has been much slower than I thought, but is still happening. There will be more progress as we wring out the opportunities in architecture and network devices, but overall you are right - no more Moore's.
Also there's been a wave of progress funded by excitement - it's really hard to see how Google justified the spend on DeepMind's TPU infrastructure, but they did - in contrast to a rational investment from a research council, which would never have bought into AlphaZero and the rest.
There's opportunity to do more - big gaps in datasets, evaluation metrics, refinement of techniques (mac nets, adversarials etc), but it's back to hardscrabble now - and I'm interested to see if this is a Warren Buffett moment. After all, you only see who's wearing shorts when the tide goes out!
> I'm suspicious of hindsight bias.
As you should be, but anti-hindsight bias (or hindsight anti-bias?) is even worse. Not accusing you of that; just making a general observation. Hindsight should inform, not bias in either direction.
>I'm suspicious of hindsight bias.
It's the best kind of bias.
I hate the term "AI" (even though I am CTO of a company with "AI" in its name; we use machine learning/DCNNs in our systems, and the term is very trendy). The problem with "AI" is the "intelligence" part. Intelligence is a construct like "porn": in the famous words of Justice Stewart about defining the latter, "...but I know it when I see it". At best it's very ambiguous, and at worst misleading. There have been many attempts to define intelligence quantitatively and qualitatively -- none of which I find particularly satisfying, and no three scientists in a room would agree on a single interpretation. My problem with TFA is that it is comparing apples to oranges: deep convolutional networks are very different tools, useful for a different subset of problems than Bayesian inference and other statistical methods. Brute force methods like image morphology, object counting, and transforms are useful for yet another set of problems entirely. To say that one has displaced another is an error; in fact, most useful, modern production systems use a combination of all three, each for its own purpose. To make direct comparisons between them while implying that the historical decisions to use one or the other were due to Moore's Law is a false equivalence.
I clearly need my morning coffee.
The majority of businesses and governments are insisting on learning this bitter lesson anew.
In the minds of many business executives and government officials, "explainable AI" means, quite literally, "show it to me as a linear combination of a small number of features" (sometimes called "drivers" or "factors") that have monotonic relationships with measurable outcomes.
I would go further: most people are understandably scared of, and worried about, intelligence that arises from scalable search and learning by self-play.
If explainable AI is too limiting, what's the alternative? What's going to happen when someone gets hauled into court to be held liable for their non-explainable AI's outcomes? Oh right, I know, they'll hide behind corporate limited-liability shenanigans, until people get tired of that and go straight for the guillotines. Or maybe the non-explainable AI's owners will decide they want to prevent that, and ... do you want Skynet? Because that's how you get Skynet. Maybe spend some time thinking about the various awful ways this could play out before concluding that explainability isn't important.
I love the phrase “explainable AI”. We still can’t explain how our intelligence works with any degree of biological detail.
We can't explain the implementation details, but a human system can literally explain the logic she used to reach a decision. For example, for applications in the justice system that AI has been recommended for, this is a highly important quality.
Eeeeeeh... what we do is more like parallel construction. We can give a series of plausible steps to explain where we ended up, but sometimes we can't really explain why we did some of the steps.
> They said that ``brute force" search may have won this time, but it was not a general strategy
It seems self-evident to me that 'brute force' is the most general strategy there is. Any (computable) problem is theoretically solvable by just coding the simplest, most obvious solution, which is usually pretty easy. The run-time of brute force is sometimes an issue, but that just means you need more of it!
On the other hand, at some point we will want AI to learn from a small number of interactions, i.e. an AI that beats a human after playing 10 games of chess/starcraft etc. Right now it takes millions of training matches. Many real world situations don't happen that often, so this fundamentally limits applications of the current generation of AI.
Humans require few or many examples depending on the situation. For example my toddler got some candy from the hospital gift shop six months ago. Walking past the same place today he took off on his own accord and went directly to the candy. A single example was enough to train his candy finding algorithm. On the other hand he has had hundreds of examples of putting on his shoes and still can not manage this on his own.
Show me a human that can win a Starcraft championship after only playing ten games. If you find any, they learned the mechanics and strategy somewhere else. That’s transfer learning, which appears to be in its infancy in the ML community but is making progress.
The scale matters here. I think a better metric for your parent comment would be the delta in skill per game played.
A human is significantly better on game 11 than game 1 (I recently got into Starcraft). Current ML systems are not. It's up for discussion how to take the human's previous experience into account, but the total amount of experience is significantly less than the computer's.
I guess, then, the next step in AI research should be to develop a deep learning network for automatic training examples generation ~~ An AI Machine Learning Trainer of some kind.
This is a horrible post. It advocates just throwing out research and replacing it with black boxes. Sure, they approximate (or even fully extract) the actual behaviour, but they are opaque.
I'd like to remind everyone that science is in the business of understanding, making things less opaque, less magic and engineering benefits from both.
I think you missed the point. It is saying that when we build our understanding of a problem space into an AI system, we inhibit the development of a system that can create its own understanding of the problem space. He gives three very good examples of that. He also explains why people are tempted to do that: it's satisfying and initially improves the results.
Very interesting article. I've often railed against putting your thumb on the scale of (or even worse, second-guessing) machine learning models by applying too many so-called "business rules," especially post hoc rules. If the model doesn't learn on its own what you consider to be obvious structure of the data, then either you've chosen the completely wrong model and it won't be able to learn non-obvious truths either, OR your expectations were wrong and the "obvious" structure isn't real. Indeed, the model discovering, entirely on its own, the same structure as a human analyst is often the first evidence we see that the model works! In any case it does you no good to try and force it to fit your preconceptions with post hoc adjustments. Either fix your preconceptions (if they are mistaken) or switch to a model which naturally agrees with you.
Sutton takes an even more extreme point of view, suggesting that most human feature engineering is similarly a waste of time. It's hard to argue with if you know the history: some of the best computer vision algorithms use exactly two mathematical operations, convolution (which itself only requires addition and multiplication) and the max(a,b) function. (This is true because both ReLU and MaxPool can be implemented with max(), and because a fully connected layer is a special case of a convolution.) A similar story occurred in speech recognition, with human-designed features like phonemes and MFCCs giving way to end-to-end learning. Indeed, even general purpose fully connected neural networks started to work much better once the biologically-motivated sigmoid() and tanh() were replaced with the much simpler ReLU function, which is just ReLU(x) = max(x, 0). What really made the difference was leveraging GPUs, using more data, automating hyperparameter selection, and so on.
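A small pure-Python sketch of the claim in parentheses, using toy numbers and helper names of my own choosing (relu, max_pool, conv1d and dense_unit are illustrative, not any library's API):

```python
def relu(x):
    """ReLU is just max(x, 0)."""
    return max(x, 0.0)

def max_pool(values, size):
    """Non-overlapping 1-D max pooling built only from the two-argument max()."""
    pooled = []
    for start in range(0, len(values) - size + 1, size):
        m = values[start]
        for v in values[start + 1:start + size]:
            m = max(m, v)
        pooled.append(m)
    return pooled

def conv1d(signal, kernel):
    """Valid-mode 1-D cross-correlation using only multiply and add."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def dense_unit(inputs, weights):
    """One output of a fully connected layer (bias omitted for brevity)."""
    return sum(x * w for x, w in zip(inputs, weights))

x = [1.0, -2.0, 3.0, 0.5]   # toy input
w = [0.5, 1.0, -1.0, 2.0]   # toy weights

activations = [relu(v) for v in x]
print(activations)               # negative entries are clamped to zero
print(max_pool(activations, 2))  # one maximum per window of two

# A kernel as wide as the input yields a single output: the dense unit.
print(conv1d(x, w) == [dense_unit(x, w)])
```

The last line is the "special case" in miniature: when the kernel covers the whole input there is only one position to slide to, so the convolution collapses to an ordinary dot product.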
I'm not sure if there's really a lesson there, or if this trend will hold indefinitely, and I'm not sure why the lesson would be "bitter" even if it holds. Certainly opinions are mixed. On the one hand, many researchers such as Andrew Ng are big proponents of end-to-end learning; on the other hand, no one can currently conceive of training a self-driving car that way. But avoiding domain-specific, human-engineered features may be a viable guiding philosophy for making big, across-the-board advances in machine learning.
> Sutton takes an even more extreme point of view, suggesting that most human feature engineering is similarly a waste of time.
In fact, wasn't there an article posted here recently saying that they'd had good results with using learned features to feed traditional non-NN-based machine learning?
OpenAI said as much when discussing their move to a for-profit LP model.
They anticipate that real advances will be made by massively scaling up the compute power they throw at any given problem. That’s driving their fundraising efforts.
If the past 5 years are anything to go by, they’re right.
A reply: A better Lesson - https://rodneybrooks.com/a-better-lesson/
So if Moore's Law is slowing down and expected to end in 2025 (per its Wikipedia entry https://en.wikipedia.org/wiki/Moore%27s_law ), does this "bitter lesson" then need to be reversed?
What's missing in this account is all the interesting stuff that came from the attempt to emulate human reasoning. Sure, it didn't get us image recognition or chess mastery, but we have Prolog, much of what we now know as Lisp, and proof assistants. Deep, powerful tools that augment, but do not replace, human cognition.
I learned a similar lesson from working on robots. At first I attempted to devise methods for Robonaut 2 to do things the way I do because it was designed to be like me. It was missing little things that made my approaches unfeasible, and it was infuriating. At that point I decided it only made sense to make methods that allow the agent to discover its own behaviors, because its merkwelt and my own will never be the same.
> Early methods conceived of vision as searching for edges, or generalized cylinders, or in terms of SIFT features. But today all this is discarded.
These aren't discarded; they are part of ML vision networks today. Edges are one of the 3x3 convolutions that a network can learn, and SIFT etc. are the dense / clustering nets. I'll admit I just googled generalized cylinders (very interesting). There are others like SLAM as well.
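As an illustration of the edge point, here is the classical Sobel kernel applied as a plain 3x3 convolution in pure Python. This is the hand-designed filter, not a learned one; the claim above is only that a conv layer's 3x3 weights can end up looking like it. The toy image and the conv2d helper are assumptions for the example:

```python
# Classical horizontal-gradient Sobel kernel (responds to vertical edges).
SOBEL_X = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]

def conv2d(image, kernel):
    """Valid-mode 2-D cross-correlation on lists of lists."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]

# Toy 4x6 image: dark on the left, bright on the right.
image = [[0, 0, 0, 9, 9, 9] for _ in range(4)]

response = conv2d(image, SOBEL_X)
for row in response:
    print(row)
```

The response is zero over the flat regions and large only where the window straddles the dark-to-bright boundary, which is the same computation a convolutional layer would perform with these nine weights.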
margin: 1em auto;
to make this more readable on a desktop...
you're not wrong, but I installed bookmarklets on my ipad mini to increase and decrease font size because of _HN_, which sets lower-than-normal font sizes. My eyesight isn't great, but this site is the absolute pits for undersized text without max-width set.
And for some bizarre reason Firefox doesn't offer reader view.
I think without specialised CPUs, AI will remain futile.