pona-a 4 days ago

I feel like extensive pretraining goes against the spirit of generality.

If you can create a general machine that can take 3 examples and synthesize a program that predicts the 4th, you've just solved oracle synthesis. If you train a network on all human knowledge, including puzzle making, and then fine-tune it on 99% of the dataset and give it a dozen attempts for the last 1%, you've just made an expensive compressor for test-maker's psychology.

  • Valectar 4 days ago

    This betrays a very naive concept of "knowledge" and "understanding". It presupposes that there's some kind of platonic realm of logic and reason that an AGI just needs to tap into. But ultimately, there can be no meaning, or reasoning, or logic, without context. Matching a pattern of shapes presupposes the concept of a shape, which presupposes a concept of spatial relationships, which presupposes a concept of 3- or even 2-dimensional space. These things only seem obvious and implicit to you because they permeate the environment that your mind spent hundreds of millions of years evolving to interpret, and then tens of years consuming and processing to understand.

    The true test of an AGI is its ability to assimilate disparate information into a coherent world-view, which is effectively what the pretraining is doing. And even then, it is likely that any intelligence capable of doing that will need to be "preloaded" with assumptions about the world it will occupy, structurally. Similar to the regions of the brain which are adept at understanding spatial relationships, or language, or interpreting our senses, etc.

    • gitfan86 4 days ago

      Yes, AGI was here at AlphaGo. People don't like that because they think it should have generalized outside of Go, but when you say AGI was here at AlphaZero, which can play other games, they again say it's not general enough. At this point it seems unlikely that AI will ever be general enough to satisfy the sceptics, for the reason you said. There will always be some domain that requires training on new data.

      • devmor 4 days ago

        You're calling an Apple an Orange and complaining that everyone else won't refer to it as such. AGI is a computer program that can understand or learn any task a human can, mimicking the cognitive ability of a human.

        It doesn't have to actually "think" as long as it can present an indistinguishable facsimile, but if you have to rebuild its training set for each task, that does not qualify. We don't reset human brains from scratch to pick up new skills.

      • FrustratedMonky 4 days ago

        "AI will ever be general enough to satisfy the sceptics for the reason you said"

        Also

        People keep thinking "General" means one AI can "do everything that any human can do everywhere all at once".

        When really, humans are also pretty specialized. Humans have Years of 'training' to do a 'single job'. And they do not easily switch tasks.

        • devmor 4 days ago

          >When really, humans are also pretty specialized. Humans have Years of 'training' to do a 'single job'. And they do not easily switch tasks.

          What? Humans switch tasks constantly and incredibly easily. Most "jobs" involve doing so rapidly many times over the course of a few minutes. Our ability to accumulate knowledge of countless tasks and execute them while improving on them is a large part of our fitness as a species.

          You probably did so 100+ times before you got to work. Are you misunderstanding the context of what a task is in ML/AI? An AI does not get the default set of skills humans take for granted, it's starting as a blank slate.

          • FrustratedMonky 3 days ago

            You're looking at small tasks.

            You don't have a human spend years getting an MBA, then drop them in a Physics Lab and expect them to perform.

            But that is what we want from AI: to do 'all' jobs as well as any individual human does their one job.

            • devmor 3 days ago

              That is a result we want from AI; it is not the exhaustive definition of AGI.

              There are steps of automation that could fulfill that requirement without ever being AGI - it’s theoretically possible (and far more likely) that we achieve that result without making a machine or program that emulates human cognition.

              It just so happens that our most recent attempts are very good at mimicking human communication, and thus are anthropomorphized as being near human cognition.

              • FrustratedMonky 3 days ago

                I agree.

                I'm just making the point that, for AI "General" Intelligence, humans are also not as "general" as we assume in these discussions. Humans are also limited in a lot of ways, narrowly trained, make stuff up, etc...

                So even a human isn't necessarily a good example of what AGI would mean. A human is not a good target either.

                • devmor 3 days ago

                  Humans are our only model of the type of intelligence we are trying to develop, any other target would be a fantasy with no control to measure against.

                  Humans are extremely general. Every single type of thing we want an AGI to do is a type of thing that a human is good at doing, and none of those humans were designed specifically to do that thing. It is difficult for humans to move from specialization to specialization, but we do learn them with only the structure to "learn, generally" being our scaffolding.

                  What I mean by this is that we do want AGI to be general in the way a human is. We just want it to be more scalable. Its capacity for learning does not need to be limited by material issues (i.e. physical brain matter constraints), time, or time scale.

                  So where a human might take 16 years to learn how to perform surgery well, and then need another 12 years to switch to electrical engineering, an AGI should be able to do it the same way, but with the timescale only limited by the amount of hardware we can throw at it.

                  If it has to be structured from the ground up for each task, it is not a general intelligence, it's not even comparable to humans, let alone scalable beyond us.

                  • FrustratedMonky 2 days ago

                    So find a single architecture that can be taught to be an electrical engineer or a doctor.

                    Where today those jobs are being done, but by specialized architectures, models, and combinations of methods.

                    Then that would be a 'general' intelligence: the one type of model that can do either, trained to be an engineer or a doctor. And like a human, once trained, it might not do the other job well. But both did start with the same 'tech', just as humans all have the same architecture in the 'brain'.

                    I don't think it will be an LLM, it will be some combo of methods in use today.

                    Ok, I'll buy that. I'm just not sure everyone is using 'general' in that way. More often, I think, people imagine a single AI instance that can do everything/everywhere/all at once: be an engineer and a doctor at the same time. Since it can do all the tasks at the same time, it is 'general'. Since we are making AIs that can do everything, it could have a case statement inside to switch models (half joking). At some point all the different AI methods will be incorporated together and will appear even more human/general.

                    • gitfan86 2 days ago

                      Right, but even at that point the sceptics will still say that it isn't "truly general", or that it's unable to do X the way a human does. Intelligence, like beauty, is in the eye of the beholder.

                • Jensson 3 days ago

                  But if humans are so bad, what does that say about a model that can't even do what humans can?

                  Humans are a good target since we know human intelligence is possible; it's much easier to target something that is possible rather than some imaginary intelligence.

          • gitfan86 3 days ago

            No human ever got good at tennis without learning the rules. Why would we not allow an AI to also learn the rules before expecting it to get good at tennis?

            • Jensson 3 days ago

              > Why would we not allow an AI to also learn the rules before expecting it to get good at tennis?

              The model should learn the rules; don't build the model around the rules. When you build a model around the rules, it isn't a general model.

              Human DNA isn't made to play tennis, but a human can still learn to play it. The same should hold for a model: it should learn the game, not be designed by humans to play tennis.

        • butlike 3 days ago

          So you're saying AI can be incompetent at a grander scale. Got it.

          • FrustratedMonky 3 days ago

            Yes. It can be as good and as bad as a human. Humans also make up BS answers.

  • jshmrsn 4 days ago

    If the machine can decide how to train itself (adjust weights) when faced with a type of problem it hasn’t seen before, then I don’t think that would go against the spirit of general intelligence. I think that’s basically what humans do when they decide to get better at something, they figure out how to practice that task until they get better at it.

    • pona-a 4 days ago

      In-context learning is a very different problem from regular prediction. It is quite simple to fit a stationary solution to noisy data; that's just a matter of tuning some parameters with fairly even gradients. In-context learning implies you're essentially learning a mesa-optimizer for the class of problems you're facing, which in the form of transformers essentially means fitting something not that far from a differentiable Turing machine with no inductive biases.

  • fsndz 4 days ago

    Exactly. That's basically the problem with a lot of the current paradigms: they don't allow true generalisation. That's why some people say there won't be any AGI anytime soon: https://www.lycee.ai/blog/why-no-agi-openai

    • exe34 4 days ago

      "true generalisation" isn't really something a lot of humans can do.

      • godelski 4 days ago

          > humans CAN do.
        
        I think people often get confused with claims

          - Humans CAN generalize
          - Humans CAN reason
          - Humans CAN be intelligent
          - Humans CAN be conscious
        
        Generalization[0] is something MOST humans CAN do, but MOST humans DO NOT do. Do not confuse "can" and "do".

        One of my pet peeves is how often qualifying words are just ignored. They are critical parts of any communication.[1]

        Another pet peeve is over-anthropomorphization. Anthropomorphism is a useful tool, but... well... we CAN overgeneralize ;)

        [0] I don't know what you mean by "true generalization". I'm not going to address that, because you can always raise the bar for what counts as "true"; let's try to be more concrete. Maybe I misunderstand. I definitely misunderstand.

        [1] Classic example: someone says "most x are y", then there's a rebuttal of "but x_1 isn't y" or "I'm x and I'm not y" or some variation. Great! Most isn't all. This is not engaging in good faith, and there are examples of it with any qualifying word. It is quite common to see.

      • fsndz 4 days ago

        The thing is, LLMs don't even do the kind of generalisations the dumbest human can do, while simultaneously doing some stuff the smartest human probably can't.

      • leptons 4 days ago

        Sadly, a lot of humans choose not to think. We're in an age of willful ignorance.

      • fritzo 4 days ago

        i feel like you're overgeneralizing here

  • tripplyons 4 days ago

    I think that most human learning comes from years of sensory input. Why should we expect a machine to generalize well without any background?

    • aithrowawaycomm 4 days ago

      Newborns (and certainly toddlers) seem to understand the underlying concepts for these things when it comes to visual/haptic object identification and "folk physics":

        A short list of abilities that cannot be performed by CompressARC includes:
      
        Assigning two colors to each other (see puzzle 0d3d703e)
        Repeating an operation in series many times (see puzzle 0a938d79)
        Counting/numbers (see puzzle ce9e57f2)
        Translation, rotation, reflections, rescaling, image duplication (see puzzles 0e206a2e, 5ad4f10b, and 2bcee788)
        Detecting topological properties such as connectivity (see puzzle 7b6016b9)
      
      Note: I am not saying newborns can solve the corresponding ARC problems! The point is that there is a lot of evidence that many of the concepts ARC-AGI is (allegedly) measuring are innate in humans, and maybe most animals; e.g. cockroaches can quickly identify connected/disconnected components when it comes to pathfinding. Again, not saying cockroaches can solve ARC :)

      OTOH, even if orcas were smarter than humans they would struggle with ARC - it would be way too baffling and obtuse if your culture doesn't have the concept of written standardized tests. (I had been solving state-mandated ARCish problems since elementary school.) This also applies to hunter-gatherers, and note the converse: if you plopped me down among the Khoisan in the Kalahari, they would think I was an ignorant moron. But scientifically it makes as much sense to say "human-level intelligence" entails "human-level hunter-gathering" as to say it entails "human-level IQ problems."

      • Ukv 4 days ago

        > there is a lot of evidence that many of the concepts ARC-AGI is (allegedly) measuring are innate in humans

        I'd argue that "innate" here still includes a brain structure/nervous system that evolved on 3.5 billion years worth of data. Extensive pre-training of one kind or another currently seems the best way to achieve generality.

        • SiempreViernes 4 days ago

          So all the billions spent finding tricks and architectures that perform well haven't resulted in any durable structures in contemporary LLMs?

          Each new training from scratch is a perfect blank slate and the only thing ensuring words come out is the size of the corpus?

          • Ukv 4 days ago

            > Each new training from scratch is a perfect blank slate [...]?

            I don't think training runs are done entirely from scratch.

            Most training runs in practice will start from some pretrained weights or distill an existing model - taking some model pretrained on ImageNet or Common Crawl and fine-tuning it to a specific task.

            But even when the weights are randomly initialized, the hyperparameters and architectural choices (skip connections, attention, ...) will have been copied from previous models/papers by what performed well empirically, sometimes also based on trying to transfer our own intuition (like stacking convolutional layers as a rough approximation of our visual system), and possibly refined/mutated through some grid search/neural architecture search on data.

        • onemoresoop 4 days ago

          Sure, and LLMs aren't anything of this sort. While they're an incredible feat of technology, they're just a building block for intelligence, an important building block I'd say.

      • mystified5016 4 days ago

        Newborn brains aren't blank, they are complex beyond our current ability to understand. All mammals are born with a shocking amount of instinctual knowledge built right into their genome.

        All organisms are born pre-trained because if you can't hide or survive the moment you're born, you get eaten.

        • leptons 4 days ago

          > if you can't hide or survive the moment you're born, you get eaten.

          uhhh... no, most newborns can't "hide or survive the moment they're born", no matter the species. I'm sure there are a few examples, but I seriously doubt it's the norm.

          Many species survive by reproducing en masse, where it takes many (sometimes thousands of) eaten offspring for one to survive to adulthood.

          • K0balt 4 days ago

            In humans at least, they lack the motor control to attempt to hide, gather, or hunt, obviously. But they do plenty of other stuff instinctively. With my latest, we learned that infants are inherently potty trained (defecation) if you pay attention to the cues... and I was surprised to find that that was true: the baby communicates the need to go and knows what's happening without any training at all. Amazed to have almost zero soiled diapers at one month.

            Makes sense though, I'm pretty sure mammals don't do well with the insects and diseases that come with a waste-saturated bed.

          • anon7000 4 days ago

            The point clearly still stands: every species on the planet has a long list of attributes and behaviors directly attributable to evolution and “pretraining.” And many, many more based on education (the things a lioness teaches her cubs.)

            I’m not sure we would call anyone intelligent today if they had no education. Intelligence relies on building blocks that are only learned, and the “masters” of certain fields are drawing on decades and decades of learnings about their direct experience.

            So our best examples of intelligence include experience, training, knowledge, evolutionary factors, what have you — so we probably need to draw on that to create a general intelligence. How can we expect to have an intelligence in a certain field if it hasn’t spent a lot of time “ruminating on”/experiencing/learning about/practicing/evolving/whatever, on those types of problems?

          • mystified5016 3 days ago

            Please do some research into developmental psychology. Babies are far, far more sophisticated than you seem to believe.

            • taneq 3 days ago

              Please have a baby and report back first-hand observations. Yes, they're far, far more sophisticated than most (all?) humans can comprehend, but they're also completely incapable for multiple months after birth. This isn't unexpected, human babies are born at what would still be mid-late gestation in almost any other mammal.

              That quote about how "the only intuitive interface ever devised was the nipple"? Turns out there's still a fair bit of active training required all around to even get that going. There's no such thing as intuitive, only familiar.

            • leptons 3 days ago

              Have you ever seen a human baby attempt to hide upon birth? No? Thought so.

      • naasking 4 days ago

        > Newborns (and certainly toddlers) seem to understand the underlying concepts for these things when it comes to visual/haptic object identification and "folk physics"

        Yes, they enjoy millions of years of pretraining thanks to evolution, ie. their pretrained base model has some natural propensity for visual, auditory, and tactile sensory modalities, and some natural propensity for spatial and causal reasoning.

    • Krasnol 4 days ago

      I'd guess it's because we don't want to have another human. We want something better. Therefore, the expectations on the learning process are way beyond what humans do. I guess some are expecting some magic word (formula) which would be like a seed with unlimited potential.

      So like humans after all but faster.

      I guess it's just hard to write a book about the way you write that book.

    • andoando 4 days ago

      It does but it also generalizes extremely well

      • tripplyons 4 days ago

        I haven't seen a convincing argument that it is more sample efficient than a neural network that has seen an equivalent amount of lifetime information. Yann LeCun gave an interesting explanation of how even in the first few years of a child's life, they have seen much more information than the largest pretrained models have.

  • Davidzheng 4 days ago

    ARC is equivalent to a distribution over 4-tuples of images -- with no prior, the last image is uniformly distributed given the first three...

  • ta8645 4 days ago

    The issue is that general intelligence is useless without vast knowledge. The pretraining is the knowledge, not the intelligence.

    • dchichkov 4 days ago

      With long enough context sizes, AGI is not useless without vast knowledge. You could always put a bootstrap sequence into the context (think Arecibo Message), followed by your prompt. A general enough reasoner with enough compute should be able to establish the context and reason about your prompt.

      • ta8645 4 days ago

        Yes, but that just effectively recreates the pretraining. You're going to have to explain everything down to what an atom is, and essentially all human knowledge if you want to have any ability to consider abstract solutions that call on lessons from foreign domains.

        There's a reason people with comparable intelligence operate at varying degrees of effectiveness, and it has to do with how knowledgeable they are.

        • pona-a 4 days ago

          Would that make in-context learning a superset or a subset of pretraining?

          This paper [0] claimed transformers learn a gradient-descent mesa-optimizer as part of in-context learning, while being guided by the pretraining objective, and as the parent mentioned, any general reasoner can bootstrap a world model from first principles.

          [0] https://arxiv.org/pdf/2212.07677
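
          As a rough, self-contained illustration of the claim in [0] (my own numpy sketch, not the paper's code): one step of gradient descent on in-context least squares, starting from zero weights, gives the same prediction as an unnormalized linear-attention readout over the context.

            import numpy as np

            rng = np.random.default_rng(0)
            d, n = 4, 16                      # feature dim, context length
            w_true = rng.normal(size=d)
            X = rng.normal(size=(n, d))       # in-context examples
            y = X @ w_true
            x_q = rng.normal(size=d)          # query point
            eta = 0.1

            # One gradient step on L(w) = 0.5 * sum_i (w.x_i - y_i)^2, starting from w = 0:
            w1 = eta * (y @ X)                # w1 = eta * sum_i y_i * x_i
            pred_gd = w1 @ x_q

            # Unnormalized linear "attention": keys = x_i, values = y_i, query = eta * x_q
            pred_attn = (y * (X @ (eta * x_q))).sum()

            assert np.allclose(pred_gd, pred_attn)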

          • ta8645 4 days ago

            > Would that make in-context learning a superset or a subset of pretraining?

            I guess a superset. But it doesn't really matter either way. Ultimately, there's no useful distinction between pretraining and in-context learning. They're just an artifact of the current technology.

      • conradev 4 days ago

        Isn't knowledge of language necessary to decode prompts?

        • dchichkov 4 days ago

          0 1 00 01 10 11 000 001 010 011 100 101 110 111 0000 0001 0010 0011 0100 0101 0110 0111 1000 1001 1010 1011 1100 1101 1110

          And no, I don't think knowledge of language is necessary. To give a concrete example, tokens from the TinyStories dataset (the dataset size is ~1GB) are known to be sufficient to bootstrap basic language.
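
          (That sequence, for what it's worth, is just every binary string in shortlex order; a few lines of Python, purely to show how little machinery such a bootstrap needs:)

            from itertools import count, product

            def shortlex_binary():
                # yield all binary strings ordered by length, then lexicographically
                for length in count(1):
                    for bits in product("01", repeat=length):
                        yield "".join(bits)

            gen = shortlex_binary()
            print(" ".join(next(gen) for _ in range(30)))
            # 0 1 00 01 10 11 000 001 010 011 100 101 110 111 0000 ...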

    • tripplyons 4 days ago

      I'm not at all experienced in neuroscience, but I think that humans and other animals primarily gain intelligence by learning from their sensory input.

      • FergusArgyll 4 days ago

        You don't think a lot is encoded in genes from before we're born?

        • aaronblohowiak 4 days ago

          >a lot

          this is pretty vague. I certainly don't think mastery of any concept invented in the last thousand years would be considered encoded in genes, though we would want or expect an AGI to be able to learn calculus, for instance. In terms of "encoded in genes", I'd say most of what is asked or expected of AGI goes beyond what feral children (https://en.wikipedia.org/wiki/Feral_child) were able to demonstrate.

        • tripplyons 4 days ago

          I don't disagree, but I think there is much more information encoded in the brain than in the genome. I believe this phenomenon is called the genomic bottleneck.

          There are a few orders of magnitude more neural connections in a human than there are base pairs in a human genome. I would also assume that there are more than 4 possible ways for neural connections to be formed, while there are only 4 possible base pairs. Also, most genetic information corresponds to lower level functions.
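
          Rough numbers, using commonly cited estimates (my own back-of-the-envelope, so take the exact figures loosely):

            ~3.1e9 base pairs x 2 bits each           ≈ 6e9 bits  ≈ 0.75 GB of genome
            ~8.6e10 neurons x ~1e3-1e4 synapses each  ≈ 1e14-1e15 connections

          So connections outnumber base pairs by roughly 4-5 orders of magnitude, before even counting the bits needed to describe each individual connection.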

    • pona-a 4 days ago

      I don't think so. A lot of useful specialized problems are just patterns. Imagine your IDE could take 5 examples of matching strings and produce a regex you can count on working? It doesn't need to know the capital of Togo, metabolic pathways of the eukaryotic cell, or human psychology.
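
      As a toy sketch of that idea (my own illustration, nothing to do with the article's method): brute-force the shortest pattern, from a tiny set of building blocks, that accepts the positive examples and rejects the negative ones.

        import re
        from itertools import product

        POS = ["ab1", "cd2", "ef3"]      # strings the regex must match
        NEG = ["abc", "123", "a1b"]      # strings it must reject

        # Tiny hypothesis space: sequences of character-class "tokens".
        TOKENS = [r"[a-z]", r"[0-9]", r"[a-z]+", r"[0-9]+"]

        def synthesize(pos, neg, max_len=4):
            # return the first (i.e. shortest) token sequence consistent with the examples
            for length in range(1, max_len + 1):
                for combo in product(TOKENS, repeat=length):
                    pattern = "".join(combo)
                    if (all(re.fullmatch(pattern, s) for s in pos)
                            and not any(re.fullmatch(pattern, s) for s in neg)):
                        return pattern
            return None

        print(synthesize(POS, NEG))      # [a-z]+[0-9]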

      For that matter, if it needs no pre-training, that means it can generalize to new programming languages, libraries, and entirely new tasks. You could use it to analyze the grammar of a dying African language, write stories in the style of Hemingway, or diagnose cancer from patient data. In all of these, there are only so many samples to fit on.

      • ta8645 4 days ago

        Of course, none of us have exhaustive knowledge. I don't know the capital of Togo.

        But I do have enough knowledge to know what an IDE is and where that sits in a technological stack; I know what a string is, and all that it relies on, etc. There's a huge body of knowledge that is required to even begin approaching the problem. If you posted that challenge to an intelligent person from 2000 years ago, they would just stare at you blankly. It doesn't matter how intelligent they are, they have no context to understand anything about the task.

        • pona-a 4 days ago

          > If you posted that challenge to an intelligent person from 2000 years ago, they would just stare at you blankly.

          Depending on how you pose it. If I give you a long enough series of ordered cards, you'll, on some basic level, begin to understand their spatiotemporal dynamics. You'll get the intuition that there's a stack of heads scanning the input, moving forward each turn, either growing the mark, falling back, or aborting. If not constrained by using matrices, I can draw you a state diagram, which would have much clearer immediate metaphors than colored squares.

          Do these explanations correspond to some priors in human cognition? I suppose. But I don't think you strictly need them for effective few-shot learning. My main point is that learning itself is a skill, which generalist LLMs do possess, but only as one of their competencies.

          • ta8645 4 days ago

            Well Dr. Michael Levin would agree with you in the sense that he ascribes intelligence to any system that can accomplish a goal through multiple pathways. So for instance the single-celled Lacrymaria, lacking a brain or nervous system, can still navigate its environment to find food and fulfill its metabolic needs.

            However, I assumed what we're talking about when we discuss AGI is what we'd expect a human to be able to accomplish in the world at our scale. The examples of learning without knowledge you've given, to my mind at least, are a lower level of intelligence that doesn't really approach human level AGI.

      • bloomingkales 4 days ago

        > A lot of useful specialized problems are just patterns.

        > It doesn't need to know the capital of Togo, metabolic pathways of the eukaryotic cell, or human psychology.

        What if knowing those things distills down to a pattern that matches a pattern of your code and vice versa? There's a pattern in everything, so know everything, and be ready to pattern match.

        If you just look at object oriented programming, you can easily see how knowing a lot translates to abstract concepts. There's no reason those concepts can't be translated bidirectionally.

        • sejje 4 days ago

          In fact, a couple of years ago they were saying that training models on code translated to better logical thinking in other domains "for free".

    • raducu 4 days ago

      > The pretraining is the knowledge, not the intelligence.

      I thought the knowledge was the training set and the intelligence an emergent side effect of reproducing that knowledge while making sure the reproduction is not rote memorisation?

      • ta8645 4 days ago

        I'd say that it takes intelligence to encode knowledge, and the more knowledge you have, the more intelligently you can encode further knowledge, in a virtuous cycle. But once you have a data set of knowledge, there's nothing to emerge, there are no side effects. It just sits there doing nothing. The intelligence is in the algorithms that access that encoded knowledge to produce something else.

        • esafak 4 days ago

          The data set is flawed, noisy, and its pieces are disconnected. It takes intelligence to correct its flaws and connect them parsimoniously.

          • ta8645 4 days ago

            It takes knowledge to even know they're flawed, noisy, and disconnected. There's no reason to "correct" anything, unless you have knowledge that applying previously "understood" data has in fact produced deficient results in some application.

            That's reinforcement learning -- an algorithm which requires accurate knowledge acquisition to be effective.

            • esafak 4 days ago

              Every statistical machine learning algorithm, including RL, deals with noisy data. The process of fitting aims to remove the sampling noise, revealing the population distribution, thereby compressing it into a model.

              The argument being advanced is that intelligence is the proposal of more parsimonious models, aka compression.

              • ta8645 4 days ago

                I've lost track of what we're disagreeing about.

  • godelski 4 days ago

      > I feel like extensive pretraining goes against the spirit of generality.
    
    What do you mean by generality?

    Pretraining is fine. It is even fine in the pursuit of AGI. Humans and every other animal have "baked in" memory. You're born knowing how to breathe and having latent fears (chickens and hawks).

    Generalization is the ability to learn on a subset of something and then adapt to the entire (or a much larger portion) of the superset. It's always been that way. Humans do this, right? You learn some addition, subtraction, multiplication, and division and then you can do novel problems you've never seen before that are extremely different. We are extremely general here because we've learned the causal rule set. It isn't just memorization. This is also true for things like physics, and is literally the point of science. Causality is baked into scientific learning. Of course, it is problematic when someone learns a little bit of something and thinks they know way more about it, but unfortunately ego is quite common.

    But also, I'm a bit with you. At least with what I think you're getting at. These LLMs are difficult to evaluate because we have no idea what they're trained on and you can't really know what is new, what is a slight variation from the training, and this is even more difficult considering the number of dimensions involved (meaning things may be nearly identical in latent space though they don't appear so to us humans).

    I think there's still a lot of ML/AI research that can and SHOULD be done at smaller scales. We should be studying more about this adaptive learning, and not just in the RL setting. One major gripe I have with the current research environment is that we are not very scientific when designing experiments. They are highly benchmark/dataset-evaluation focused. Evaluation needs to go far beyond test cases. I'll keep posting this video of Dyson recounting his work being rejected by Fermi[0][1]. You have to have a good "model".

    What I do not see happening in ML papers is proper variable isolation and evaluation based on this, i.e. hypothesis testing. Most papers I see are not providing substantial evidence for their claims. It may look like they do, but the devil is always in the details. When doing extensive hyper-parameter tuning it becomes very difficult to determine if the effect is something you've done via architectural changes, change in the data, change in training techniques, or change in hyperparameters. To do a proper evaluation would require huge ablations with hold-one-out style scores reported. This is obviously too expensive, but the reason it gets messy is that there's a concentration on getting good scores on whatever evaluation dataset is popular. But you can show a method's utility without beating others! This is a huge thing many don't understand.

    Worse, by changing hyper-parameters to optimize for the test-set result, you are doing information leakage. Anything you change based on the result of the evaluation set is, by definition, information leakage. We can get into the nitty gritty to prove why this is, but this is just common practice these days. It is the de facto method and yes, I'm bitter about this. (A former physicist who came over to ML because I loved the math and Asimov books.)

    [0] https://www.youtube.com/watch?v=hV41QEKiMlM

    [1] I'd also like to point out that Dyson notes that the work was still __published__. Why? Because it still provides insights and the results are useful to people even if the conclusions are bad. Modern publishing seems to be more focused on novelty and is highly misaligned from scientific progress. Even repetition helps. It is information gain. But repetition results in lower information gain with each iteration. You can't determine correctness by reading a paper, you can only determine correctness by repetition. That's the point of science, right? As I said above? Even negative results are information gain! (sorry, I can rant about this a lot)

AIorNot 4 days ago

I was thinking about this Lex Fridman podcast with Marcus Hutter. Also, Joscha Bach defined intelligence as the ability to accurately model reality... is lossless compression itself intelligence, or a best-fit model - is there a difference? https://www.youtube.com/watch?v=E1AxVXt2Gv4

  • meowface 4 days ago

    Incidentally, François Chollet, creator of ARC-AGI, argues in this 2020 Lex Fridman podcast that intelligence is not compression: https://youtu.be/-V-vOXLyKGw

    • visarga 4 days ago

      Compression works on finite data, and AI models need to stay open to new data. So they should be less greedy.

  • akomtu 4 days ago

    It's the ability to find a simple model that predicts a complex reality with a high accuracy and low latency. So we need to consider the four axes, and AI will be a region in this space.

    Edit: In fact, there is a simple test for intelligence: can you read a function in C and tell how a change in input changes the output? For complex algorithms, you have to build an internal model, because how else are you going to run qsort on a million items? That's also how you'd tell if a student is faking it or really understands. A harder test would be to do the opposite: from a few input/output examples, come up with an algorithm.
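
    For instance (a made-up example of the kind of function meant above, written in Python rather than C for brevity):

      def f(xs):
          # maximum prefix sum of xs
          total, best = 0, float("-inf")
          for x in xs:
              total += x
              best = max(best, total)
          return best

      # The test: without running it, how does the output move if xs[0] grows by 1?
      # You need an internal model of the algorithm to answer (every prefix sum
      # includes xs[0], so each grows by 1, and so does their maximum).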

  • fastball 4 days ago

    For a quicker tie-in than watching a whole podcast: the result of Hutter's stance takes the form of the Hutter Prize[1], which in some ways has many of the same goals as ARC-AGI, but sees compression itself as the goalpost towards intelligence.

    [1] http://prize.hutter1.net/

cocomutator 4 days ago

I'm trying to distill the essence of their approach, which imho is concealed behind inessential and particular details such as the choice of this or that compression scheme or prior distributions.

It seems like the central innovation is the construction of a "model" which can be optimized with gradient descent, and whose optimum is the "simplest" model that memorizes the input-output relationships. In their setup, "simplest" has the concrete meaning of "which can be efficiently compressed" but more generally it probably means something like "whose model complexity is lowest possible".

This is in stark contrast to what happens in standard ML: typically, we start by prescribing a complexity budget (e.g. by choosing the model architecture and all complexity parameters), and only then train on data to find a good solution that memorizes the input-output relationships.

The new method turns ML on its head: we optimize the model so that we reduce its complexity as much as possible while still memorizing the input-output pairs. That this is able to generalize from 2 training examples is truly remarkable and imho hints that this is absolutely the right way of "going about" generalization.

Information theory happened to be the angle from which the authors arrived at this construction, but I'm not sure that is the essential bit. Rather, the essential bit seems to be the realization that rather than finding the best model for a fixed pre-determined complexity budget, we can find models with minimal possible complexity.
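
To make the contrast concrete, here is a rough sketch of the two problem statements as loss functions (my own toy paraphrase in numpy, with an L2 norm standing in for whatever description-length proxy one prefers; it is not the authors' actual objective):

  import numpy as np

  def data_error(w, X, Y):
      # squared error on the handful of known input-output pairs
      return float(((X @ w - Y) ** 2).sum())

  def complexity(w, lam=1.0):
      # crude description-length proxy for the model
      return float(lam * (w ** 2).sum())

  # Standard regime: fix the architecture / complexity budget first, then
  #     minimize  data_error(w)
  # Regime described above (as I read it): minimize complexity while (nearly)
  # exactly memorizing the pairs,
  #     minimize  complexity(w)  subject to  data_error(w) <= eps
  # which in practice becomes one objective with a strongly weighted fit term:
  def compressive_loss(w, X, Y, c=10.0):
      return complexity(w) + c * data_error(w)

  X, Y, w = np.eye(3), np.array([1.0, 2.0, 3.0]), np.zeros(3)
  print(data_error(w, X, Y), compressive_loss(w, X, Y))   # 14.0 140.0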

  • getnormality 4 days ago

    The idea of minimizing complexity is less novel than it may seem. Regularization terms are commonly added to loss objectives in optimization, and these regularizers can often be interpreted as penalizing complexity. Duality allows us to interpret these objectives in multiple ways:

    1. Minimize a weighted sum of data error and complexity.

    2. Minimize the complexity, so long as the data error is kept below a limit.

    3. Minimize the error on the data, so long as the complexity is kept below a limit.
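
    Written out schematically (with w the parameters, λ a weight, and ε, C budgets):

      1. minimize_w  error(w) + λ * complexity(w)
      2. minimize_w  complexity(w)  subject to  error(w) <= ε
      3. minimize_w  error(w)       subject to  complexity(w) <= C

    Under the usual convexity assumptions, sweeping λ in (1) traces out the same family of solutions as sweeping ε in (2) or C in (3).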

    It does seem like classical regularization of this kind has been out of fashion lately. I don't think it plays much of a role in most Transformer architectures. It would be interesting if it makes some sort of comeback.

    Other than that, I think there are so many novel elements in this approach that it is hard to tell what is doing the work. Their neural architecture, for example, seems carefully hacked to maximize performance on ARC-AGI type tasks. It's hard to see how it generalizes beyond.

    • cocomutator 4 days ago

      Right, but see how the complexity budget is prescribed ahead of time: we first set the regularization strength or whatever, and then optimize the model. The result is the best model with complexity no greater than the budget. In this standard approach, we're not minimizing complexity, we're constraining it.

      • getnormality 4 days ago

        Again, because of duality, these are not really different things.

        To your point in the other thread, once you start optimizing both data fidelity and complexity, it's no longer that different from other approaches. Regularization has been common in neural nets, but usually in a simple "sum of sizes of parameters" type way, and seemingly not an essential ingredient in recent successful models.

        • cocomutator 4 days ago

          ah I see what you mean, sorry it took me a while. Yes, you're right, the two are dual: "fix the complexity budget, then optimize data error", and "fix the data error budget, then optimize complexity".

          I'm struggling to put a finger on it, but it feels that the approach in the blog post has the property that it finds the _minimum_ complexity solution, akin to driving the regularization strength in conventional ML higher and higher during training, and returning the solution at the highest such regularization that does not materially degrade the error (epsilon in their paper). Information theory plays the role of a measuring device that allows them to measure the error term and model complexity on a common scale, so as to trade them off against each other in training.

          I haven't thought about it much but I've seen papers speculating that what happens in double-descent is finding lower complexity solutions.

  • ealexhudson 4 days ago

    I think you're right about the essential ingredient in this finding, but I feel like this is a pretty ARC-AGI specific result.

    Each puzzle is kind of a similar format, and the data that changes in the puzzle is almost precisely that needed to deduce the rule. By reducing the amount of information needed to describe the rule, you almost have to reduce your codec to what the rule itself is doing - to minimise the information loss.

    I feel like if there was more noise or arbitrary data in each puzzle, this technique would not work. Clearly there's a point at which that gets difficult - the puzzle should not be "working out where the puzzle is" - but this only works because each example is just pure information with respect to the puzzle itself.

    • cocomutator 4 days ago

      I agree with your observation about the exact noise-free nature of the problem. It allows them to formulate the problem as "minimize complexity such that you memorize the X-y relationship exactly". This would need to be generalized to the noisy case: instead of demanding exact memorization, you'd need to prescribe an error budget. But then this error budget seems like an effective complexity metaparameter, doesn't it, and we're back to square zero of cross-validation.

      • ealexhudson 4 days ago

        If we think of the 'budget' as being similar to a bandwidth limit on video playback, there's a kind of line below which the picture starts being pretty unintelligible, but for the most part that's a slider: the less the budget, the slightly less accurate playback you get.

        But because this is clean data, I wonder if there's basically a big gap here: the codec that encodes the "correct rule" can achieve a step-change lower bandwidth requirement than similar-looking solutions. The most elegant ruleset - at least in this set of puzzles - always compresses markedly better. And so you can kind of brute-force the correct rule by trying lots of encoding strategies, and just identify which one gets you that step-change compression benefit.

EigenLord 4 days ago

Interesting. I've been slowly coming to the conclusion that the way forward with machine learning is actually less "machine learning" as we've grown accustomed with it, less pretraining, less data, less search, more direct representation, symbolic processing, more constraint-satisfaction, meta-learning etc. All those things we need less of (pretraining, data, etc) are messy, brute force, and contingent. Working with them, you'll always be dependent on the quality of your data, which is fine if you want to data-mine, but not fine if you want to model the underlying causes of the data.

From my (admittedly sketchy, rushed) understanding of what they're doing, they're essentially trying to uncover the minimal representation of the solution/problem space. Through their tracking of the actual structure of the problem through equivariances, they're actually deriving something like the actual underlying representation of the puzzles and how to solve them, rather than hoping to pick up on this from many solved examples.

pyinstallwoes 4 days ago

Impressive documentation and explanation. Thank you.

It pleases me to find this as it supports my own introspection (heck, it’s in my profile!)

> Intelligence is compressing information into irreducible representation.

d--b 4 days ago

> ARC-AGI, introduced in 2019, is an artificial intelligence benchmark designed to test a system’s ability to infer and generalize abstract rules from minimal examples. The dataset consists of IQ-test-like puzzles, where each puzzle provides several example images that demonstrate an underlying rule, along with a test image that requires completing or applying that rule. While some have suggested that solving ARC-AGI might signal the advent of artificial general intelligence (AGI), its true purpose is to spotlight the current challenges hindering progress toward AGI

Well they kind of define intelligence as the ability to compress information into a set of rules, so yes, compression does that…

  • bubblyworld 4 days ago

    This is not at all as circular or obvious as you are claiming here. Have you actually tried the ARC-AGI problems yourself? They are quite subtle, and test a broad range of abstract concepts. For reference o1-preview scores 21% on the public eval, compared to 34% in the OP.

naasking 4 days ago

> Despite these constraints, CompressARC achieves 34.75% on the training set and 20% on the evaluation set—processing each puzzle in roughly 20 minutes on an RTX 4070.

This phrasing suggests that each puzzle took 20 mins, so for the 100 puzzle challenge that's 33.3 hours, which exceeds the target of 12 hours for the challenge. Pretty cool approach though.

unixpickle 4 days ago

This seems to be pretty much exactly a standard Bayesian deep learning approach, albeit with a heavily engineered architecture.

cgadski 4 days ago

I'm very excited that we're figuring out how to use deep learning on small numbers of data points!

I'm curious about the focus on information compression, though. The classical view of inference as compression is beautiful and deserves more communication, but I think the real novelty here is in how the explicitly "information-constrained" code z participates in the forward pass.

About their overall method, they write:

> It isn’t obvious why such a method is performing compression. You’ll see later how we derived it from trying to compress ARC-AGI.

I must be learning something in my PhD, because the relation with compression _did_ seem obvious! Viewing prediction loss and KL divergence of a latent distribution p(z) as "information costs" of an implicit compression scheme is very classical, and I think a lot of people would feel the same. However, while they explained that an L2 regularization over model weights can be viewed (up to a constant) as an approximation of the bits needed to encode the model parameters theta, they later say (of regularization w.r.t. theta):

> We don’t use it. Maybe it matters, but we don’t know. Regularization measures the complexity of f in our problem formulation, and is native to our derivation of CompressARC. It is somewhat reckless for us to exclude it in our implementation.

So, in principle, the compression/description length minimization point of view isn't an explanation for this success any more than it explains success of VAEs or empirical risk minimization in general. (From what I understand, this model can be viewed as a VAE where the encoding layer has constant input.) That's no surprise! As I see it, our lack of an adequate notion of "description length" for a network's learned parameters is at the heart of our most basic confusions in deep learning.

Now, let's think about the input distribution p(z). In a classical VAE, the decoder needs to rely on z to know what kind of data point to produce, and "absorbing" information about the nature of a particular kind of data point is actually what's expected. If I trained a VAE on exactly two images, I'd expect the latent z to carry at most one bit of information. If CompressARC were allowed to "absorb" details of the problem instance in this way, I'd expect p(z) to degenerate to the prior N(0, 1)—that is, carry no information. The model could, for example, replace z with a constant at the very first layer and overfit the data in any way it wanted.

Why doesn't this happen? In the section on the "decoding layer" (responsible for generating z), the authors write:

> Specifically, it forces CompressARC to spend more bits on the KL whenever it uses z to break a symmetry, and the larger the symmetry group broken, the more bits it spends.

As they emphasize throughout this post, this model is _very_ equivariant and can't "break symmetries" without using the parameter z. For example, if the model wants to do something like produce all-green images, the tensors constituting the "multitensor" z can't all be constant w.r.t. the color channel---at least one of them needs to break the symmetry.

The reason the equivariant network learns a "good algorithm" (low description length, etc.) is unexplained, as usual in deep learning. The interesting result is that explicitly penalizing the entropy of the parameters responsible for breaking symmetry seems to give the network the right conditions to learn a good algorithm. If we took away equivariance and restricted our loss to prediction loss plus an L2 "regularization" of the network parameters, we could still motivate this from the point of view of "compression," but I strongly suspect the network would just learn to memorize the problem instances and solutions.
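
For readers less used to the VAE framing, the per-component "information cost" being discussed is just the diagonal-Gaussian KL against the prior. A generic sketch (not the post's code) of why leaving z at the prior is free while actually using it to break a symmetry costs bits:

  import numpy as np

  def kl_to_standard_normal(mu, sigma):
      # KL( N(mu, sigma^2) || N(0, 1) ) per component, in nats
      return 0.5 * (mu**2 + sigma**2 - 1.0 - 2.0 * np.log(sigma))

  # A z-component left at the prior carries no information and costs nothing:
  print(kl_to_standard_normal(0.0, 1.0))    # 0.0
  # A component the decoder actually relies on (say, to single out one color
  # channel) must move away from the prior, and pays for it in the KL term:
  print(kl_to_standard_normal(2.0, 0.1))    # ~3.8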

  • bubblyworld 3 days ago

    Thanks for this insightful comment. One of my first questions about this was how it avoids some kind of latent space collapse with such a tiny dataset.

    Do you think it's accurate to describe equivariance as both a strength and a weakness here? As in it allows the model to learn a useful compression, but you have to pick your set of equivariant layers up front, and there's little the model can do to "fix" bad choices.

programjames 4 days ago

Here's what they did:

1. Choose random samples z ~ N(μ, Σ) as the "encoding" of a puzzle, and a distribution of neural network weights p(θ) ~ N(θ, <very small variance>).

2. For a given z and θ, you can decode to get a distribution of pixel colors. We want these pixel colors to match the ones in our samples, but they're not guaranteed to, so we'll have to add some correction ε.

3. Specifying ε takes KL(decoded colors || actual colors) bits. If we had sources of randomness q(z), q(θ), specifying z and θ would take KL(p(z) || q(z)) and KL(p(θ) || q(θ)) bits.

4. The authors choose q(z) ~ N(0, 1) so KL(p(z) || q(z)) = 0.5(μ^2 + Σ^2 - 1 - 2ln Σ). Similarly, they choose q(θ) ~ N(0, 1/2λ), and since Var(θ) is very small, this gives KL(p(θ) || q(θ)) = λθ^2.

5. The fewer bits they use, the lower the Kolmogorov complexity, and the more likely it is to be correct. So, they want to minimize the number of bits

a * 0.5(μ^2 + Σ^2 - 1 - 2ln Σ) + λ * θ^2 + c * KL(decoded colors || actual colors).

6. Larger a gives a smaller latent, larger λ gives a smaller neural network, and larger c gives a more accurate solution. I think all they mention is they choose c = 10a, and that λ was pretty large.

They can then train μ, Σ, θ until it solves the examples for a given puzzle. Decoding will then give all the answers, including the unknown answer! The main drawback to this method is, like Gaussian splatting, they have to train an entire neural network for every puzzle. But, the neural networks are pretty small, so you could train a "hypernetwork" that predicts μ, Σ, θ for a given puzzle, and even predicts how to train these parameters.
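
A compact sketch of the resulting objective as I read steps 1-6 (my own paraphrase in numpy; the decoder network itself is elided, `probs` stands for the decoded pixel-color distribution it would produce, and the default weights are placeholders apart from c = 10a):

  import numpy as np

  def gaussian_kl(mu, sigma):
      # KL( N(mu, sigma^2) || N(0, 1) ), summed over latent components (step 4)
      return (0.5 * (mu**2 + sigma**2 - 1.0 - 2.0 * np.log(sigma))).sum()

  def correction_cost(true_onehot, probs, eps=1e-9):
      # cost of correcting the decoded colors to the actual colors (the correction ε from steps 2-3)
      return -(true_onehot * np.log(probs + eps)).sum()

  def description_length(mu, sigma, theta, probs, true_onehot,
                         a=1.0, lam=10.0, c=10.0):
      # step 5: a * latent cost + lam * theta^2 (weight cost) + c * correction cost (all in nats here)
      return (a * gaussian_kl(mu, sigma)
              + lam * (theta**2).sum()
              + c * correction_cost(true_onehot, probs))

  # e.g. a 2-component latent at the prior, zero weights, one decoded pixel over 3 colors:
  print(description_length(np.zeros(2), np.ones(2), np.zeros(4),
                            np.array([0.2, 0.5, 0.3]), np.array([0.0, 1.0, 0.0])))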

  • throwaway314155 4 days ago

    Is there _any_ hope of being able to understand how this works without a background in statistics and information theory? So far, old-school pretraining gets heavy bonus points just for being plausible and easy to understand.

    • essexedwards 4 days ago

      Yes, there is hope for a high-level heuristic understanding. Here's my attempt to explain in more familiar terms.

      They train a new neural network from scratch for each problem. The network is trained only on the data about that problem. The loss function tries to make sure it can map the inputs to the outputs. It also tries to keep the network weights small so that the neural network is as simple as possible. Hopefully a simple function that maps the sample inputs to the sample outputs will also do the right thing on the test input. It works 20~30% of the time.

      • throwaway314155 4 days ago

        Great explanation, thanks. I have some followups if you have the time!

        a.) Why does this work as well as it does? Why does compression/fewer-parameters encourage better answers in this instance?

        b.) Will it naturally transfer to other benchmarks that evaluate different domains? If so does that imply an approach similarly robust to pre-training that can be used for different domains/modalities?

        c.) It works 20-30% of the time - do the researchers find any reason to believe that this could "scale" up in some fashion so that, say, a single larger network could handle any of the problems, rather than needing a new network for each problem? If so, would it improve accuracy as well as robustness?

        • programjames 4 days ago

          Boo, go read the other comments that explain all of this instead of wasting people's time.

          • throwaway314155 4 days ago

            > I have some followups *if you have the time*

            Emphasis mine. No one should feel obligated to answer my questions. I had hoped that was obvious.

    • programjames 4 days ago

      Sure! I'll try to give a more intuitive explanation. Basically, it's been known for a while that intelligence is very related to compression. Why? Suppose you're trying to describe a phenomenon you see (such as an IQ test). To precisely describe what's going on, you'd have to write it in some kind of formal language, like a programming language. Maybe your description looks like:

      This wavey piece can be described by algorithm #1, while this spikey piece can be described by algorithm #2, while this...

      More precisely, you try to write the phenomenon as a weighted sum of these algorithms:

          phenomena = sum weight * algorithm
      
      There are exponentially more algorithms that have more bits, so if you want this sum to ever converge, you need to have exponentially smaller weights for longer algorithms. Thus, most of the weight is concentrated in shorter algorithms, so a simple explanation is going to be a really good one!

      What the authors are trying to do is find a simple (small number of bits) algorithm that reconstructs the puzzles and the example solutions they're given. As a byproduct, the algorithm will construct a solution to the final problem that's part of the puzzle. If the algorithm is simple enough, it won't be able to just memorize the given examples—it has to actually learn the trick to solving the puzzle.

      Now, they could have just started enumerating out programs, beginning with `0`, `1`, `00`, `01`, ..., and seeing what their computer did with the bits. Eventually, they might hit on a simple bit sequence that the computer interprets as an actual program, and in fact one that solves the puzzle. But, that's very slow, and in fact the halting problem says you can't rule out some of your programs running forever (and your search getting stuck). So the authors turned to a specialized kind of computer, one that they know will stop in a finite number of steps...

      ...and that "computer" is a fixed-size neural network! The bit sequence they feed in goes to determine (1) the inputs to the neural network, and (2) the weights in the neural network. Now, they cheat a little, and actually just specify the inputs/weights, and then figure out what bits would have given them those inputs/weights. That's because it's easier to search in the input/weight space—people do this all the time with neural networks.

      They initialize the space of inputs/weights as random normal distributions, but they want to change these distributions to be concentrated in areas that correctly solve the puzzle. This means they need additional bits to specify how to change the distributions. How many extra bits does it take to specify a distribution q, if you started with a distribution p? Well, it's

          - sum q(x) log p(x)  +  sum q(x) log q(x)
      (bits to encode samples from q with p's code, minus the bits with q's own optimal code)
      
      This is known as the KL-divergence, which we write as KL(q||p). They want to minimize the length of their program, which means they want to minimize the expected number of additional bits they have to use, i.e. KL(q(inputs)||p(inputs)) + KL(q(weights)||p(weights)).

      There's a final piece of the puzzle: they want their computer to exactly give the correct answer for the example solutions they know. So, if the neural network outputs an incorrect value, they need extra bits to say it was incorrect, and actually here's the correct value. Again, the expected number of bits is just going to be a KL-divergence, this time between the neural network's output, and the correct answers.

      Putting this altogether, they have a simple computer (neural network + corrector), and a way to measure the bitlength for various "programs" they can feed into the computer (inputs/weights). Every program will give the correct answers for the known information, but the very simplest programs are much more likely to give the correct answer for the unknown puzzles too! So, they just have to train their distributions q(inputs), q(weights) to concentrate on programs that have short bitlengths, by minimizing the loss function

          KL(q(inputs)||p(inputs)) + KL(q(weights)||p(weights)) + KL(outputs||correct answers)
      
      They specify p(inputs) as the usual normal distribution, p(weights) as a normal distribution with variance around 1/(dimension of inputs) (so the values in the neural network don't explode), and finally have trainable parameters for the mean and variance of q(inputs) and q(weights).

  • uh_uh 4 days ago

    Given that all parameters are trained jointly at inference time and a single sample of z is supposed to encode ALL inputs and outputs for a given puzzle (I think), I don't quite understand the role of the latent z here. Feels like μ and Σ could be absorbed into θ (and an additional variance parameter).

    Although if they partition z such that each section corresponds to one input and run f_θ through those sections iteratively, then I guess it makes sense.

    • programjames 3 days ago

      I agree, z (i.e. μ and Σ) could be absorbed into θ, e.g. you always input `[1 0 0 ... 0]`, and the first layer of the neural network would essentially output z. They would have to stop approximating the KL(q(θ)||p(θ)) as O(θ^2) though, so maybe the current setup is more computationally efficient?

      • uh_uh 3 days ago

        Could be. Also, as you imply, they'd have to loosen the regularization penalty on θ, and maybe it's difficult to loosen it such that it won't become too prone to overfitting.

        Maybe their current setup of keeping θ "dumb" encourages the neural network to take on the role of the "algorithm" as opposed to the higher-variance input encoded by z (the puzzle), though this separation seems fuzzy to me.

  • naasking 4 days ago

    > The main drawback to this method is, like Gaussian splatting, they have to train an entire neural network for every puzzle.

    They discuss how to avoid this in the section "Joint Compression via Weight Sharing Between Puzzles".

  • mmazing 4 days ago

    > But, the neural networks are pretty small, so you could train a "hypernetwork" that predicts μ, Σ, θ for a given puzzle, and even predicts how to train these parameters.

    Kurt Gödel (or maybe Douglas Hofstadter, rather) would raise an eyebrow. :)

YeGoblynQueenne 4 days ago

>> The idea that efficient compression by itself lies at the heart of intelligence is not new (see, e.g., Hernández-Orallo & Minaya-Collado, 1998; Mahoney, 1999; Hutter, 2005; Legg & Hutter, 2007). Rather than revisiting those theoretical discussions, we make a practical demonstration instead.

There we go again. Claim: compression <something something> intelligence. Evidence: 34.75% on ARC AGI.

Like Carl Sagan once pointed out, "Observation: You couldn't see a thing. Conclusion: dinosaurs".

https://www.youtube.com/watch?v=w_N_IYi2c0E&themeRefresh=1

  • naasking 4 days ago

    > There we go again. Claim: compression <something something> intelligence.

    I don't see how this could possibly be controversial. Do you agree or disagree that, on a long enough timeline, intelligence lets you build a model of a system that can accurately and precisely reproduce all observations without simply remembering and regurgitating those observations? The model will always be smaller than simply remembering those observations, ergo, the intelligence maximally compressed the observations.

    > Evidence: 34.75% on ARC AGI.

    With no pretraining. Literally the only thing it's seen is 3 examples. And it does it with a very reasonable timeframe. I don't think you appreciate how impressive this is.

    • YeGoblynQueenne 4 days ago

      >> Do you agree or disagree that, on a long enough timeline, intelligence lets you build a model of a system that can accurately and precisely reproduce all observations without simply remembering and regurgitating those observations?

      Why should I agree or disagree with that? Who says that? Is it just some arbitrary theory that you came up with on the spot?

      >> The model will always be smaller than simply remembering those observations, ergo, the intelligence maximally compressed the observations.

      "Maximally compressed the observations" does not follow from the model being smaller. You're missing an assumption of optimality. Even with that assumption, there is nothing in the work above to suggest that compression has something to do with intelligence.

      >> I don't think you appreciate how impressive this is.

      I do and it's not. 20% on unseen tasks. It only works on tasks with a very specific structure, where the solution is simply to reproduce the input with some variation, without having to capture any higher order concepts. With enough time and compute it should be possible to do even better than that by applying random permutations to the input. The result is insignificant and a waste of time.

      • naasking 3 days ago

        > Why should I agree or disagree with that? Who says that? Is it just some arbitrary theory that you came up with on the spot?

        Intelligence has some capabilities. I'm asking you if the capabilities of intelligence encompass the ability described, yes or no?

        > "Maximally compressed the observations" does not follow from the model being smaller.

        No other function can be smaller than the actual function that generates the observations. Correct answers indicate that "intelligence" reproduced the model in order to make a successful prediction, ergo "maximal compression".

        • YeGoblynQueenne 3 days ago

          >> No other function can be smaller than the actual function that generates the observations.

          You mean if I write a shitty program that takes the string "hello", copies it 10 million times and stores the copies, and then generates a single copy any time I run it, there's no algorithm that can achieve a better compression than that because "no other function can be smaller than the actual function that generates the observations"?
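
          Concretely (a trivial Python sketch of the point, nothing more):

            # the "shitty program": stores 10 million copies, outputs one
            stored = "hello" * 10_000_000      # ~50 MB baked into the program
            def shitty():
                return stored[:5]

            # a description of the same behaviour that is ~30 bytes, not ~50 MB
            def small():
                return "hello"

            print(shitty() == small())         # True: same observations, far smaller "function"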

          I suggest, before coming up with fancy theories about compression and intelligence, that you read a bit about computer science, first. For instance, read up on information theory, and maybe about Kolmogorov complexity, the abstractions that people who swear that "compression <something something> intelligence" are basing their ideas on.

          • naasking 3 days ago

            I'm very familiar with Solomonoff Induction, thanks. Are you going to answer my question or make more bad faith tangents to avoid committing to an answer?

            • YeGoblynQueenne 2 days ago

              An answer to what? To some half-baked theory you dreamed up just to sound deep and insightful about intelligence? If you think you have some insight about intelligence go do something useful with it, I don't know, try to publish a paper.

              Or, you know what? I've had it up to here with all the wild-ass swivel-eyed loon theories about what intelligence is and isn't. if you really think you know what intelligence is all about why don't you write it up as a program and run it on a computer, and show us all that your theory is right? Go ahead and create intelligence. Intelligence is compression? Go compress stuff and make it intelligent or whatever it is you think works.

              Put your money where your mouth is. Until then don't accuse others of bad faith.