There’s something odd about their Track 1 results:
Flan-base gets an alright score, but Flan-large gets 0.0. That seems very fishy, since this sounds like a multiple-choice task. I would not expect a score below 1/N for a model with no physical reasoning at all.
Just because the question is multiple choice doesn't mean an LLM without some imposed constraint (grammars, etc.) will respond in a way that respects the multiple-choice nature of the question. Going off on a tangent in its response would presumably be scored as a non-answer.
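To make that concrete, here's a minimal Python sketch (the scoring functions are invented for illustration; I have no idea how the paper's actual harness grades) showing how a strict exact-match scorer can hand a chatty model 0.0 while a lenient extractor credits the very same output:

```python
import re

def strict_score(response: str, gold: str) -> bool:
    # Exact match: any extra prose ("Sure! The answer is B because...")
    # fails, which is enough to drive a model's score all the way to 0.0.
    return response.strip() == gold

def lenient_score(response: str, gold: str, choices: str = "ABCD") -> bool:
    # Pull the first standalone choice letter out of a free-form reply.
    match = re.search(rf"\b([{choices}])\b", response)
    return match is not None and match.group(1) == gold

reply = "Sure! The answer is B, because heavier objects sink."
print(strict_score(reply, "B"))   # False -> counted wrong
print(lenient_score(reply, "B"))  # True  -> counted right
```

Same model output, opposite verdicts, purely from how the harness parses it.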
Looks very thorough, and definitely a step up from that paper that got a lot of publicity but was based on a single convoluted task. I think an approach like this is the future for benchmarking LLMs in general.
A minor nitpick here is that there is no human baseline to compare against. How would an average human perform on these tests?
Note to self: sit down and compare the strengths and weaknesses of this approach with a classic GOFAI system like SHRDLU.
From what I understand, this is creating abstract "rules" and statements that can be instantiated into question-and-answer datasets to train future versions of a model.
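If I've got that right, the generation step would look something like this toy sketch (the rule, objects, and template are all made up for illustration, not taken from the paper):

```python
import itertools

# Hypothetical abstract rule: "the heavier object sinks faster",
# instantiated over a small pool of concrete objects.
OBJECTS = ["a rock", "a coin", "a cork"]
TEMPLATE = ("Q: {a} is heavier than {b}. "
            "Which of the two sinks faster in water?\n"
            "A: {a}")

def instantiate(objects):
    # Every ordered pair of distinct objects becomes one QA example.
    for a, b in itertools.permutations(objects, 2):
        yield TEMPLATE.format(a=a, b=b)

for qa in instantiate(OBJECTS):
    print(qa, end="\n\n")   # 6 concrete examples from one abstract rule
```

One rule plus a pool of fillers fans out into as many training or eval examples as you like, which is presumably the point.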
This is the modern version of stone age savages worshipping a rock.
Why do you say that?
Some people are having a hard time understanding that the knowledge embedded in language and culture includes much of the skills that we understand as reasoning and inference.
The line of thought is that statistical predictions of language normality cannot possibly equal reasoning, but that ignores the capacity of a metascale n-dimensional matrix to encode much, much more than merely words and their statistical frequency of appearance.
Much like a table of sines and cosines, computation itself can be encoded in data. LLMs allow us to extract this data.
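A tiny example of what I mean: precompute a sine table once, and from then on "computing" sin(x) is pure retrieval from data, no trigonometry at query time:

```python
import math

# Precomputed table: the computation "what is sin(x)?" frozen into data.
STEP = 0.01
SIN_TABLE = [math.sin(i * STEP) for i in range(int(2 * math.pi / STEP) + 1)]

def table_sin(x: float) -> float:
    # Reduce to [0, 2*pi) and look up the nearest precomputed entry;
    # nothing is calculated here except an index.
    x = x % (2 * math.pi)
    return SIN_TABLE[round(x / STEP)]

print(table_sin(1.0), math.sin(1.0))  # ~0.8415 either way
```

Scale that idea up by many orders of magnitude and "the answers are just stored in the weights" stops sounding like a dismissal.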
And there are already priests of this metascale n-dimensional matrix cult. Their preachers are full of mathy words and sound very convincing to laymen.
It’s not really complicated or mathy at all. If you want to understand how something really complex, like an arbitrarily detailed simulation of the world, can exist solely in data, just imagine an arbitrarily large and detailed 3D choose-your-own-adventure book.
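In code, that book is nothing but a transition table, state → {action: next state}; all of the "world" lives in the data (a toy example, obviously):

```python
# A toy "world simulation" that is purely a lookup table. The dynamics
# aren't computed anywhere; they are entirely encoded in the data.
BOOK = {
    "cave":     {"go north": "river", "light torch": "cave_lit"},
    "cave_lit": {"go north": "river", "read wall": "cave_lit"},
    "river":    {"swim": "island", "go south": "cave"},
    "island":   {},
}

state = "cave"
for action in ["light torch", "go north", "swim"]:
    state = BOOK[state].get(action, state)  # unknown actions do nothing
    print(f"{action!r} -> {state}")
```

Make the table arbitrarily large and fine-grained and you have an arbitrarily detailed simulation, sitting entirely in data.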
And of course a huge body of human intellectual output is going to embed a lot more than just the words themselves. To think otherwise is to imagine that there is no meaning in language beyond the words themselves. Insofar as reasoning can be accomplished in language, or ideas represented in it, they are encoded in the arrangement and sequence of the words.
LLMs merely allow us to access this embedded information in a convenient conversational format. No hand-wavy magic needed.