The day can’t come fast enough when we just see things like this as trivial misuse of the tool, like using a hammer to drive in a screw. We use the hammer for nails and the screwdriver for screws. We use LLM for exploring data with language, and our brains for reasoning.
> We use LLM for exploring data with language, and our brains for reasoning
I know this is very cynical of me, but I am becoming very convinced that this is actually the biggest draw to AI for a lot of people
They don't want to use their brains
People talk about how it can summarize long text for them. This is framed as being a time saver, but I'm positive for a lot of people they just don't want to read the full text
It can generate images for people who don't want to learn how to draw, or save them money by not hiring artists
> This is framed as being a time saver, but I'm positive for a lot of people they just don't want to read the full text
Yes, those can be very much the same thing and it does not mean that people don't want to use their brains. If I encounter a long article and have serious doubts as to whether it will be worth my time, reading a summary first helps a lot.
In fact, now that we're on the subject: _Proper_ journalism is _supposed_ to provide the key points at the beginning of the text (as opposed to some silly, information-free nonsense meant to 'set the scene' or 'hook you in'). Don't bury the lead/lede. See https://en.wikipedia.org/wiki/Inverted_pyramid_(journalism)
Half of the value of reading things via HN is seeing the top comment be a summary of the key points of the linked content. It's very, very valuable in this day and age where good writing has taken a back seat to making money.
If you're generating long text to send to people and summarizing long text that people send to you, you're just wasting other people's time.
They also talk about democratizing art: are they using LLMs' probably vast corpus of art feedback to improve their own work? Well, no.
The only proper and actually useful use case for LLM-summarized text I've seen so far was the one I first saw in ChatGPT's UI itself. All the rest, including Apple's notification summaries, has been so obviously problematic that it smells of "investment bubble" from a thousand miles away.
Yes. Please. Stop. AI summaries are often terrible and miss anything resembling a subtle point. Generating long form from a summary is just literally packing the summary with obvious information and made up bullshit.
LLMs generate text, not knowledge. They are great for parsing human culture… but not good at thinking.
I just think of LLMs as “what if my uncle Steve went to college”, because it’s like that. And if I’m using quants, it’s q5kM = 1 beer, Q4 = 6 beers.
Still, drunk, educated uncle Steve is pretty handy sometimes.
> We use the hammer for nails and the screwdriver for screws
The difference is, the hammers and screwdrivers perform a single task, and have been designed and optimised for that specific task.
LLMs are much more versatile and capable of performing a wide range of tasks. Yet, at the same time, their capabilities are ill defined.
I know my example is very contrived, I wasn’t trying very hard, just went with the first thing that came to mind.
> LLMs are much more versatile and capable of performing a wide range of tasks. Yet, at the same time, their capabilities are ill defined.
That’s my point, I want to skip to the part where we know what LLMs are good for, what they are bad for, and just consider them another tool at our disposal. We’re still in the phase of throwing shit at the wall to see what sticks, and it is exhausting more often than not.
My take is:
GOOD: Language parsing.
BAD: Information retrieval.
We are now seeing the LLM being used to parse the question and retrieve information from elsewhere.
Before you would ask the LLM who the president of the US was and the LLM would autocomplete. Now the LLM constructs a query through a tool and searches the internet for an answer.
It parsed the entire internet to have enough data to learn about language, but you don't necessarily want to depend on what it learned, other than to parse the syntax of the user.
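A minimal sketch of that split, with hypothetical `Llm` and `SearchEngine` interfaces standing in for whatever model API and search tool is actually in use (neither is a real library):

```java
// Hypothetical interfaces -- stand-ins for a real model API and search tool.
interface Llm {
    String complete(String prompt);
}

interface SearchEngine {
    String search(String query);
}

class ToolUseSketch {
    // Use the model for language parsing, not as the source of facts:
    // it only rewrites the question into a query and then summarizes
    // what the live search actually returned.
    static String answer(Llm llm, SearchEngine web, String question) {
        String query = llm.complete(
                "Rewrite this question as a short web search query: " + question);
        String results = web.search(query);
        return llm.complete(
                "Answer the question using only these search results.\n"
                + "Question: " + question + "\nResults: " + results);
    }
}
```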
> That’s my point, I want to skip to the part where we know what LLMs are good for, what they are bad for, and just consider them another tool at our disposal.
Totally agree with that.
Wish there was a sober source of discussion around this; alas, we're bombarded with "AGI is here" articles from the NYT and "it's so over for programmers" 24/7.
Hammers are NOT designed with a specific purpose; they are just big heavy metal things with a handle for leverage.
Similarly, LLMs are a thing that turned out to be useful, and we end up looking for use cases for them.
It's similar to the YC analogy of the company that discovers a brick and has to find useful ways to use it: to put out fires, to hit people in the head, etc.
There are myriad specific hammers and each is designed for specialized use: https://www.theengineerspost.com/types-of-hammers/
Those are fine-tuned
> Hammers are NOT designed with a specific purpose, they are just big heavy metal things with a handle for leverage.
Well they’re most certainly not the right tool to fasten screws with :P
> We use LLM for exploring data with language
That seems problematic, too.
https://en.wikipedia.org/wiki/HARKing
Claude does ask questions for clarification, or asks me to provide something it does not know; at least, that has happened many times to me. At other times I have to ask it whether it needs X or Y to be able to answer more accurately, although that may be the case with other LLMs, too. The former, though, was quite a surprise to me, coming from GPT.
I am working on a pet project using tactile "premium" 4/5-way switches in a super-ergonomic form-factor keyboard (initially shaped like the Logitech vertical mouse, but that turned out awful). The only model not to get hung up on Cherry MX and hallucinate 4-way Cherry switches has been Claude (the others did make attempts at other manufacturers, but hallucinated part numbers). It is significantly ahead of the competition.
On this topic, the SimpleQA benchmark has a component measuring hallucination rate, i.e. "know" vs "don't know". OpenAI models have often been more troubled than the rest. See also, from the paper: https://imgur.com/7NDZ0ON (you want a low "Incorrect" score, since "Incorrect" means an attempted answer that was wrong).
I wish hallucination benchmarks were far more popular.
Even 3.7 Sonnet in "slow mode" still gives me some seriously deceptive answers. It can be great, so good that I've grown to over-trust it, and it's smashed me a few times.
I notice it a lot when coding with statically typed languages: you paste the code in and it tells you very quickly that you were "deceived".
It gives extremely confident but wrong answers; I've found it to just be way more convincing.
I have only used 3.7 Sonnet for programming, and mostly with what I already know (apart from 1-2 obscure languages), but I always fed it their docs. It still got some things wrong and took a couple of iterations, but GPT performed much, much worse in this regard, to the point of being useless, while with 3.7 Sonnet I wrote functioning stuff in those obscure languages.
Ah, interesting - I’ve not had much experience with Claude, will give it a go. Thanks.
I would suggest that LLMs don't actually know anything. The knowing is inferred.
An LLM might be seen as a kind of very elaborate linguistic hoax (at least as far as knowledge and intelligence are concerned).
And I like LLMs, don't get me wrong. I'm not a hater.
Deep latent representations aren’t a hoax
The map isn't the territory
Saying LLMs know something might be seen as false.
To knowingly advance something false as true is a hoax.
You know?
I wonder how much of this is an inherent problem that is hard to work a solution into, vs. "confidently guessing the answer every time yields a +x% gain for a model on all of the other benchmark results, so nobody wants to reward the opposite of that".
I use copilot every day and every day I'm more and more convinced that LLMs aren't going to rule the world but will continue to be "just" neat autocomplete tools whose utility degrades the more you expect from them.
Here's an actual sentence I typed yesterday: "the previous three answers you gave me were hallucinations and i'm skeptical, so confirm that this answer is not another one." But then it actually gave me a different (5th) answer that was useful, and it's not clear that reading the docs would have been faster.
Same. I was trying to do something random with Java generics today.
I got 3 wrong answers in a row (that I could easily confirm were wrong by compiling).
Then the 4th worked. It was much faster than reading the JVM spec about the wildcard generic subtyping relation (something I've read before but couldn't quote), and it taught me something I didn't know even though it was wrong.
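For readers who haven't run into it, the wildcard subtyping relation being referenced boils down to: `List<Integer>` is not a subtype of `List<Number>`, but it is a subtype of `List<? extends Number>`. A tiny illustrative example (not the commenter's actual problem):

```java
import java.util.List;

class WildcardDemo {
    // Accepts any List whose element type is Number or a subtype of it.
    static double sum(List<? extends Number> xs) {
        double total = 0;
        for (Number n : xs) {
            total += n.doubleValue();
        }
        return total;
    }

    public static void main(String[] args) {
        List<Integer> ints = List.of(1, 2, 3);
        // Invariance: List<Integer> is not a List<Number>, so this would not compile:
        // List<Number> nums = ints;
        // But the wildcard allows it: List<Integer> IS a List<? extends Number>.
        System.out.println(sum(ints)); // prints 6.0
    }
}
```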
Well humans don't know what they don't know either. I think the bigger problem is that LLMs don't know what they do know.
I know I don't know regex. And if you ask me for a specific regex transform, I'm not going to give you any, I'm going to tell you I don't know. Try that with an LLM.
Hmm yes but you know that regex exists, and therefore know enough about it that you know you don't know it. That's how I interpreted this.
I was thinking more like you don't know that you don't know poop.js, because you didn't know about it until I said it just now. Otherwise you'd have continued blissfully in your unawareness of poop. AKA you didn't know that you didn't know it.
GPT-4 frequently doesn't know what it just said, or can't act on it if asked.
The smarter the human is, the more they know how little they know.
Most of us know the map of unknowns. Only grifters use unknowns as indistinguishable from magic.
Is it that they're overconfident, or that we are also overconfident in their responses?
LLMs aren't an all-knowing power, much like ourselves, but we still take the opinions and ideas of others as true to some extent.
If you are using LLMs and taking their outputs as complete truths or working products, then you're not using them correctly to begin with. You need to exercise a degree of professional and technical skepticism with their outputs.
Luckily, LLMs are moving into the arena of being able to reason with themselves and test their assumptions before giving us an answer.
LLMs can push me in the wrong direction just as much as an answer to a problem on a forum can.
LLMs don't know, period. They can be useful for summarizing well and redundantly publicized information, but they don't "know" even that.
to quote someone else: "at least when I ask an intern to find something they'll usually tell me they don't know and then flail around; AI will just lie with full confidence to my face"
> AI will just lie with full confidence to my face
So, A.I. would make an excellent politician or used car salesman then... ;)
great quote - exactly that
LLMs learn from the internet and refuse to admit they don't know something. I have to admit I'm not entirely surprised by this.
No, I’m not surprised either.
In fact, I’m much more surprised at just how capable they are of such a wide range of tasks, given that they have just ‘learnt from the internet’!
I'm not surprised either. I see this as another example of LLMs' emulating human behavior. I've met way too many people that refuse to admit they didn't know something (he says while looking in the mirror)
I consider this to be a solved problem. Reasoning models are exceptionally good at this. In fact, if you use ChatGPT with Deep Research, it can bug you with questions to the point of annoyance!
Could have also been the fact that my custom GPT instructions included stuff like “ALWAYS clarify something if you don’t understand. Do not assume!”
I don’t know if it’s totally solved, but we are definitely making progress on it.
Most of the comments on this thread are needlessly pessimistic
OpenAI's own tests show an almost 40% hallucination rate. Hardly a solved problem.
I find most articles of the sort "LLMs have this flaw" to be of a cynical one-upmanship kind.
"If you say please LLMs think you are a grandma". Well then don't say you are a grandma. At this point we have a rough idea of what these things are, what their limitations are, people are using them to great effect in very different areas, their objective is usually to hack the LLM into doing useful stuff, while the article writers are hacking the LLM into doing stuff that is wrong.
If a group of guys is making applications with an LLM and another dude is making shit applications with the LLM, am I supposed to be surprised at the latter instead of the former? Anyone can make an LLM do weird shit; the skill and the area of interest are in the former.