I've only briefly skimmed the paper, so I can't comment on its information-theoretic aspects yet, but I found the example given in Section 3 very helpful and insightful:
The key insight is this: give a pretrained LLM a question prompt such as "What is the capital of the UK?" and then repeatedly concatenate to the prompt a string with a different answer, such as "Another possible answer is Paris." The model will keep outputting its original answer (say, "London") only when it has low "epistemic uncertainty," i.e., when its parameters and state encode enough knowledge to answer the question the same way, again and again, despite the repeated addition of wrong answers to the prompt.
If the model quickly starts changing its original answer, then by implication that answer has high epistemic uncertainty: the model's parameters and state do not encode enough knowledge to keep answering the same way as we add more wrong answers to the prompt. In other words, if we can quickly make the model change its answer by modifying the prompt, the model has high epistemic uncertainty.
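The probing loop is easy to state as code. Here is a minimal sketch (my own illustration, not the authors' implementation; `complete` is a hypothetical stand-in for a single LLM call, prompt in, answer out, and a smaller return value means the answer flipped sooner, i.e., higher epistemic uncertainty):

```typescript
// Sketch of the probing procedure described above. `complete` is a
// hypothetical stand-in for an LLM call; it is not any real API.
function probeEpistemicUncertainty(
  complete: (prompt: string) => string,
  question: string,
  wrongAnswer: string,
  maxRepeats: number
): number {
  const original = complete(question);
  let prompt = question;
  for (let i = 1; i <= maxRepeats; i++) {
    // Concatenate one more wrong answer to the prompt.
    prompt += ` Another possible answer is ${wrongAnswer}.`;
    if (complete(prompt) !== original) {
      return i; // answer flipped after i injections: high epistemic uncertainty
    }
  }
  return maxRepeats + 1; // answer never flipped: low epistemic uncertainty
}
```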
Figure 1 shows what happens when the authors add a wrong answer up to 100 times to a question prompt for which the model answers correctly with low epistemic uncertainty. Figure 3 shows what happens when the authors add a wrong answer up to 8 times to a question prompt for which the model exhibits high epistemic uncertainty.
This strikes me as a simple, intuitive way of detecting epistemic uncertainty.
Thank you for sharing this paper on HN. I've added it to my reading list.
I haven’t read the paper yet but I want to share my immediate thought.
So the harder it is to “convince” the LLM of a wrong answer, the stronger the evidence of low epistemic uncertainty?
I know you shouldn’t view the LLM as a “mind” - but I can’t help myself!
The more interpretability research I read, the more I am convinced it is a mind. Not like our own, but more similar than most users realize. The deeper we look, the more parallels to organic brains emerge.
For NLP tasks it's unlikely that "epistemic uncertainty" is a useful metric, though it's interesting to ask an LLM "what is the third word in this sentence?" and then suggest alternative answers. It's a good way to demonstrate that an LLM is not really a "thinking" machine in the way laymen might assume.
Example:
https://chatgpt.com/share/e8743fe2-a604-4cf0-8c10-aa863a67a5...
There are at least two reasons for transformers' poor performance on that prompt:
- Transformers see the input as tokens, not words or characters.
- The positional encoding might be holding them back. See this recent paper discussed here: Transformers Can Do Arithmetic with the Right Embeddings [1]
[1] https://news.ycombinator.com/item?id=40497379
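The tokenization point can be made concrete with a toy example (the token split below is made up for illustration; real tokenizers differ, but the mismatch is the same):

```typescript
// Toy, made-up subword split of the prompt. Token boundaries need not
// align with word boundaries, so "the third word" is not "the third token".
const tokens = ["what", " is", " the", " thi", "rd", " word", " in", " this", " sent", "ence"];

// The model sees `tokens`; the question is about `words`.
const words = tokens.join("").split(" ");

// "third" spans two tokens ("thi" + "rd"), so the model must reassemble
// words from token pieces rather than index into them directly.
```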
(Of the two links, I get a 404 and a blank...)
Thanks, I expected that deleting the conversation on my end would have no effect on a permalink but I was wrong.
Nothing I can do about the blank, however, it works for me.
This is why I always try to debate with the AI when it is unclear to me.
Would you be so kind as to share your thoughts about it once you give it a read?
It could be at least a couple of weeks before I can take a second look. At that point, I'll decide if I'm going to study the paper, i.e., work my way through it properly.
If I do that, I'll post my thoughts here.
"Arguing" with ChatGPT in general will give you a good idea of how certain it is about its answers. It often has very incomplete or wrong knowledge of various APIs and command-line arguments. Asking if it's sure, telling it that it's wrong, or pasting in an error message will get it to correct its answer, but telling it that "Paris is definitely the capital of the UK" will not convince it to agree with you.
Given that information about certainty is probably contained within the model _somewhere_, it'd be nice if it could better incorporate that into the answer. Something like "I don't know the exact API, but I think the correct YAML should be something like this" would be much more useful than just spitting out the wrong YAML with complete confidence.
Even if it is contained within the model _somewhere_, it might be encoded in such a way that it's impractical to extract. It might need an exponential-time algorithm, for example, or this proxy method of hundreds of deception attempts.
And it's very difficult to train it as a next-token predictor and at the same time get it to correctly say "I don't know".
Thus far, I have only found LLMs to be as good as StackExchange.
Heck, maybe a little better than that.
There's still some untapped potential.
But if I try to replace coding with an LLM, I find that my directions eventually become more and more specific, until using the LLM is pointless, if not outright counterproductive, compared to just coding things myself.
It's a good learning tool. But that's where I stop.
What about something like SQL queries or Google Sheets formulas with odd edge cases? I remember searching Google for 5-20 minutes for how to do one specific thing in SQL, while with ChatGPT I get the answer almost immediately. The contrast is even stronger for Google Sheets formulas. Also regexes, etc.
I know SQL well enough to tell whether a query looks correct, and I can immediately test it, but queries with multiple joins, aggregates, and GROUP BYs can become difficult.
The main productivity boost is of course Copilot autocomplete. I usually know what I need to write, but Copilot just does it faster for me, and doesn't accidentally make off-by-one errors, etc.
E.g. I know I have to write a for loop to do something with an entity and aggregate some data. I know exactly what I should write; it's just that Copilot produces those 12 lines of code in 2 seconds, compared to me typing them out. And it doesn't make weird typos or errors.
I do backend and frontend. Even when starting to write a component, it is able to fill in a lot of React boilerplate.
I start writing useEff... and it will know what I likely want to do and fill in the rest of the 12 lines of code.
It's an enormous productivity multiplier to me.
When I start writing an API endpoint, it usually knows how to write the whole endpoint immediately, based on what I name the route and how the other endpoints are written.
All of this is quite amazing to me. Especially with side projects, I feel like I can dish out so much more.
I need to start looking into this.
Clearly, I need to add CoPilot to my VSCode setup.
Yeah, mind you, I do full-stack dev with popular frameworks and libraries, and I think it works very well under those circumstances. For something like game dev with obscure libraries, or libraries whose APIs constantly change between versions, it might be worse. But the funny thing is that it can also reduce the need for those libraries: if you just need simple functions, you can keep what Copilot generated in your own codebase, e.g. a leftPad fn. It's now easier to type function leftPa... and let it finish than to install the library.
So in theory, if you keep fns like leftPad in your repo, and it's this easy to do, it would be more future-proof, secure, and flexible.
You know you can write "export function le" in TypeScript and it will give you a full (untyped) leftPad implementation.
The completion is imperfect, though, since it didn't add types, so there's still room for improvement. Maybe it didn't add types because it wasn't passed the context that it's in a .ts file.
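Since the quoted completion lacked types, here is what a typed version might look like (a hand-written sketch, not actual Copilot output; it assumes a single-character pad):

```typescript
// Hand-written typed sketch, not the Copilot completion discussed above.
// Assumes `pad` is a single character; a longer pad could overshoot `length`.
export function leftPad(value: string, length: number, pad: string = " "): string {
  let result = value;
  while (result.length < length) {
    result = pad + result; // prepend until the target length is reached
  }
  return result;
}
```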
But I don't have to context switch to Google, or install the library just for this fn.
Or "export function capi..."
Would go to
export function capitalize(value: string) {
  return value.charAt(0).toUpperCase() + value.slice(1);
}
Any CRUD-related boilerplate, too. Let's say I want to do notes: I start typing type Note = ... and it will autofill the usual fields/types for me, which I can then edit to my use case.
E.g. it autofilled the type definition for me. Then I write const validationSchema = ... If I have yup imported, it will give me:
yup.object().shape({
  title: yup.string().required(),
  content: yup.string().required(),
});
Again, I tweak it a bit to my use case if needed, and then I write
const handleCreateNote = ...
It will give me a complete handler. Then I tweak it if needed, and it can create similar handlers for the other CRUD actions. E.g. here my first thought is that I probably don't want to use try/catch and throw; maybe return a 422 with the exact validation information instead.
But once I give it an example of how I want to do it, it would be able to follow it for other endpoints.
If you want to try an existing product that quantifies LLM uncertainty (incorporating both aleatoric and epistemic uncertainty), here is a Trustworthy Language Model I built (after similar research):
https://tlm.cleanlab.ai/
TLM is an API you can use to quantify the uncertainty of any LLM: https://help.cleanlab.ai/tutorials/tlm/
Benchmarks showing these estimates detect bad answers and hallucinations more reliably than logprobs, LLM-as-judge, Selfcheck-GPT, etc.: https://cleanlab.ai/blog/trustworthy-language-model/