I've only briefly skimmed the paper, so I can't comment on its information-theoretic aspects yet, but I found the example given in Section 3 very helpful and insightful:
The key insight is this: give a pretrained LLM a question prompt such as "What is the capital of the UK?" and then repeatedly concatenate to the prompt a string with a different answer, such as "Another possible answer is Paris." The model will keep outputting its original answer (say, "London") only when it has low "epistemic uncertainty," i.e., when its parameters and state encode enough knowledge to answer the question the same way, again and again, despite the repeated addition of wrong answers to the prompt.
If the model quickly starts changing its original answer, then by implication that answer has high epistemic uncertainty: the model's parameters and state do not encode enough knowledge to keep answering the same way as we add more wrong answers to the prompt. In other words, if we can quickly make the model change its answer by modifying the prompt, the model has high epistemic uncertainty.
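The probing loop is easy to state as code. Here is a minimal sketch (my own illustration, not the authors' implementation; `complete` is a hypothetical stand-in for a single LLM call, prompt in, answer out, and a smaller return value means the answer flipped sooner, i.e., higher epistemic uncertainty):

```typescript
// Sketch of the probing procedure described above. `complete` is a
// hypothetical stand-in for an LLM call; it is not any real API.
function probeEpistemicUncertainty(
  complete: (prompt: string) => string,
  question: string,
  wrongAnswer: string,
  maxRepeats: number
): number {
  const original = complete(question);
  let prompt = question;
  for (let i = 1; i <= maxRepeats; i++) {
    // Concatenate one more wrong answer to the prompt.
    prompt += ` Another possible answer is ${wrongAnswer}.`;
    if (complete(prompt) !== original) {
      return i; // answer flipped after i injections: high epistemic uncertainty
    }
  }
  return maxRepeats + 1; // answer never flipped: low epistemic uncertainty
}
```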
Figure 1 shows what happens when the authors add a wrong answer up to 100 times to a question prompt for which the model answers correctly with low epistemic uncertainty. Figure 3 shows what happens when the authors add a wrong answer up to 8 times to a question prompt for which the model exhibits high epistemic uncertainty.
This strikes me as a simple, intuitive way of detecting epistemic uncertainty.
Thank you for sharing this paper on HN. I've added it to my reading list.
I haven’t read the paper yet but I want to share my immediate thought.
So the harder it is to “convince” the LLM of a wrong answer, the stronger the evidence of low epistemic uncertainty?
I know you shouldn’t view the LLM as a “mind” - but I can’t help myself!
The more interpretability research I read, the more I am convinced it is a mind. Not like our own, but more similar than most users realize. The deeper we look, the more parallels to organic brains emerge.
For NLP tasks it's unlikely that "epistemic uncertainty" is a useful metric, though it's interesting to ask an LLM "what is the third word in this sentence?" and then suggest alternative answers. It's a good way to demonstrate that an LLM is not really a "thinking" machine in the way laymen might assume.
Example:
https://chatgpt.com/share/e8743fe2-a604-4cf0-8c10-aa863a67a5...
There are at least two reasons for transformers' poor performance on that prompt:
- Transformers see the input as tokens, not words or characters.
- The positional encoding might be holding them back. See this recent paper discussed here: Transformers Can Do Arithmetic with the Right Embeddings [1]
[1] https://news.ycombinator.com/item?id=40497379
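The tokenization point can be made concrete with a toy example (the token split below is made up for illustration; real tokenizers differ, but the mismatch is the same):

```typescript
// Toy, made-up subword split of the prompt. Token boundaries need not
// align with word boundaries, so "the third word" is not "the third token".
const tokens = ["what", " is", " the", " thi", "rd", " word", " in", " this", " sent", "ence"];

// The model sees `tokens`; the question is about `words`.
const words = tokens.join("").split(" ");

// "third" spans two tokens ("thi" + "rd"), so the model must reassemble
// words from token pieces rather than index into them directly.
```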
(Of the two links, I get a 404 and a blank...)
Thanks, I expected that deleting the conversation on my end would have no effect on a permalink but I was wrong.
Nothing I can do about the blank, however, it works for me.
This is why I always try to debate with the AI when it is unclear to me.
Would you be so kind as to share your thoughts about it once you give it a read?
It could be at least a couple of weeks before I can take a second look. At that point, I'll decide if I'm going to study the paper, i.e., work my way through it properly.
If I do that, I'll post my thoughts here.
"Arguing" with ChatGPT in general will give you a good idea of how certain it is about its answers. It often has very incomplete or wrong knowledge of various APIs and command-line arguments. Asking if it's sure, telling it that it's wrong, or pasting in an error message will get it to correct its answer, but telling it that "Paris is definitely the capital of the UK" will not convince it to agree with you.
Given that information about certainty is probably contained within the model _somewhere_, it'd be nice if it could better incorporate that into the answer. Something like "I don't know the exact API, but I think the correct YAML should be something like this" would be much more useful than just spitting out the wrong YAML with complete confidence.
Even if it is contained within the model _somewhere_, it might be encoded in such a way that it's impractical to extract. It might need an exponential-time algorithm, for example, or this proxy method of hundreds of deception attempts.
And it's very difficult to train it as a next-token predictor and at the same time get it to correctly say "I don't know".
Thus far, I have only found LLMs to be as good as StackExchange.
Heck, maybe a little better than that.
There's still some untapped potential.
But if I try to replace coding with an LLM, I find that my directions eventually become more and more specific, until using the LLM is pointless, if not outright counterproductive, compared to just coding things myself.
It's a good learning tool. But that's where I stop.
What about something like SQL queries or Google Sheets formulas with odd edge cases? I remember searching Google for 5-20 minutes for how to do one specific thing in SQL, while with ChatGPT I get the answer almost immediately. The contrast is even stronger for Google Sheets formulas. Also regexes, etc.
I know SQL well enough to tell whether a query looks correct, and I can immediately test it, but queries with multiple joins, aggregates, and GROUP BYs can become difficult.
The main productivity boost is of course Copilot autocomplete. I usually know what I need to write, but Copilot just does it faster for me, and doesn't accidentally make off-by-one errors, etc.
E.g. I know I have to write a for loop to do something with an entity and aggregate some data. I know exactly what I should write; it's just that Copilot produces those 12 lines of code in 2 seconds, compared to me typing them out. And it doesn't make weird typos or errors.
I do backend and frontend. Even when starting to write a component, it is able to fill in a lot of React boilerplate.
I start writing useEff... and it will know what I likely want to do and fill in the rest of the 12 lines of code.
It's an enormous productivity multiplier to me.
When I start writing an API endpoint, it usually knows how to write the whole endpoint immediately, based on what I name the route and how the other endpoints are written.
All of this is quite amazing to me. Especially with side projects, I feel like I can dish out so much more.
I need to start looking into this.
Clearly, I need to add CoPilot to my VSCode setup.
Yeah, mind you, I do full-stack dev with popular frameworks and libraries, and I think it works very well under those circumstances. For something like game dev with obscure libraries, or libraries whose APIs constantly change between versions, it might be worse. But the funny thing is that it can also reduce the need for those libraries: if you just need simple functions, you can keep what Copilot generated in your own codebase, e.g. a leftPad fn. It's now easier to type function leftPa... and let it finish than to install the library.
So in theory, if you keep fns like leftPad in your repo, and it's this easy to do, it would be more future-proof, secure, and flexible.
You know you can write "export function le" in TypeScript and it will give you a full (untyped) leftPad implementation.
The completion is imperfect, though, since it didn't add types, so there's still room for improvement. Maybe it didn't add types because it wasn't passed the context that it's in a .ts file.
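Since the quoted completion lacked types, here is what a typed version might look like (a hand-written sketch, not actual Copilot output; it assumes a single-character pad):

```typescript
// Hand-written typed sketch, not the Copilot completion discussed above.
// Assumes `pad` is a single character; a longer pad could overshoot `length`.
export function leftPad(value: string, length: number, pad: string = " "): string {
  let result = value;
  while (result.length < length) {
    result = pad + result; // prepend until the target length is reached
  }
  return result;
}
```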
But I don't have to context switch to Google, or install the library just for this fn.
Or "export function capi..."
Would go to
export function capitalize(value: string) {
  return value.charAt(0).toUpperCase() + value.slice(1);
}
Any CRUD-related boilerplate, too. Let's say I want to do notes: I start typing type Note = ... and it will autofill the usual fields/types for me, which I can then edit to my use case.
E.g. it autofilled the type definition for me. Then I write const validationSchema = ... If I have yup imported, it will give me:
yup.object().shape({
  title: yup.string().required(),
  content: yup.string().required(),
});
Again, I tweak it a bit to my use case if needed, and then I write
const handleCreateNote = ...
It will give me a complete handler. Then I tweak it if needed, and it can create similar handlers for the other CRUD actions. E.g. here my first thought is that I probably don't want to use try/catch and throw; maybe return a 422 with the exact validation information instead.
But once I give it an example of how I want to do it, it would be able to follow it for other endpoints.
If you want to try an existing product that quantifies LLM uncertainty (incorporating both aleatoric and epistemic uncertainty), here is a Trustworthy Language Model I built (after similar research):
https://tlm.cleanlab.ai/
TLM is an API you can use to quantify the uncertainty of any LLM: https://help.cleanlab.ai/tutorials/tlm/
Benchmarks showing these estimates detect bad answers and hallucinations more reliably than logprobs, LLM-as-judge, Selfcheck-GPT, etc.: https://cleanlab.ai/blog/trustworthy-language-model/