normal
y = f(x)
prompt injection / adversarial example (same thing really)
bad_y = f(x+badness)
tweak badness enough and you will get bad outputs, no matter the defences.
the only ways to fully “fix” it, i.e. to make prompt injection never possible:
1. don’t use ai
2. know the entire input space, output space and the mapping between them. but then we’re not doing machine learning anymore, see 1.
otherwise we’re left with mitigations. and mitigations are always a cat-and-mouse game, with defenders (blue team) catching up. it’s never “fixed”; the latest thing just gets “patched”.
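A minimal sketch of the `y = f(x)` vs `bad_y = f(x+badness)` picture, using a toy linear classifier rather than an LLM. The weights, input and `epsilon` are made-up values for illustration only. With white-box access to the weights, a single fast-gradient-sign-style step is enough to flip the output:

```python
# Toy white-box adversarial example: y = f(x) vs bad_y = f(x + badness).
# Assumption: f is a hypothetical linear classifier, not an LLM.

def f(x, w, b):
    """Return +1 or -1 depending on the sign of w.x + b."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1


def sign(v):
    return (v > 0) - (v < 0)


def fgsm_perturbation(x, w, b, epsilon):
    """Fast-gradient-sign step for a linear model.

    The gradient of w.x + b with respect to x is just w, so moving each
    coordinate by epsilon * sign(w_i), in the direction opposite to the
    current class, changes the score as much as possible while keeping
    every coordinate of the perturbation within [-epsilon, epsilon].
    """
    direction = -f(x, w, b)  # push the score toward the other class
    return [direction * epsilon * sign(wi) for wi in w]


w = [0.5, -1.0, 2.0]
b = 0.1
x = [1.0, 0.5, 0.2]  # score = 0.5 - 0.5 + 0.4 + 0.1 = 0.5, so class +1

y = f(x, w, b)
badness = fgsm_perturbation(x, w, b, epsilon=0.2)
bad_x = [xi + di for xi, di in zip(x, badness)]
bad_y = f(bad_x, w, b)
print(y, bad_y)  # 1 -1
```

Here the step lowers the score by `epsilon * sum(|w_i|) = 0.7`, taking it from 0.5 to -0.2. The toy only illustrates the shape of the argument; it says nothing about how large `badness` must be for a real model.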
> tweak badness enough
assuming you get to do gradient descent AND the context is fixed+known AND you have unlimited compute? sure; is it a realistic setup?
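For contrast with the gradient-descent setup, here is a sketch of a query-only ("black-box") attacker on the same kind of hypothetical toy classifier: it never sees the weights or gradients and just tries random perturbations until the label flips. The model, `epsilon` and query `budget` are illustrative assumptions.

```python
import random

# Query-only attack sketch on a hypothetical toy classifier.
# Assumption: the attacker can call f but cannot see w, b or any gradients.

def f(x, w=(0.5, -1.0, 2.0), b=0.1):
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1


def random_search_attack(x, epsilon, budget, seed=0):
    """Try random +/-epsilon sign perturbations until the label flips.

    Returns the first flipping perturbation, or None if the query
    budget runs out. Each loop iteration costs one call to f.
    """
    rng = random.Random(seed)
    original = f(x)
    for _ in range(budget):
        badness = [rng.choice((-epsilon, epsilon)) for _ in x]
        candidate = [xi + di for xi, di in zip(x, badness)]
        if f(candidate) != original:
            return badness
    return None


x = [1.0, 0.5, 0.2]
badness = random_search_attack(x, epsilon=0.25, budget=100)
print(badness)
```

This toy space has only 2**3 = 8 sign patterns, two of which flip the label at `epsilon=0.25`, so the search succeeds almost immediately; that says nothing about how many queries a real model with a real context would need.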
> the only way to fix ...
the exact same argument applies to any (sufficiently complex) piece of software, with exactly the same conclusion
also technically I'd argue that we do know the input/output space (the set of all token strings of length <= N tokens), and know the mapping (the model is a ~pure function in terms of the API, which is about as good a representation as it gets for a non-invertible mapping); at least it's much closer than with something like linux
> assuming you get to do gradient descent AND the context is fixed+known AND you have unlimited compute? sure; is it a realistic setup?
Clearly nothing so complicated is required, given the prompt in the very article you are commenting on.
> the exact same argument applies to any (sufficiently complex) piece of software, with exactly the same conclusion
Yeah, and the halting problem is hard too, but there are levels to this shit.
> also technically I'd argue that we do know the input/output space (the set of all token strings of length <= N tokens), and know the mapping (the model is a ~pure function in terms of the API, which is about as good a representation as it gets for a non-invertible mapping); at least it's much closer than with something like linux
I would argue we don't even know the desired output for most inputs for an LLM, and they certainly aren't trained on every possible input state. But I think Linux and LLMs are sufficiently different that they aren't really directly comparable like this. After all, Linux is not a pure function and has lots of side effects.
But just to establish an order of magnitude: the context window for GPT-3 was 2,048 tokens, and there were 50,257 tokens in the vocabulary. The input space thus has 50,257^2048 unique states, which is approximately 1.12 × 10^9628. That's an awfully big input space for a single function.
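The arithmetic above can be checked with a few lines of Python, working in log10 so the huge integer never has to be printed:

```python
import math

# Size of the input space described above: every sequence of exactly
# CONTEXT_LEN tokens drawn from a vocabulary of VOCAB_SIZE tokens.
VOCAB_SIZE = 50_257
CONTEXT_LEN = 2_048

# 50257**2048 has thousands of digits, so work in log10 instead of
# printing the integer (recent CPython versions refuse, by default, to
# convert ints with more than 4300 digits to str).
log10_states = CONTEXT_LEN * math.log10(VOCAB_SIZE)
exponent = math.floor(log10_states)
mantissa = 10 ** (log10_states - exponent)

print(f"{mantissa:.2f} x 10^{exponent}")  # 1.12 x 10^9628
```

Counting all strings of length <= 2,048 instead of exactly 2,048 barely changes this: the geometric sum (V^(N+1) - 1)/(V - 1) exceeds V^N by less than a factor of V/(V - 1), about 1.00002 here.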
> clearly nothing ... is required
this isn't even prompt injection; even if it was, how do you go from "exists" to "for all"?
> we don't know the desired output
then what are we talking about? if you don't know how you want your software to behave, how do you define a bug?
> linux is not a pure function ...
which is my point -- it's worse
> to establish an order of magnitude
and for linux?
> this isn't even prompt injection; even if it was, how do you go from "exists" to "for all"?
Yes it is, and nice backtrack in the same sentence there. I've laid out plenty of evidence here so far, it's your turn to start thinking. We'll try the Socratic method.
Given that every LLM seen so far has been vulnerable to prompt injection attacks, what is your possible basis for thinking that one can be made immune to them? I'm going from "multiple attacks of this type exist for all known models, and the attacks exploit a known weakness in the design" to "therefore all LLMs are susceptible to this attack".
You're going from "an attack exists for all known models" to "it's definitely possible to build an LLM that is immune to this attack". That's a much larger leap, so show the logic backing your assertion.
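The "exists" vs "for all" dispute can be written out with explicit quantifiers. Here bad(M, x) is shorthand introduced only for this sketch, meaning "model M produces an undesired output on input x", and M_known is the set of models tested so far:

```latex
\begin{align*}
\text{Observed:}\quad & \forall M \in \mathcal{M}_{\text{known}}\ \exists x:\ \mathrm{bad}(M, x) \\
\text{Every LLM is susceptible:}\quad & \forall M\ \exists x:\ \mathrm{bad}(M, x) \\
\text{An immune LLM exists:}\quad & \exists M\ \forall x:\ \neg\,\mathrm{bad}(M, x)
\end{align*}
```

The last two lines are logical negations of each other, and the observation, which only quantifies over the known models, does not by itself entail either one.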
> then what are we talking about? if you don't know how you want your software to behave, how do you define a bug?
You are the one asserting that input/output mappings existed for the entire space, not me.
>> linux is not a pure function ...
> which is my point -- it's worse
What, is this your first year in CS? No useful system can be a pure function. Side effects are work: if your function doesn't have a side effect, it does no work. Any system that uses an LLM to attempt work will have side effects - they may even include bombing an elementary school in Iran.
>> to establish an order of magnitude
> and for linux?
I've done all the thinking and all the research in this conversation so far, and I even specifically explained that you can't measure state space for a stateful function in a way comparable to a pure function. Clearly you didn't understand that, so if you want to force the comparison, you can start adding up the state space for the linux kernel. Start with the spaces that are covered by tests; valid items include syscalls, registers, hardware interrupts, etc.
Invalid spaces include doing something intentionally stupid like using the entire size of the RAM or the space on the hard disk, since those are accessed on demand and not - like in an LLM - all added together and fed into a blender every time a syscall is made.
> yes it is
agree to disagree
> every LLM has been vulnerable
and every OS had bugs
> show the logic
https://arxiv.org/pdf/1912.10077
> you are the one asserting mappings existed
I know? that's why I'm asking?
> no useful system can be a pure function
why not? surely you can describe useful systems with QM? the evolution operator of a closed system seems pretty pure to me
it's almost as if you could reformulate anything such that the state was one of the arguments of the function
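The reformulation described here is the usual state-passing trick. A minimal sketch, with a hypothetical counter written once with hidden mutable state and once as a pure step function `(state, input) -> (new state, output)`:

```python
from dataclasses import dataclass

# Hypothetical illustration: Counter, State and step are made-up names.

class Counter:
    """Impure version: each call mutates hidden state."""

    def __init__(self):
        self.count = 0

    def increment(self, amount):
        self.count += amount
        return self.count


@dataclass(frozen=True)
class State:
    count: int


def step(state, amount):
    """Pure version: returns a new state and an output, mutates nothing."""
    new_state = State(state.count + amount)
    return new_state, new_state.count


impure = Counter()
impure_outputs = [impure.increment(a) for a in (1, 2, 3)]

state = State(0)
pure_outputs = []
for a in (1, 2, 3):
    state, out = step(state, a)  # thread the state through explicitly
    pure_outputs.append(out)

print(impure_outputs, pure_outputs)  # [1, 3, 6] [1, 3, 6]
```

Both versions produce the same outputs. After the rewrite the state is part of the function's input, so any count of that function's input space has to include every possible state.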
> you can start adding up the state space for the linux kernel
I can give you a lower bound -- (your estimate for LLMs)*2, as you could imagine the state "running two instances of llama.cpp"
1) You’re still wrong, this is prompt injection.
2) You continue to have basic misunderstandings of the issue. That bugs exist in other things does not mean a core design flaw in LLMs can magically be fixed.
3) https://arxiv.org/pdf/1912.10077
This paper doesn’t have any bearing on the question of the separation of user and command data in LLMs. Did you even bother to look at it?
4) Hey, you’re the one that made the claim. If you can't even remember why, I can’t help you.
5) Because the world is stateful.
6) Wow, so you just decided to add up all the RAM after all, huh? If you want to play stupid, like you can’t understand why a real-world linux distribution is stateful while an ideal LLM isn’t, then we can play stupid.
By the broken logic you are trying to apply here, the state space of chatGPT includes the VRAM of all 10,000 GPUs your query runs across. It includes the memory in your computer, it includes the stack of the js interpreter in your browser, it includes the linux kernel itself that all those servers are running on, and so on.
3) do you really not see how UAT (the universal approximation theorem) is relevant to the existence of a model with given properties?
6) so you think an OS is somehow a subsystem of software running on top of it?
I'm kinda tired of this; you were mostly not wrong in the beginning, but now you're acting like I'm trying to attack you