Is this not kind of trivial, at least the way they've defined it? It seems very obviously NP-complete to me. Any of these text problems where you're trying to optimize over a whole corpus is not hard to see to be NP-complete, similar to longest common subsequence.
Longest common subsequence can be solved in at most quadratic time via dynamic programming.
Longest common subsequence for an arbitrary number of input sequences is NP-hard: https://en.wikipedia.org/wiki/Longest_common_subsequence
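Both replies are consistent: for two sequences the textbook dynamic program is quadratic, while the natural generalization needs one table dimension per input sequence, which is one intuition for why the problem over arbitrarily many sequences is hard. A minimal sketch of the two-sequence DP, purely illustrative (not from the paper):

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b, O(len(a) * len(b))."""
    prev = [0] * (len(b) + 1)  # prev[j] = LCS length of the prefix of a seen so far and b[:j]
    for ch_a in a:
        curr = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr[j] = prev[j - 1] + 1            # extend a common subsequence
            else:
                curr[j] = max(prev[j], curr[j - 1])  # drop a character from a or from b
        prev = curr
    return prev[-1]

print(lcs_length("abcde", "axcye"))  # 3, e.g. "ace"
```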
Some narrow versions of it can be solved optimally with DP, but when the constraints are lifted, you will probably have to pay for it with either exponential memory or time.
The same applies to the Knapsack problem: you can solve some variants of it with DP, but that won't generalize.
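For reference on the Knapsack point: the classic 0/1 DP runs in time proportional to the numeric capacity, not to the number of bits needed to write it down, i.e. it is pseudo-polynomial, which is the sense in which it "won't generalize". A minimal sketch, again purely illustrative:

```python
def knapsack_max_value(weights, values, capacity):
    """0/1 knapsack via DP over capacities: O(n * capacity) time, pseudo-polynomial."""
    best = [0] * (capacity + 1)  # best[c] = max value achievable with total weight <= c
    for w, v in zip(weights, values):
        # Sweep capacities downwards so each item is used at most once.
        for c in range(capacity, w - 1, -1):
            best[c] = max(best[c], best[c - w] + v)
    return best[capacity]

print(knapsack_max_value([3, 4, 5], [30, 50, 60], 8))  # 90: take the items of weight 3 and 5
```

Adding one bit to the capacity doubles the worst-case table size, so this is not polynomial in the size of the input encoding.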
Longest common substring is linear time (e.g. via a generalized suffix tree).
You may have misread the parent comment. Longest common substring is not the same type of problem as longest common subsequence: a subsequence need not be contiguous, so "ace" is a common subsequence of "abcde" and "axcye", while their longest common substring is just a single character.
For those who were wondering what this means:
> We still do not know, for instance, what makes a good tokeniser (Gowda and May, 2020; Cognetta et al., 2024): which characteristics should its produced subwords `s` have to be a good starting point for language modelling? If we knew this, then we could define an objective function which we could evaluate tokenisers with.
I don't see how the authors get past this true general statement from the first paragraph of the introduction. Finding a good tokenizer is not just NP-hard; we have no idea how hard it might be because we don't have theoretical agreement on what "good" means.
In order to have something to prove, the authors decide (somewhat arbitrarily):
> Specifically, we focus on finding tokenisers that maximise the compression of a text. Given this objective, we then define the tokenisation problem as the task of finding a tokeniser which compresses a dataset to at most δ symbols.
Is a tokenizer that maximizes the compression of text (e.g. by identifying longer tokens that tend to be used whole) necessarily a better tokenizer, in terms of overall model performance? Compression might be a useful property for an objective function to consider... but then again maybe not, if it makes the problem NP-hard.
I'm also not sure how realistic the limitation to "at most δ symbols" is. That limit is undeniably useful for making the proof of NP-completeness go through, because it plays the same role as the threshold in the decision version of MAX-2-SAT (is there an assignment satisfying at least k clauses?). But why not just keep adding tokens as needed, rather than imposing any preordained limit? IIRC OpenAI's tokenizer has a vocabulary of around 50k subword strings. When that tokenizer was being designed, I don't imagine they would have worried much if the final number had been 60k or even 100k. How could you possibly choose a meaningful δ from first principles? (A toy sketch after this comment shows concretely what a bound like δ would be counting.)
To put that point a different way, imagine the authors had proven NP-completeness by reduction from the Knapsack Problem, where the knapsack you're packing has some maximum capacity. If you can easily swap your knapsack out for a larger knapsack whenever it gets (close to) full, then the problem becomes trivial.
If the authors had managed to prove that any arbitrary objective function leads to an NP-hard tokenizer optimization problem, then their result would be more general. If the paper proves that somewhere, I missed it.
I suppose this paper suggests "here be dragons" in an interesting if incomplete way, but I would also say there's no need to hurt yourself with an expensive optimization problem when you're not even sure it delivers the results you want.
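To make the compression objective in the quoted definition concrete, here is a toy BPE-style greedy sketch; it is my own illustration, not the paper's construction or any production tokenizer. Each merge adds one entry to the vocabulary, and the length of the resulting symbol sequence is the kind of count that "at most δ symbols" would bound; the paper's hardness result is about finding an optimal vocabulary for that count, which this greedy loop does not attempt.

```python
from collections import Counter

def greedy_merge_tokenize(corpus: str, max_new_tokens: int):
    """Toy BPE-style compression: repeatedly merge the most frequent adjacent pair.

    Returns the final symbol sequence; len(result) is the 'compressed size'
    that a bound like delta would constrain. Illustrative only.
    """
    symbols = list(corpus)  # start from individual characters
    for _ in range(max_new_tokens):
        pairs = Counter(zip(symbols, symbols[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break  # no pair occurs often enough to be worth a new token
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                merged.append(a + b)  # replace the pair with one merged symbol
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

toks = greedy_merge_tokenize("the theory of the thing", max_new_tokens=10)
print(len(toks), toks)  # compressed symbol count, then the symbols themselves
```

Here `max_new_tokens` plays the role of the vocabulary budget discussed above (the ~50k figure), while `len(toks)` is the compressed size that the paper's δ bounds.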
For a second I thought this was related to DeFi... and then I read the abstract.