kelseyfrog 19 hours ago

Tokenizers aren't considered the "sexy" part of LLMs, but where others see boring, I see opportunity. Papers like xVal [1] point toward specialization strategies in tokenization. Spelling and letter tasks are another problem that could benefit from innovation in tokenization.

LLMs are notoriously bad at counting letters in words or performing simple oulipos of letter omission. GPT-4o, for example, writes a small Python program and executes it in order to count letter instances. We all know that tokenization effectively erases knowledge about letters in prompts and directly hurts performance on these tasks, yet we haven't found a way to solve it.
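
The program it writes is roughly this trivial (a sketch, not GPT-4o's literal output):

  from collections import Counter

  word = "strawberry"
  counts = Counter(word.lower())
  print(counts["r"])  # 3 -- easy in code, hard through tokens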

1. https://ar5iv.labs.arxiv.org/html/2310.02989

  • bunderbunder 16 hours ago

    This was ages ago, in the pre-transformer era, and I can't find the link anymore. But once upon a time I read a great paper that demonstrated that most of the performance differences being reported among popular embedding models of the time were better explained by text cleaning and tokenization than they were by the embedding model itself.

    In other words, if you train a model using word2vec's preprocessing and GloVe's algorithm, the result looks more like a "standard-issue" word2vec model than a "standard-issue" GloVe model.

    • authorfly 2 hours ago

      Yes, those models were sensitive to preprocessing, far more so than transformers.

      However, Word2Vec and GloVe were fundamentally different; when used as designed, GloVe worked better pretty uniformly.

  • screye 18 hours ago

    Tokenizers face an odd compute issue.

    Since they're part of the pre-processing pipeline, you can't quickly test them for effectiveness; you have to restart a pretraining run to measure downstream impact.

    Separately,

    As much as an attention module can do universal nonlinear transformations... I wonder if it makes sense to add specific modules for some math primitives as well. I remember that the executor paper [1] (a slight precursor to the "Attention Is All You Need" paper) created self-contained modules for operations like less-than, count, and sum, and then explicitly orchestrated them in the decoder.

    I'm surprised we haven't seen such solutions produce SOTA results from the math-AI or code-AI research communities.

    [1] https://arxiv.org/abs/1705.03633

  • IncreasePosts 19 hours ago

    What's the issue with character-level tokenization (I assume this would be much better at count-the-letter tasks)? The article mentions it as an option but doesn't talk about why subword tokenization is preferred by most of the big LLMs out there.

    • stephantul 19 hours ago

      Using subwords makes your sequences shorter, which makes them cost less.

      Besides that, for alphabetic languages, there exists almost no relation between form and meaning. For example, “ring” and “wing” differ by one letter but have no real common meaning. By picking the character or byte as your unit of representation, the model basically has to learn to distinguish ring and wing in context. This is a lot of work!

      So, while working on the character or byte level saves you some embeddings and thus makes your model smaller, it puts all of the work of distinguishing similar sequences with divergent meanings on the model itself, which means you need a larger model.

      By having subwords, a part of this distinguishing work already has been done by the vocabulary itself. As the article points out, this sometimes fails.
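
      On the first point, you can see the length difference directly (tiktoken here is just one example encoder; exact counts depend on the vocabulary):

        import tiktoken

        enc = tiktoken.get_encoding("cl100k_base")
        text = "Tokenization erases character-level information."

        subword_len = len(enc.encode(text))  # number of subword tokens
        char_len = len(text)                 # what a character-level model would see

        print(subword_len, char_len)         # the subword sequence is several times shorter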

      • sundarurfriend 14 hours ago

        > Besides that, for alphabetic languages, there exists almost no relation between form and meaning.

        Also true for abugida-based languages, e.g. சரம் (saram = string) vs மரம் (maram = tree), and many more. I think your intention with specifying "alphabetic languages" was to say "non-logographic languages", right?

        • bunderbunder 14 hours ago

          I'll do you one more and say "non-Chinese languages". Written Japanese - including the kanji portion of the script - has the same characteristic.

          And even in Chinese it's a fairly weak relationship. A large portion of the meanings of individual characters come from sound loan. For example, the 英 in 英雄 means "hero", in 英语 means "England", and in 精英 means "flower". The relationship there is simple homophony.

          On the other hand, one thing you do get with written Chinese is that "1 character = 1 morpheme" very nearly works. So mechanistically breaking a text into a sequence of morphemes can be done pretty reliably without the aid of a semantic model or exhaustive hard-coded mapping. I think that for many other languages you can't even get close using only syntactic analysis.

          • thaumasiotes 11 hours ago

            > I'll do you one more and say "non-Chinese languages". Written Japanese - including the kanji portion of the script - has the same characteristic.

            Written Japanese is much more ideographic than written Chinese. Japanese spelling is determined, such as it is, by semantics. Chinese spelling is determined by sound. Thus, 女的, 娘们, and 妮子, all meaning 'girl' or 'woman', have no spelling in common because they are different words, while Japanese uses 女 for "jo" and "onna" despite a total lack of any relationship between those words.

        • stephantul 8 hours ago

          I was trying to say “at least for alphabetic languages”. I don’t like to say things about languages I can’t speak or write. So, no, it wasn’t my intention to say “non-logographic languages”

      • p1esk 15 hours ago

        Has anyone tried to combine a token embedding with some representation of the characters in the (sub)word? For example, use a 512-dimensional vector to represent a token and reserve the last 12 values to spell out the word.

        • mattnewton 14 hours ago

          I'm not following - spell out the word how? Like put the actual bytes as numerical input to the transformer layer?

          • p1esk 13 hours ago

            Yes

            • stephantul 8 hours ago

              Not that I know of, but encoding orthography in a fixed-width vector usually carries the assumption that words with the same prefix are more similar. So there’s an alignment problem. You usually solve this using dynamic programming, but that doesn’t work in a vector.

              For example “parent” and “parents” are aligned, they share letters in the same position, but “skew” and “askew” share no letters in the same position.

              • p1esk 7 hours ago

                The other 500 values in the skew/askew vectors will be similar though. The 12 character values don’t need to be aligned, their function is to provide spelling. Adding such info will probably help LLM answer questions requiring character level knowledge (e.g. counting ‘r’s in ‘strawberry’).
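
                Something like this toy sketch (the shapes and the padding scheme are just illustrative assumptions):

                  import numpy as np

                  def token_vector(token_embedding, token_text, spell_dims=12):
                      # keep the first 500 learned dimensions, overwrite the tail with character codes
                      vec = token_embedding.copy()                    # shape (512,)
                      codes = [ord(c) / 128.0 for c in token_text[:spell_dims]]
                      codes += [0.0] * (spell_dims - len(codes))      # pad short tokens
                      vec[-spell_dims:] = codes
                      return vec

                  emb = np.random.randn(512).astype(np.float32)       # stand-in for a learned embedding
                  print(token_vector(emb, "straw")[-12:])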

        • RicoElectrico 15 hours ago

          Well, fastText uses character n-grams to compute embeddings for out-of-vocabulary words. This is pre-transformers work BTW.

          • p1esk 8 hours ago

            IIRC, overlapping n-gram vectors are summed to form the token embedding - doesn’t that effectively destroy any character-level representation of the token? Doesn’t really make sense to me.

            • stephantul 8 hours ago

              It works because they use really large n-gram sizes, up to 6, so most character-level information is in these subwords.

              • p1esk 7 hours ago

                Let’s say we want to use 6-grams and build an embedding vector for the word “because”: we add integer vectors for “becaus” and “ecause”, right? For example: [1,2,3,4,5,6] + [2,3,4,5,6,2] = [3,5,7,9,11,8]. Obviously we cannot use this resulting numerical vector to spell the input word. Pretty much all character level info is lost.

      • bunderbunder 16 hours ago

        I suspect that the holy grail here is figuring out how to break the input into a sequence of morphemes and non-morpheme lexical units.

        • thaumasiotes 11 hours ago

          What do you mean by non-morpheme lexical units? Syntactic particles, units too small to be morphemes? Lexical items that contain multiple morphemes?

          In either case, isn't this something we already do well?

    • SEGyges 19 hours ago

      Tokens are on average four characters, and the number of residual streams (and therefore RAM) the LLM allocates to a given sequence is proportional to the number of units of input. The FLOPs are proportional to its square in the attention calculation.

      You can hypothetically try to ameliorate this by other means, but if you just naively drop from tokenization to character- or byte-level models, this is what goes wrong.

      • p1esk 6 hours ago

        4x seq length expansion doesn’t sound that bad.

        • lechatonnoir 3 hours ago

          I mean, it's not completely fatal, but it means an approximately 16x increase in attention cost (4 squared), if I'm not mistaken. That's probably not worth it just to solve letter counting in most applications.

    • Centigonal 19 hours ago

      I think it has to do with both performance (smaller tokens mean more tokens per sentence read and more runs per sentence generated) and with how embeddings work. You need a token for "dog" and a token for "puppy" to represent the relationship between the two as a dimension in latent space.

    • cma 17 hours ago

      Context-length compute and memory scale as N^2. Smaller tokens mean worse scaling, up to a point.

  • kaycebasques 18 hours ago

    > but where others see boring, I see opportunity

    I feel this way about embeddings

    This line of thought seems related to the old wisdom of finding innovative solutions by mucking around in the layer below whatever the "tools of the trade" are for your domain.

  • doctorpangloss 17 hours ago

    > LLMs are notoriously bad at counting letters in words or performing simply oulipos of letter omission.

    If it were so simple, why hasn’t this already been dealt with?

    Multimodal VQA models also have had a hard time generalizing counting. Counting is not as simple as changing the tokenizer.

    • kelseyfrog 17 hours ago

      I'm saying the oulipo rule is simple, not the task, given current tokenization methods.

    • danielmarkbruce 12 hours ago

      Should the number 23 be tokenized as one token or two tokens?

      • doctorpangloss 8 hours ago

        It doesn’t matter. The challenge with counting doesn’t have to do with tokenization. Why this got into the zeitgeist, I don’t know.

        • imtringued 5 hours ago

          No LLM struggles with two-digit arithmetic. 100-digit addition is possible with state-of-the-art position encodings. Counting is not bottlenecked by arithmetic at all.

          When you ask an LLM to count the number of "r" in the word Strawberry, the LLM will output a random number. If you ask it to separate the letters into S t r a w b e r r y, then each letter is tokenized independently and the attention mechanism is capable of performing the task.

          What you are doing is essentially denying that the problem exists.

      • tomrod 12 hours ago

        We already solved that with binary representation ;-)

      • thaumasiotes 11 hours ago

        Two. That's the reality.

        You interpret the token sequence by constructing a parse tree, but that doesn't require you to forget that the tokens exist.

        • danielmarkbruce 10 hours ago

          If you use standard BPE, you likely won't tokenize every number by its digits, depending on the data set used to create the tokenizer.

          The point is, you have a choice. You can do the tokenization however you like. The reason 23 is interesting is that there is a case to be made that a model will more likely understand 23 is related to Jordan if it's one token, and if it's two tokens it's more difficult. The opposite is true for math problems.

          The reality is whatever we want to make it. It's likely that current schemes are... suboptimal. In practice it would be great if every token were geometrically well spaced after embedding and preserved semantic information, among other things. The "other things" have taken precedence thus far.

Joker_vD 18 hours ago

> You need to understand [the input data] before you can do anything meaningful with it.

IMHO that's the main reason people turn to any sort of automated data-processing tools in the first place: they don't want to look at the input data. They'd rather have "the computer" look at it and maybe query them back with some additional info gathering requests. But thinking on their own? Ugh.

So I boldly propose the new definition of AGI: it's the data-processing entity that will (at last!) reliably liberate you from having to look at your data before you start shoving this data into that processing entity.

  • bunderbunder 16 hours ago

    Over the past year I've encountered so many situations where a person's opinion of how well an LLM accomplishes a task actually says more about that person's reading comprehension skills than it does the LLM's performance. This applies to both positive and negative opinions.

HanClinto 15 hours ago

I really appreciated this blog post, and in particular I appreciated the segment talking about typos.

We were discussing this earlier this week -- I'm helping with a RAG-like application for a project right now, and we're concerned with how much small typos or formatting differences in users' queries can throw off our embedding distances.

One thought was: should we be augmenting our training data (or at the very least, our pretraining data) with intentional typos / substitutions / capitalizations, just to help it learn that "wrk" and "work" are probably synonyms? I looked briefly around for typo augmentation for (pre)training and didn't see anything at first blush, so I'm guessing that if this is a common practice, it's called something else.
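
If we do try it, the augmenter itself would be simple; a minimal sketch (the noise types and rates are guesses, not a known-good recipe):

  import random

  def add_typos(text, rate=0.05, rng=random.Random(0)):
      out = []
      for ch in text:
          r = rng.random()
          if r < rate / 3:
              continue                   # drop the character ("work" -> "wrk")
          elif r < 2 * rate / 3:
              out.append(ch.swapcase())  # random capitalization change
          elif r < rate:
              out.append(ch + ch)        # doubled character
          else:
              out.append(ch)
      return "".join(out)

  print(add_typos("I have received the wrong package"))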

  • authorfly 2 hours ago

    Check out the training data. Sentence-transformer models' training data includes lots of typos, and this is desirable. There was debate around training/inference with stemmed/post-processed words for a long time.

    Typos should minimally impact your RAG.

  • tmikaeld 15 hours ago

    I work with full-text search, where this is common. Here are some points.

    Stemming: Reducing words to their base or root form (e.g., “working,” “worked” becoming “work”).

    Lemmatization: Similar to stemming, but more sophisticated, accounting for context (e.g., “better” lemmatizes to “good”).

    Token normalization: Standardizing tokens, such as converting “wrk” to “work” through predefined rules (case folding, character replacement).

    Fuzzy matching: Allowing approximate matches based on edit distance (e.g., “wrk” matches “work” due to minimal character difference).

    Phonetic matching: Matching words that sound similar, sometimes used to match abbreviations or common misspellings.

    Thesaurus-based search: Using a predefined list of synonyms or alternative spellings to expand search queries.

    Most of these have open and free lists you can use; check the sources of Manticore Search, for example.
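
    A quick sketch of two of these (NLTK's Porter stemmer for stemming, stdlib difflib standing in for fuzzy matching):

      from difflib import get_close_matches
      from nltk.stem import PorterStemmer

      stemmer = PorterStemmer()
      print([stemmer.stem(w) for w in ["working", "worked", "works"]])  # all reduce to "work"

      vocabulary = ["work", "word", "worm", "package"]
      print(get_close_matches("wrk", vocabulary, n=1, cutoff=0.6))      # -> ["work"]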

    • soared 13 hours ago

      Porter stemming is currently widely used in adtech for keywords.

    • thaumasiotes 11 hours ago

      > Lemmatization: Similar to stemming, but more sophisticated, accounting for context (e.g., “better” lemmatizes to “good”).

      I don't understand. How is that different from stemming? What's the base form of "better" if not "good"? The nature of the relationship between "better" and "good" is no different from that between "work" and "worked".

      • authorfly 2 hours ago

        Stemming is basically rules based on the characters. It came first.

        This is because most words in most languages follow patterns of affixes (e.g. worse/worst, harder/hardest), but not always (good/better/best).

        The problem was that word/term-frequency-based modelling would inappropriately fail to link terms that actually had the same root (stem).

        Stemming removed those affixes, so it turned "worse and worst" into "wor and wor", "harder/hardest" into "hard", etc.

        However, it failed for cases like good/better.

        Lemmatizing took in a larger context and built up databases of word senses linking such cases, to process words more reliably. So lemmatizing is rules-based, plus more.

        • thaumasiotes an hour ago

          > So lemmatizing is rules based, plus more.

          Fundamentally, the rule of lemmatizing is that you encounter a word, you look it up in a table, and your output is whatever the table says. There are no other rules. Thus, the lemma of seraphim is seraph and the lemma of interim is interim. (I'm also puzzled by your invocation of "context", since this is an entirely context-free process.)

          There has never been any period in linguistic analysis or its ancestor, philology, in which this wasn't done. The only reason to do it on a computer is that you don't have a digital representation of the mapping from token to lemma. But it's not an approach to language processing, it's an approach to lack of resources.

  • andix 14 hours ago

    For queries there is an easy solution: give the question/search term to an LLM and let it rephrase it. A lot of basic RAG examples do that.

    This might also work for indexing your data, but has the potential to get really expensive quickly.

  • bongodongobob 14 hours ago

    I'm glad this is mentioned. I've suspected that using correct grammar, punctuation and spelling greatly impacts response quality. It's hard to measure objectively, so I've just decided to write my prompts in perfect English just to be sure. I have a friend who prompts like he texts, and I've always felt he was getting lower quality responses. Not unusable, just a little worse, and he needs to correct it more.

cranium 6 hours ago

I finally understood the weirdness of tokenizers after watching the video Andrej Karpathy made: "Let's build the GPT Tokenizer" (https://www.youtube.com/watch?v=zduSFxRajkE).

He goes through why we need them instead of raw byte sequences (too expensive) and how the Byte Pair Encoding algorithm works. Worth spending the two hours for a deeper understanding if you deal with LLMs.

yoelhacks 18 hours ago

I used to work on an app that very heavily leaned on Elasticsearch to do advanced text querying for similarities between a 1-2 sentence input and a corpus of paragraph+ length documents.

It was fascinating how much tokenization strategies could affect a particular subset of queries. A really great example is "W-4" vs "W4". Standard tokenization might split on the "-" or on letter/number boundaries. That input then becomes completely unidentifiable in the index, when it otherwise would have been a very rich factor in matching HR / salary / tax related content.
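
A simplified illustration of the failure mode (a plain regex standing in for the analyzer; real Elasticsearch analyzers differ in the details):

  import re

  def naive_tokenize(text):
      # split on anything that isn't a letter or digit, like many "standard" analyzers
      return [t.lower() for t in re.split(r"[^A-Za-z0-9]+", text) if t]

  print(naive_tokenize("Please fill out your W-4 form"))
  # ['please', 'fill', 'out', 'your', 'w', '4', 'form'] -- "W-4" is gone as a unit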

Different domain, but this doesn't shock me at all.

  • carom 16 hours ago

    The trained embedding vectors for the token equivalents of W4 and W-4 would be mapped to a similar space due to their appearance in the same contexts.

    • dangerlibrary 13 hours ago

      The point of the GP post is that the "w-4" token had very different results from ["w", "-4"] or similar splits where the "w" and "4" wound up in separate tokens.

  • AStrangeMorrow 12 hours ago

    Yes, I used to work on a system that had Elasticsearch and also some custom Word2Vec models. What had the most impact on the quality of the ES search and on the quality of our W2V model was tokenization and a custom n-grams system.

Xenoamorphous 18 hours ago

> One of the things I noticed over the past year is how a lot of developers who are used to developing in the traditional (deterministic) space fail to change the way they should think about problems in the statistical space which is ultimately what LLM apps are.

I’m a developer and don’t struggle with this; where I really struggle is trying to explain it to users.

maytc 5 hours ago

The difference in the dates example seems right to me: 20 October 2024 and 2024-20-10 are not the same.

Dates in different locales can be written as yyyy-MM-dd; the second string could also be a catalog/reference number. So it seems right that their embedding similarity is not perfectly aligned.

So, it's not a tokenizer problem. The text meant different things according to the LLM.

bcherry 18 hours ago

It's kind of interesting because I think most people implementing RAG aren't even thinking about tokenization at all. They're thinking about embeddings:

1. chunk the corpus of data (various strategies but they're all somewhat intuitive)

2. compute embedding for each chunk

3. generate search query/queries

4. compute embedding for each query

5. rank corpus chunks by distance to query (vector search)

6. construct return values (e.g. chunk + surrounding context, or whole doc, etc.)

So this article really gets at the importance of a hidden, relatively mundane-feeling operation that can have an outsized impact on the performance of the system. I do wish it had more concrete recommendations in the last section, and a code sample of a robust project with normalization, fine-tuning, and evals.
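
For what it's worth, a bare-bones sketch of steps 2-5 with sentence-transformers (no normalization or fine-tuning, which is exactly the gap the article points at):

  from sentence_transformers import SentenceTransformer, util

  model = SentenceTransformer("all-MiniLM-L6-v2")

  chunks = [
      "How to file your W-4 with HR",
      "Office coffee machine maintenance schedule",
  ]
  chunk_embeddings = model.encode(chunks)                    # step 2

  query = "updating tax withholding paperwork"
  query_embedding = model.encode([query])                    # step 4

  scores = util.cos_sim(query_embedding, chunk_embeddings)   # step 5
  best = int(scores[0].argmax())
  print(chunks[best], float(scores[0][best]))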

r_hanz 10 hours ago

Very nicely written article. Personally, I find RAG (and, more abstractly, vector search) the only mildly interesting development in the latest LLM fad, and have always felt that LLMs sit way too far down the diminishing-returns curve to be interesting. However, I can’t believe tokenization, and embeddings in general, are not broadly considered the most paramount aspect of all deep learning. The latent space your model captures is the most important part of the whole pipeline, or else what is any deep learning model even doing?

halyax7 17 hours ago

An issue I've seen in several RAG implementations is assuming that the target documents, however cleverly they're chunked, will be good search keys for incoming queries. Unless your incoming search text looks semantically like the documents you're searching over (not the case in general), you'll get bad hits. On a recent project, we saw a big improvement in retrieval relevance when we separated the search keys from the returned values (chunked documents), and we used an LM to generate appropriate keys, which were then embedded. "Appropriate" in this case means "sentences like what the user might input if they're expecting this chunk back".

  • marlott 15 hours ago

    Interesting! So you basically got an LM to rephrase the search phrase/keys into the style of the target documents, then used that in the RAG pipeline? Did you do an initial search first to limit the documents?

    • NitpickLawyer 14 hours ago

      IIUC they're doing some sort of "q/a" for each chunk from documents, where they ask an LLM to "play the user role and ask a question that would be answered by this chunk". They then embed those questions, and match live user queries with those questions first, then maybe re-rank on the document chunks retrieved.

andix 14 hours ago

This is an awesome article, but I’m missing the part where solutions for each of the problems were discussed.

Run a spell check before tokenizing? Maybe even tokenize the misspelled word and the potential corrected word next to each other, like „misspld (misspelled)“?
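
Maybe something like this crude sketch (difflib against a tiny vocabulary is just a stand-in for a real spell checker):

  from difflib import get_close_matches

  VOCAB = ["misspelled", "received", "package", "work"]

  def expand_typos(query):
      out = []
      for word in query.split():
          match = get_close_matches(word, VOCAB, n=1, cutoff=0.8)
          if match and match[0] != word:
              out.append(f"{word} ({match[0]})")  # keep both forms, as suggested above
          else:
              out.append(word)
      return " ".join(out)

  print(expand_typos("misspld word"))  # -> "misspld (misspelled) word"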

For the issue with the brand names the tokenizer doesn’t know, I have no idea how to handle it. This problem is probably even worse in less common languages, or in languages which use a lot of compound words.

ratedgene 18 hours ago

Can someone expand on this?

> Chunking is more or less a fixable problem with some clever techniques: these are pretty well documented around the internet;

Curious about what chunking solutions are out there for different sets of data/problems

  • hansvm 16 hours ago

    It's only "solved" if you're okay with a 50-90% retrieval rate or have particularly nice data. There's a lot of stuff like "referencing the techniques from Chapter 2 we do <blah>" in the wild, and any chunking solution is unlikely to correctly answer queries involving both Chapter 2 and <blah>, at least not without significant false positive rates.

    That said, the chunking people are doing is worse than the SOTA. The core thing you want to do is understand your data well enough to ensure that any question, as best as possible, has relevant data within a single chunk. Details vary (maybe the details are what you're asking for?).

  • pphysch 17 hours ago

    Most data has semantic boundaries: whether tokens, words, lines, paragraphs, blocks, sections, articles, chapters, versions, etc. and ideally the chunking algorithm will align with those boundaries in the actual data. But there is a lot of variety.

quirkot 16 hours ago

Is this true?

>> Do not panic! A lot of the large LLM vocabularies are pretty huge (30k-300k tokens large)

Seems small by an order of magnitude (at least). English alone has over a million words.

  • macleginn 14 hours ago

    Most of these 1+ million words are almost never used, so 200k is plenty for English. Optimistically, we hope that rarer words would be longer and to some degree compositional (optim-ism, optim-istic, etc.), but unfortunately this is not what tokenisers arrive at (and you are more likely to get "opt-i-mis-m" or something like that). People have tried to optimise tokenisation and the main part of LLM training jointly, which leads to more sensible results, but this is unworkable for larger models, so we are stuck with inflated basic vocabularies.
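
    It's easy to check what a given vocabulary actually does (tiktoken here as one example; the splits you get will differ by encoder):

      import tiktoken

      enc = tiktoken.get_encoding("cl100k_base")
      for word in ["optimism", "optimistically"]:
          pieces = [enc.decode([t]) for t in enc.encode(word)]
          print(word, pieces)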

    It is also probably possible now to go for even larger vocabularies, in the 1-2 million range (by factorising the embedding matrix, for example), but this does not lead to noticeable improvements in performance, AFAIK.

    • Der_Einzige 13 hours ago

      Performance would be massively improved on constrained text tasks. That alone makes it worth it to expand the vocabulary size.

  • mmoskal 16 hours ago

    Tokens are often sub-word, all the way down to bytes (which are implicitly understood as UTF-8, but models will sometimes generate invalid UTF-8...).

  • spott 9 hours ago

    BPE is complete. Every valid Unicode string can be encoded with any BPE tokenizer.

    BPE basically starts with a token for every possible byte value and then creates new tokens by merging common pairs of existing tokens (‘t’ followed by ‘h’ becomes a new token ’th’).
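
    A toy version of one training step (byte/character-level start, merge the most frequent adjacent pair into a new token):

      from collections import Counter

      def merge_most_frequent(tokens):
          # count adjacent pairs and fuse the most common one everywhere it occurs
          pairs = Counter(zip(tokens, tokens[1:]))
          (a, b), _ = pairs.most_common(1)[0]
          merged, i = [], 0
          while i < len(tokens):
              if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
                  merged.append(a + b)
                  i += 2
              else:
                  merged.append(tokens[i])
                  i += 1
          return merged

      tokens = list("the thin thing")   # start from single characters
      for _ in range(3):
          tokens = merge_most_frequent(tokens)
      print(tokens)                     # 'th' gets merged first, then larger pieces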

Spivak 19 hours ago

I think I take something different away from the article: yes, tokenizers are important, but they're a means to get at something much, much bigger, which is how to clean up and normalize unstructured data. It's a current endeavor of mine at $dayjob to do this in a way that can work reasonably well even for badly mangled documents. I don't have any silver bullets, at least nothing worthy of a blog post yet, but since this is needed when dealing with OCR'd documents, searching for "post-OCR correction" turns up quite a few different approaches.

And this is an aside, but I see folks using LLMs to do this correction in the first place. I don't think using LLMs to do correction in a multi-pass system is inherently bad, but I haven't been able to get good results out of "call/response" (i.e. a prompt to clean up this text). The best results come when you're running an LLM locally and cleaning incrementally, using token probabilities to help guide you. You get some candidate words from your wordlist based on a fuzzy match of the text you do have, and candidate words predicted from the previous text, and when both align -- ding! It's (obviously) not the fastest method, however.
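
The shape of it is roughly this (lm_next_word_candidates is a placeholder for however you pull next-word candidates out of the local model):

  from difflib import get_close_matches

  WORDLIST = ["receive", "received", "package", "wrong"]

  def lm_next_word_candidates(context):
      # placeholder: in practice, the top-k next words predicted by a local LLM
      return {"received", "the"}

  def correct_word(context, ocr_word):
      fuzzy = set(get_close_matches(ocr_word, WORDLIST, n=5, cutoff=0.7))
      predicted = lm_next_word_candidates(context)
      agreed = fuzzy & predicted           # both sources align -> ding!
      return agreed.pop() if agreed else ocr_word

  print(correct_word("I have", "rec3ived"))  # -> "received"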

  • SEGyges 19 hours ago

    You might have better luck giving the LM the original document and having it generate its own OCR independently, then asking the LLM to tiebreak between its own generation and the OCR output, while the image is still in the context window, until it is satisfied that it got things correct.

  • 7thpower 18 hours ago

    This is interesting. What types of content are you using this approach on and how does it handle semi structured data? For instance, embedded tables.

woolr 15 hours ago

Can't repro some of the numbers in this blog post, for example:

  from sentence_transformers import SentenceTransformer
  from sentence_transformers import util

  model = SentenceTransformer('all-MiniLM-L6-v2')

  data_to_check = [
    "I have recieved wrong package",
    "I hve recieved wrong package"
  ]
  embeddings = model.encode(data_to_check)
  util.cos_sim(embeddings, embeddings)

Outputs:

  tensor([[1.0000, 0.9749],
        [0.9749, 1.0000]])

  • 1986 15 hours ago

    Your data differs from theirs: they have "I have received wrong package" vs "I hve received wrong pckage". You misspelled "received" in both and didn't omit the "a" from "package" in the "bad" data.