js8 3 days ago

Not pixels, but percels. Pixels are points in the image, while a "percel" is a unit of perceptual information. It might be a pixel with an associated sound, at a given moment in time. In the case of humans, percels include other senses as well, and they can also be annotated with your own thoughts (i.e. percels can also include tokens or embeddings).

Of course, NNs like LLMs never process a percel in isolation, but always as a group of neighboring percels (aka context), with an initial focus on one of the percels.
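
To make the idea concrete, here's a minimal sketch of a percel as a record (purely illustrative; every field name here is hypothetical):

    from dataclasses import dataclass, field

    @dataclass
    class Percel:
        rgb: tuple[float, float, float]   # the pixel component
        t: float                          # the moment in time
        audio: float | None = None        # sound sample at the same instant, if any
        annotations: list[int] = field(default_factory=list)  # optional tokens/embeddings

    p = Percel(rgb=(0.2, 0.7, 0.1), t=3.14, audio=0.05, annotations=[42])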

  • almoehi 2 days ago

    I'd written up a proposal for a research grant to work on basically exactly this idea.

    It got reviewed by 2 ML scientists and one neuroscientist.

    Got totally slammed (and thus rejected) by the ML scientists due to „lack of practical application“ and highly endorsed by the neuroscientist.

    There’s so much unused potential in interdisciplinary research but nobody wants to fund it because it doesn’t „fit“ into one of the boxes.

    • behnamoh 2 days ago

      Make sure the ML scientists don't take credit for your work. Sometimes they reject a paper so they can work on it on their own.

      • almoehi 2 days ago

        Grant reviews are blind reviews - so you don’t know. Also - and even worse - there is no rebuttal process. It gets rejected without you having a chance to clarify / convince reviewers.

        Instead you’d need to resubmit and start the entire process from scratch. What a waste of resources …

        It's the final nail that made me quit pursuing a scientific career path, despite having good pubs & a PhD with honours.

        Unfortunately it’s what I enjoy the most.

    • Enginerrrd 2 days ago

      That's unfortunate. My personal sense is that while agentic LLMs are not going to get us close to AGI, a few relatively modest architectural changes to the underlying models might actually do that, and I do think mimicry of our own self-referential attention is a very important component of that.

      While the current AI boom is a bubble, I actually think that AGI nut could get cracked quietly by a company with even modest resources if they get lucky on the right fundamental architectural changes.

      • almoehi 2 days ago

        I agree - and I think an interdisciplinary approach is going to increase the odds here. There is a ton of useful knowledge in related disciplines - often just named differently - that turns out to be investigating the same problem from a different angle.

    • shepardrtc 2 days ago

      Sounds like those ML "scientists" were actually just engineers.

      • verdverm 2 days ago

        A lot of progress is made through engineering challenges

        This is also "science"

  • falcor84 2 days ago

    I love this idea, but can't find anything about it. Is this a neologism you just coined? If so, is there any particular paper or work that led you to think about it in those terms?

    • js8 2 days ago

      Yes, I just coined the neologism. It was supposed to be partly sarcastic (why stay at pixels, why not just go fully multimodal and treat the missing channels as missing information?), and I am kind of surprised it got so upvoted.

      (IME, often my comments which I think are deep get ignored but silly things, where I was thinking "this is too much trolling or obvious", get upvoted; but don't take it the wrong way, I am flattered you like it.)

      • causal 2 days ago

        Presuming channels can be effectively merged into a single percel vector, that would open up interesting channels even beyond human perception, e.g. lidar. Or it would be interesting to train a model that feels at home in 4D space.

      • jaredhansen 2 days ago

        I think there's a decent chance you may have just created the ideal name for what will become one of the most important concepts ever. Bravo!

      • SJMG 2 days ago

        Deep things often, not always, take more attention to appreciate than the superficial. It's a precious resource people are seldom disposed to allocate a lot of when headline-surfing HN.

      • throwaway-aws9 2 days ago

        Should future attributions in white papers go to js8 from HN?

  • Workaccount2 2 days ago

    Isn't this effectively what the latent space is? A bunch of related vectors that all bundle together?

    • js8 2 days ago

      No, latent space doesn't have to be made of percels, just like not every 2D array of 3-element vectors is an image made of pixels. Percels are tied to your sensors, components of what you perceive, in totality.

      Of course there is an interesting paradox - each layer of the NN doesn't know whether it's connected to the sensors directly, or what kind of abstractions it works with in the latent space. So the boundary between the mind and the sensor is blurred and to some extent a subjective choice.

    • taneq 2 days ago

      “Percel” is still a way cooler and arguably more descriptive term than “token” though.

  • causal 2 days ago

    This is an interesting thought. Trying to imagine how you represent that as a vector.

    You still need to map percels to a latent space. But perhaps with some number of dimensions devoted to modes of perception? E.g. audio, visual, etc

    • milanove 2 days ago

      I'm not an ML expert or practitioner, so someone might need to correct me.

      However, I believe the percel's components together as a whole would capture the state of the audio+visual+time. I don't think the state of one particular mode (e.g. audio or visual or time) is encoded in a specific subset of the percel's components. Rather, each component of the percel would represent a mixture (or a portion of a mixture) of the audio+visual+time. So you couldn't isolate just the audio or visual or time state by looking at some specific subset of the percel's components, because each component is itself a mixture of the audio+visual+time state.

      I think the classic analogy is that if river 1 and river 2 combine to form river 3, you cannot take a cup of water from river 3 and separate out the portions from river 1 and river 2; they're irreversibly mixed.
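
      A tiny numpy sketch of that point (toy dimensions, with a dense matrix standing in for a real NN layer): once the modalities pass through a mixing layer, no subset of the output components corresponds to just one input mode.

          import numpy as np

          rng = np.random.default_rng(0)
          audio, visual, time = rng.normal(size=4), rng.normal(size=4), rng.normal(size=2)
          inputs = np.concatenate([audio, visual, time])  # (10,) stacked modalities

          W = rng.normal(size=(8, 10))  # dense mixing, like a single NN layer
          percel = W @ inputs           # every component blends all three modes
          print(percel)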

  • BrokenCogs 2 days ago

    I was going to say toxel

    • causal 2 days ago

      Like a tokenized 3D voxel?

      • BrokenCogs 2 days ago

        Tokenized pixel. I understand now that's not what js8 was talking about, so my original comment doesn't really make sense

tcdent 3 days ago

"Kill the tokenizer" is such a wild proposition but is also founded in fundamentals.

Tokenizing text is such a hack even though it works pretty well. The state-of-the-art comes out of the gate with an approximation for quantifying language that's wrong on so many levels.

It's difficult to wrap my head around pixels being a more powerful representation of information, but someone's gotta come up with something other than a tokenizer.

  • dgently7 3 days ago

    As a vision-capable person, I consume all text as images when I read, so it kinda passes the "evolution does it that way" test, and maybe we shouldn't be that surprised that vision is a great input method?

    Actually, thinking more about that: I consume "text" as images and also as sounds… Instead of the render-and-OCR approach this suggests, I kinda wonder whether doing TTS and encoding, say, the mp3 sample of the vocalization of the word would be fewer bytes than the rendered-pixels version… probably depends on the resolution / sample rate.

    • visarga 3 days ago

      Funny, I habitually read while engaging TTS on the same text. I have even made a Chrome extension for web reading; it highlights text and reads it, while keeping the current position in the viewport. I find using 2 modalities at the same time improves my concentration. TTS is sped up to 1.5x to match reading speed. Maybe it is just because I want to reduce visual strain. Since I consume a lot of text every day, it can be tiring.

      • fluidcruft 2 days ago

        This feature is also built into Edge (and I agree it's great), but I mostly use it so I can listen to pages while doing chores around the office / with my eyes closed.

        What I would love is an easy way to just convert the page to a mp3 that queues into my podcast app to listen to while taking a walk or driving. It probably exists, but I haven't spent a lot of time looking into it.

      • Version467 3 days ago

        I do this too. It's great. The term I've seen used to describe this is 'Immersion Reading'. It seems to be quite a popular way for neurodivergent people to get into reading.

      • gavinray 2 days ago

        Any chance you could share the source?

        I found that I can read better if individual words or chunks are highlighted in alternating pastel colors while I scan them with my eyes.

      • lukevp 3 days ago

        What’s your extension? Sounds interesting!

        • zirror 2 days ago

          Just FYI, Firefox reader mode does the same thing. It's a little button in the address bar.

          • FergusArgyll 2 days ago

            Reading mode in chrome does this too. Although the tts sounds like it's far behind sota

            • trenchpilgrim 2 days ago

              Probably because it needs to run locally on older CPUs, so it's likely using an old-school phonemizer that will run on a 15 year old computer.

    • psadri 3 days ago

      The pixels-to-sound conversion would pass through "reading", so there might be information loss. It is no longer just pixels.

  • Tarq0n 2 days ago

    Ok but what are you going to decode into at generation time, a jpeg of text? Tokens have value beyond how text appears to the eye, because we process text in many more ways than just reading it.

    • jhanschoo 2 days ago

      There are some concerns here that should be addressed separately:

      > Ok but what are you going to decode into at generation time, a jpeg of text?

      Presumably, the output may still be in token space, but for the purpose of conditioning on context for the immediate next token, it must then be immediately translated into a suitable input space.

      > we process text in many more ways than just reading it

      Since a token stream is a straightforward function of textual input, in the case of textual input we should expect the conversion of the character stream into semantic/syntactic units to happen inside the LLM.

      Moreover, in the case of OCR, graphical information preserves/degrades information in the way that humans expect; what comes to mind is the eggplant/dick emoji symbolism, or smiling emojis possessing a graphical similarity that can't be deduced from proximity in Unicode codepoints.
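
      A quick illustration of that last point: these smileys are near neighbors visually but nowhere near each other in codepoint space.

          for ch in "🙂😀☺":
              print(ch, hex(ord(ch)))
          # 🙂 0x1f642, 😀 0x1f600, ☺ 0x263a - the last is in a different block entirely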

    • samus 2 days ago

      Output really doesn't have to be the same datatype as the input. Text tokens are good enough for a lot of interesting applications, and transforming percels (name suggested by another commenter here) into text tokens is exactly what an OCR model is trained to do anyway.

  • ReptileMan 2 days ago

    I guess it is because of the absurdly high information density of text - so text is quite a good input.

  • naasking 2 days ago

    Using pixels is still tokenizing. What's needed is something more like "Byte Latent Transformers", which use dynamically sized patches based on information content rather than fixed tokens.
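
    A toy sketch of the entropy-driven patching idea (here a unigram frequency table stands in for the small byte-level LM the BLT paper actually uses, and the threshold is an arbitrary choice):

        import math
        from collections import Counter

        def entropy_patches(data: bytes, threshold: float = 5.0):
            freq = Counter(data)
            # Surprise (in bits) of each byte under a unigram model.
            surprise = {b: -math.log2(freq[b] / len(data)) for b in freq}
            patches, cur = [], bytearray()
            for b in data:
                if cur and surprise[b] > threshold:  # surprising byte -> new patch
                    patches.append(bytes(cur))
                    cur = bytearray()
                cur.append(b)
            if cur:
                patches.append(bytes(cur))
            return patches

        # Predictable regions form long patches; rare bytes trigger cuts.
        print(entropy_patches(b"the quick brown fox jumps over the lazy dog"))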

  • esafak 2 days ago

    I do not get it, either. How can a picture of text be better than the text itself? Why not take a picture of the screen while you're at it, so the model learns how cameras work?

    • jerojero 2 days ago

      In a very simple way: because the image can be fed directly into the network without first having to transform the text into a series of tokens as we do now.

      But the tweet itself is kinda an answer to the question you're asking.

    • corysama 2 days ago

      From the paper I saw that the model includes an approximation of the layout, diagrams and other images of the source documents.

      Now imagine growing up only allowed to read books and the internet through a browser with CSS, images and JavaScript disabled. You’d be missing out on a lot of context and side-channel information.

orliesaurus 3 days ago

One of the MOST interesting aspects of the recent discussion on this topic is how it underscores our reliance on lossy abstractions when representing language for machines. Tokenization is one such abstraction, but it's not the only one... using raw pixels or speech signals is a different kind of approximation.

What excites me about experiments like this is not so much that we'll all be handing images to language models tomorrow, but that researchers are pressure-testing the design assumptions of current architectures. Approaches that learn to align multiple modalities might reveal better latent structures or training regimes, and that could trickle back into more efficient text encoders without throwing away a century of orthography.

BUT there's also a rich vein to mine in scripts and languages that don't segment neatly into words: alternative encodings might help models handle those better.

nl 3 days ago

Karpathy's points are correct (of course).

One thing I like about text tokens though is that it learns some understanding of the text input method (particularly the QWERTY keyboard).

"Hello" and "Hwllo" are closer in semantic space than you'd think because "w" and "e" are next to each other.

This is much easier to see in hand-coded spelling models, where you can get better results by including a "keyboard distance" metric along with a string distance metric.
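
For illustration, a rough sketch of such a hand-coded metric (lowercase letters only; the 0.5 substitution weight is an arbitrary assumption): Levenshtein distance where substitutions between physically adjacent keys cost less.

    QWERTY = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
    POS = {c: (r, i) for r, row in enumerate(QWERTY) for i, c in enumerate(row)}

    def key_cost(a, b):
        if a == b:
            return 0.0
        (r1, c1), (r2, c2) = POS[a], POS[b]
        # Neighboring keys (like "w" and "e") are a half-cost substitution.
        return 0.5 if abs(r1 - r2) + abs(c1 - c2) <= 1 else 1.0

    def keyboard_edit_distance(s, t):
        # Standard Levenshtein DP with keyboard-aware substitution costs.
        d = [[float(i + j) if i * j == 0 else 0.0 for j in range(len(t) + 1)]
             for i in range(len(s) + 1)]
        for i in range(1, len(s) + 1):
            for j in range(1, len(t) + 1):
                d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                              d[i - 1][j - 1] + key_cost(s[i - 1], t[j - 1]))
        return d[len(s)][len(t)]

    print(keyboard_edit_distance("hello", "hwllo"))  # 0.5 - adjacent-key typo
    print(keyboard_edit_distance("hello", "hzllo"))  # 1.0 - distant substitution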

  • harperlee 3 days ago

    But assuming that pixel input gets us to an AI capable of reading, it would presumably also be able to detect HWLLO as semantically close to HELLO (similarly to H3LL0, or badly handwritten text - although there would be some graphical structure in these latter examples to help). At the end of the day we are capable of identifying that... It might require some more training effort, but the result would be more general.

  • swyx 3 days ago

    im particularly sympathetic to typo learning, which i think gets lost in the synthetic data discussion (mine here https://www.youtube.com/watch?v=yXPPcBlcF8U )

    but i think in this case you can still generate typos in images and it'd be learnable. not a hard issue relevant to the OP

a_bonobo 3 days ago

Somewhat related:

There's this older paper from Lex Flagel and others where they transform DNA-based text, stuff we'd normally analyse via text files, into images and then train CNNs on the images. They managed to get the CNNs to re-predict population genetics measurements we normally get from the text-based DNA alignments.

https://academic.oup.com/mbe/article/36/2/220/5229930

sabareesh 3 days ago

It might be that our current tokenization is inefficient compared to how well the image pipeline does. Language already does a lot of compression, but there might be an even better way to represent it in latent space.

  • ACCount37 3 days ago

    People in the industry know that tokenizers suck and there's room to do better. But actually doing it better? At scale? Now that's hard.

    • typpilol 3 days ago

      It will require like 20x the compute

      • ACCount37 3 days ago

        A lot of cool things are shot down by "it requires more compute, and by a lot, and we're already compute starved on any day of the week that ends in y, so, not worth it".

        If we had a million times the compute? We might have brute forced our way to AGI by now.

        • Jensson 3 days ago

          But we don't have a million times the compute; we have the compute we have, so it's fair to argue that we want to prioritize other things.

      • Mehvix 3 days ago

        Why do you suppose this is a compute limited problem?

        • ACCount37 3 days ago

          It's kind of a shortcut answer by now. Especially for anything that touches pretraining.

          "Why aren't we doing X?", where X is a thing that sounds sensible, seems like it would help, and does indeed help, and there's even a paper here proving that it helps.

          The answer is: check the paper, it says there on page 12 in a throwaway line that they used 3 times the compute for the new method than for the controls. And the gain was +4%.

          A lot of promising things are resource hogs, and there are too many better things to burn the GPU-hours on.

          • typpilol 3 days ago

            Thanks.

            Also, saying it needs 20x compute is exactly that. It's something we could do eventually but not now

      • kenjackson 3 days ago

        Why so much compute? Can you tie it to the problem?

        • typpilol 3 days ago

          Tokenizers are the reason LLMs are even possible to run at a decent speed on our best hardware.

          Removing the tokenizer would 1/4 the context and 4x the compute and memory, assuming an avg token length of 4.

          Also, you would probably need to 4x the parameters so the model can learn relationships between individual characters as well as words and sentences, etc.

          There's been a few studies on small models, even then those only show a tiny percentage gain over tokenized models.

          So essentially you would need 4x compute, 1/4 the context, and 4x the parameters to squeeze 2-4% more performance out of it.

          And that fails when you use more than 1/4 of the context. So realistically you need to support the same context, and your compute goes up another 4x, to 16x.

          That's why
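
          The back-of-the-envelope version of that arithmetic, assuming ~4 characters per token and quadratic self-attention:

              chars_per_token = 4
              n_tokens = 1000                       # a tokenized context
              n_chars = n_tokens * chars_per_token  # same text, character-level

              print((n_chars / n_tokens) ** 2)  # 16.0 - attention (quadratic) cost ratio
              print(n_chars / n_tokens)         # 4.0  - memory/FFN (linear) cost ratio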

  • CuriouslyC 3 days ago

    Image models use "larger" tokens. You can get this effect with text tokens if you use a larger token dictionary and generate common n-gram tokens, but the current LLM architecture isn't friendly to large output distributions.

    • yorwba 3 days ago

      You don't have to use the same token dictionary for input and output. There are things like simultaneously predicting multiple tokens ahead as an auxiliary loss and for speculative decoding, where the output is larger than the input, and similarly you could have a model where the input tokens combine multiple output tokens. You would still need to do a forward pass per output token during autoregressive generation, but prefill would require fewer passes and the KV cache would be smaller too, so it could still produce a decent speedup.

      But in the DeepSeek-OCR paper, compressing more text into the same number of visual input tokens leads to progressively worse output precision, so it's not a free lunch but a speed-quality tradeoff, and more fine-grained KV cache-compression methods might deliver better speedups without degrading the output as much.

    • mark_l_watson 3 days ago

      Interesting idea! Haven’t heard that before.

bob1029 2 days ago

I think the DCT is a compelling way to interact with spatial information when the channel is constrained. What works for jpeg can likely work elsewhere. The energy compaction properties of the DCT mean you get most of the important information in a few coefficients. A quantizer can zero out everything else. Zig zag scanned + RLE byte sequences could be a reasonable way to generate useful "tokens" from transformed image blocks. Take everything from jpeg encoder except for perhaps the entropy coding step.

At some level you do need something approximating a token. BPE is very compelling for UTF8 sequences. It might be nearly the most ideal way to transform (compress) that kind of data. For images, audio and video, we need some kind of grain like that. Something to reorganize the problem and dramatically reduce the information rate to a point where it can be managed. Compression and entropy is at the heart of all of this. I think BPE is doing more heavy lifting than we are giving it credit for.

I'd extend this thinking to techniques like MPEG for video. All frame types also use something like the DCT too. The P and B frames are basically the same ideas as the I frame (jpeg), the difference is they take the DCT of the residual between adjacent frames. This is where the compression gets to be insane with video. It's block transforms all the way down.

An 8x8 DCT block for a channel of SDR content is 512 bits of raw information. After quantization and RLE (for typical quality settings), we can get this down to 50-100 bits of information. I feel like this is an extremely reasonable grain to work with.
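
A minimal numpy/scipy sketch of that pipeline - 8x8 DCT, quantization with the standard JPEG luminance table, zigzag scan, then run-length "tokens" (everything except the entropy coding step):

    import numpy as np
    from scipy.fft import dctn

    # Standard JPEG luminance quantization table (quality ~50).
    Q = np.array([
        [16, 11, 10, 16, 24, 40, 51, 61],
        [12, 12, 14, 19, 26, 58, 60, 55],
        [14, 13, 16, 24, 40, 57, 69, 56],
        [14, 17, 22, 29, 51, 87, 80, 62],
        [18, 22, 37, 56, 68, 109, 103, 77],
        [24, 35, 55, 64, 81, 104, 113, 92],
        [49, 64, 78, 87, 103, 121, 120, 101],
        [72, 92, 95, 98, 112, 100, 103, 99],
    ])

    def zigzag(block):
        # Standard JPEG zigzag scan order over the anti-diagonals.
        idx = sorted(((i, j) for i in range(8) for j in range(8)),
                     key=lambda p: (p[0] + p[1], p[0] if (p[0] + p[1]) % 2 else -p[0]))
        return np.array([block[i, j] for i, j in idx])

    def rle(coeffs):
        # (zero_run, value) pairs for the AC coefficients, JPEG-style.
        out, run = [], 0
        for c in coeffs[1:]:
            if c == 0:
                run += 1
            else:
                out.append((run, int(c)))
                run = 0
        out.append((0, 0))  # end-of-block marker
        return out

    block = np.clip(np.arange(64).reshape(8, 8) * 3.0, 0, 255)  # stand-in pixels
    coeffs = np.round(dctn(block - 128.0, norm="ortho") / Q)    # shift, DCT, quantize
    print(rle(zigzag(coeffs)))  # a handful of (run, value) "tokens" per block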

  • jacquesm 2 days ago

    I can listen to music in my head. I don't think this is an extraordinary property but it is kind of neat. That hints at the fact that I somehow must have encoded this music. I can't imagine I'm storing the equivalent of a MIDI file, but I also can't imagine that I'm storing raw audio samples because there is just too much of it.

    It seems to work for vocals as well, not just short samples but entire works. Of course that's what I think; there is a pretty good chance they're not 'entire', but it's enough that it isn't just excerpts, and if I were a good enough musician I could replicate what I remember.

    Is there anybody that has a handle on how we store auditory content in our memories? Is it a higher level encoding or a lower level one? This capability is probably key in language development so it is not surprising that we should have the capability to encode (and replay) audio content, I'm just curious about how it works, what kind of accuracy is normally expected and how much of such storage we have.

    Another interesting thing is that it is possible to search through it fairly rapidly to match a fragment heard to one that I've heard and stored before.

    • 0xdeadbeefbabe 2 days ago

      > Is there anybody that has a handle on how we store auditory content in our memories?

      It's so weird that I don't know this. It's like I'm stuck in userland.

shikon7 3 days ago

Seems we're now at a point in time when OCR is doing so well that printing text out and letting computers literally read it is suggested to be superior to processing the encoded text directly.

  • Legend2440 3 days ago

    Neural networks have essentially solved perception. It doesn't matter what format your data comes in, as long as you have enough of it to learn the patterns.

    • Sharlin 2 days ago

      The information density of a bitmap representation of text is just silly low compared to normal textual encodings, even compressed.

  • programmarchy 3 days ago

    PDF is arguably a confusing format for LLMs to read.

ianbutler 3 days ago

https://arxiv.org/abs/2510.17800 (Glyph: Scaling Context Windows via Visual-Text Compression)

You can also see this paper from the GLM team, where they explicitly test this assumption with some pretty good results.

  • scotty79 3 days ago

    I couldn't imagine how rendering text tokens to images could bring any savings, but then I remembered each token is converted into hundreds of floating point numbers before being fed to the neural network. So in a way it's already rendered into a multidimensional pixel (or hundreds of arbitrary 2-dimensional pixels). This paper shows that you don't need that many numbers to keep the accuracy, and that using numbers that represent the text visually (which is pretty chaotic) is just as good as the way we currently do it.

sd9 2 days ago

> more information compression (see paper) => shorter context windows, more efficiency

It seems crazy to me that image inputs (of text) are smaller and more information dense than text - is that really true? Can somebody help my intuition?

  • vjerancrnjak 2 days ago

    It must be the tokenizer. Figuring out words from an image is harder (edges, shapes, letters, words, ...), yet internal representations are more efficient.

    I always found it strange that tokens can't just be symbols but instead there's an alphabet of 500k tokens, completely removing low-level information from language (rhythm, syllables, etc.), a side effect being simple edge cases like the "2 r's in strawberry" failure, or no way to generate predefined rhyming patterns (without constrained sampling). There's an understandable reason for these big token dictionaries, but it feels like a hack.

  • krackers 2 days ago

    See this thread https://news.ycombinator.com/item?id=45640720

    As I understood the responses, the benefit comes from making better use of the embedding space. BPE tokenization is basically like a fixed lookup table, whereas when you form "image tokens" you just throw each 16x16 patch into a neural-net and (handwave) out comes your embedding. From that, it should be fairly intuitive that since current text tokenization embedding vectors won't even form a subspace (it can only just be ~$VOCAB_SIZE points), image tokens have the capacity to be more information dense. And you might hope that the neural network can somehow make use of that extra capacity, as you're not encoding one subword at a time.

  • spiderfarmer 2 days ago

    I absolutely think that it can, but it depends on what meaning you associate with each pixel.

koushikn 2 days ago

Is it feasible that if we have a tokeniser that works on ELF (or PE/COFF) binaries, then we could have LLMs trained on existing binaries and have them generate binary code directly, skipping the need for programming languages?

  • kkukshtel 2 days ago

    I've thought about this a lot, and it ultimately comes down to context size. Programming languages themselves are sort of a "compression technique" for assembly code. Current models, even at the high end (1M context windows), do not have near enough workable context to be effective at writing even trivial programs in binary or assembly. For simple instructions, sure, but for now the compression of languages (or DSLs) is a context-efficiency win.

  • anon291 2 days ago

    Possible but not precise depending on your use case. LLM compilers would suffer from the same sort of propensity to bugs as humans.

  • trollbridge 2 days ago

    I can attest that existing LLMs work surprisingly well for disassembly.

varispeed 3 days ago

Text is linear, whereas an image is parallel. I mean that when people read, they often don't scan text from left to right (or a different direction, depending on the language), but rather take in the text all at once or non-linearly. Like first locking on keywords and then reading adjacent words to get the meaning, often even skipping some filler sentences unconsciously.

Sequential reading of text is very inefficient.

  • sosodev 3 days ago

    LLMs don't "read" text sequentially, right?

    • olliepro 3 days ago

      The causal masking means future tokens don't affect previous tokens' embeddings as they evolve throughout the model, but all tokens are processed in parallel… so, yes and no. See this previous HN post (https://news.ycombinator.com/item?id=45644328) about how bidirectional encoders are similar to diffusion's non-linear way of generating text. Vision transformers use bidirectional encoding b/c of the non-causal nature of image pixels.

      • Merik 3 days ago

        Didn't Anthropic show that the models engage in a form of planning, such that they predict possible future tokens, which then affects prediction of the next token: https://transformer-circuits.pub/2025/attribution-graphs/bio...

        • ACCount37 3 days ago

          Sure, an LLM can start "preparing" for token N+4 at token N. But that doesn't change that the token N can't "see" N+1.

          Causality is enforced in LLMs - past tokens can affect future tokens, but not the other way around.

    • anon291 2 days ago

      If the attention is masked, then yes they do.

  • krackers 2 days ago

    Sure, but when people listen to speech it is literally one word at a time. So while there might be some benefit to being able to read non-linearly, it's probably not a bottleneck.

  • sota_pop 2 days ago

    I absolutely don’t “read the text all at once” and do read “left to right”. Could be why I usually find that my reading speed is slower than most. Although I’ve never really had a hard time with comprehension or remembering details.

    • jerojero 2 days ago

      I remember doing speed reading courses back when I was young and a big part of it was learning to read a paragraph diagonally.

      It's much, much faster. At first there's a loss of understanding, of course, but once you've practiced enough you'll be much faster.

  • jb1991 3 days ago

    I think you’re making a lot of assumptions about how people read.

    • com2kid 3 days ago

      He isn't, plenty of studies have been done on the topic. Eyes dart around a lot when reading.

      • jb1991 3 days ago

        People do skip words or scan for key phrases, but reading still happens in sequence. The brain depends on word order and syntax to make sense of text, so you cannot truly read it all at once. Skimming just means you sample parts of a linear structure, not that reading itself is non-linear. Eye-tracking studies confirm this sequential processing (check out the Rayner study in Psychological Bulletin if you are interested).

        • com2kid 2 days ago

          Thanks for the reference!

          Reading is def not 100% linear, as I find myself skipping ahead to see who is talking or what type of sentence I am reading (question, exclamation, statement).

          There is an interesting discussion down thread about ADHD and sequential reading. As someone who has ADHD I may be biased by how my brain works. I definitely don't read strictly linearly, there is a lot of jumping around and assembling of text.

          • dahart 2 days ago

            > Reading is def not 100% linear, as I find myself skipping ahead to see who is talking or what type of sentence I am reading (question, exclamation, statement).

            My initial reaction was to say speak for yourself about what reading is or isn’t, and that text is written linearly, but the more I think about it, the more I think you have a very good point. I think I read mostly linear and don’t often look ahead for punctuation. But sentence punctuation changes both the meaning and presumed tone of words that preceded it, and it’s useful to know that while reading the words. Same goes for something like “, Barry said.” So meaning in written text is definitely not 100% linear, and that justifies reading in non-linear ways. This, I’m sure, is one reason that Spanish has the pre-sentence question mark “¿”. And I think there are some authors who try to put who’s talking in front most of the time, though I can’t name any off the top of my head.

          • jb1991 2 days ago

            You may very well skip ahead for context, etc, and that is fine, but that doesn't mean you are actually reading out of order. It's one thing to get distracted or interested in other parts of a sentence or paragraph and jump around. But ultimately, if you are actually gathering the meaning that was written, you have to consume the words linearly at some point. Perhaps with ADHD you just have to endure some distractions on the way to doing so.

        • varispeed 2 days ago

          That's not exactly correct. You can totally read whole sentences or paragraphs at once without having to piece individual words together.

          I can give you an analogy that should hopefully help. If you look at a house, you don't look at the doors, windows, facade, roof individually, then ponder how they are related together to come to a conclusion that it is a house. You immediately know. This is similar with reading. It might require practice though (and a lot of reading!).

          • jb1991 2 days ago

            Your comparison makes no sense to me. Looking at an object and understanding what it is, is completely different from processing a sequential series of symbols that are designed to have meaning due to their linear order.

  • spiralcoaster 3 days ago

    What people do you know that do this? I absolutely read in a linear fashion unless I'm deliberately skimming something to get the gist of it. Who can read the text "all at once"?!

    • ants_everywhere 3 days ago

      I do this. I'm autistic and have ADHD so I'm not representative of the normal person. However, I don't think this is entirely uncommon.

      The relevant technical term is "saccade"

      > ADHD: Studies have shown a consistent reduction in ability to suppress unwanted saccades, suggesting an impaired functioning of areas like the dorsolateral prefrontal cortex.

      > Autism: An elevated number of antisaccade errors has been consistently reported, which may be due to disturbances in frontal cortical areas.

      https://eyewiki.org/Saccade

      Also see https://en.wikipedia.org/wiki/Eye_movement_in_reading

      • alwa 3 days ago

        I do this too. I suspect it may involve a subtly different mechanism from the saccade itself though? If the saccade is the behavior, and per the eyewiki link skimming is a voluntary type of saccade, there’s still the question of what leads me to use that behavior when I read (and others to read more linearly). Although you could certainly watch my eyes “saccade” around as I move nonlinearly through a passage, I’m not sure it’s out of a lack of control.

        Rather, I feel like I absorb written meaning in units closer to paragraphs than to words or sentences. I’d describe my rapid up-and-down, back-and-forth eye motions as something closer to going back to soak up more, if that makes sense. To reinterpret it in the context of what came after it. The analogy that comes to mind is to a Progressive JPEG getting crisper as more loads.

        That eyewiki entry was really cool. Among the unexpectedly interesting bits:

        > The initiation of a saccade takes about 200 milliseconds[4]. Saccades are said to be ballistic because the movements are predetermined at initiation, and the saccade generating system cannot respond to subsequent changes in the position of the target after saccade initiation[4].

        • ants_everywhere 2 days ago

          If you're an adult you probably have compensated for the saccades and developed a strategy that doesn't force you to read linearly. This is much of what "speed reading" courses try to do intentionally.

        • ProofHouse 2 days ago

          I also ping-pong around the page (ADHD'er). At times I read a sentence or two in linear fashion, then start jumping, or move to the end and read backwards, or any mix of this, depending.

    • numpad0 3 days ago

      I don't know how common it is, but I tend to read novels in a buttered heterogeneous multithreading mode - image and logical and emotional readings all going at their own paces, rather than a singular OCR engine feeding them all with 1D text

      is that crazy? I'm not buying that it is

      • alwa 3 days ago

        That description feels relatable to me. Maybe buffered more than buttered, in my case ;)

        It seems to me that would be a tick in the “pro” column for this idea of using pixels (or contours, a la JPEG) as the models’ fundamental stimulus to train against (as opposed to textual tokens). Isn’t there a comparison to be drawn between the “threads” you describe here, and the multi-headed attention mechanisms (or whatever it is) that the LLM models use to weigh associations at various distances between tokens?

      • bigbluedots 3 days ago

        Don't know, probably? I'm a linear reader

  • ants_everywhere 3 days ago

    some of us with ADHD just kind of read all the words at once

hbarka 3 days ago

Chinese writing is logographic. Could this be giving Chinese developers a better intuition for pixels as input rather than text?

  • anabis 3 days ago

    Yeah, mapping Chinese characters to linear UTF-8 space throws a lot of information away. Each language brings some ideas for text processing. The SentencePiece inventor is Japanese, for example, and Japanese doesn't have explicit word delimiters.

    • ComputerGuru 2 days ago

      It's not throwing any information away because it can be faithfully reconstructed (via an admittedly arduous process), therefore no entropy has been lost (if you consider the sum of both "input bytes" and "knowledge of utf-8 encoding/decoding").

  • hobofan 3 days ago

    Yeah, that sounds quite interesting. I'm wondering whether there is a bigger gap in performance (= quality) between text-only and vision OCR in Chinese than in English.

    There is indeed a lot of semantic information contained in the signs that should help an LLM. E.g. there is a clear visual connection between 木 (wood/tree) and 林 (forest), while an LLM that purely has to draw a connection between "tree" and "forest" would have a much harder time seeing that connection independent of whether it's fed that as text or vision tokens.

  • est 3 days ago

    Chinese text == Method of loci

    Many Chinese students have the memory to recall a particular paragraph and understand the meaning, but no idea how those words are pronounced.

    • yandie 2 days ago

      I can read Kanji (Japanese), and sometimes I will understand the sentence but can't pronounce it (Japanese Kanji rules are quite arbitrary). Your brain definitely handles information differently with Chinese characters

      • est 2 days ago

        and if you master the skill, it will speed up your reading dramatically.

        Ideograms could help you attach meanings to glyphs directly, skipping the single-threaded "vocal serialization" part.

hiddencost 3 days ago

Back before transformers, or even LSTMs, we used to joke that image recognition was so far ahead of language modeling that we should just convert our text to PDF and run the pixels through a CNN.

antirez 2 days ago

This should be "pixels are (maybe) a better representation than the current representation of tokens", which is very different. Text is surely more information-dense than an image containing the same text, so the problem is finding the best representation of text. If each word is expanded to a very large embedding and you see pixels doing better, then the problem is in the representation, not in text vs. image.

daxfohl 2 days ago

It seems like we're still pretty far away from that being viable, if chatgpt is any indication. Whenever it suggests "should I generate an image of that <class design, timeline, data model, etc>, it really helps visualize it!", the result is full of hallucinations.

  • valine 2 days ago

    Image generation and image input are two totally different things. This is about feeding text into LLMs as images, it has nothing to do with image generation.

    • daxfohl 2 days ago

      Yeah but IIUC they're both just representations of embeddings in a latent space, translated from one format to another. So if the image interpretation of a text embedding is full of hallucinations, it's unlikely that the other direction works well either (again, IIUC).

      That said, I'll be interested to see what the DeepSeek model can do once they've trained it in the other direction. It'd be great to have it output architecture diagrams that actually correspond to what it says in the chat.

alexchamberlain 3 days ago

I'm probably one of the least educated software engineers on LLMs, so apologies if this is a very naive question. Has anyone done any research into just using words as the tokens rather than (if I understand it correctly) 2-3 characters? I understand there would be limitations with this approach, but maybe the models would be smaller overall?

  • lyu07282 2 days ago

    The way modern tokenizers are constructed is by iteratively doing frequency analysis of arbitrary-length sequences over a large corpus. So what you suggested is already the norm; tokens aren't fixed n-grams. Words, and really any sequence that is common enough, will already be a single token; the less frequent a sequence is, the more tokens it needs. That's the byte-pair encoding algorithm:

    https://en.wikipedia.org/wiki/Byte-pair_encoding
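
    A minimal sketch of that merge loop on the classic toy corpus (illustration only, no optimizations):

        from collections import Counter

        def train_bpe(word_counts, num_merges):
            # Start from characters; repeatedly merge the most frequent pair.
            vocab = {tuple(w): c for w, c in word_counts.items()}
            merges = []
            for _ in range(num_merges):
                pairs = Counter()
                for sym, freq in vocab.items():
                    for a, b in zip(sym, sym[1:]):
                        pairs[(a, b)] += freq
                if not pairs:
                    break
                best = max(pairs, key=pairs.get)
                merges.append(best)
                new_vocab = {}
                for sym, freq in vocab.items():
                    out, i = [], 0
                    while i < len(sym):
                        if i + 1 < len(sym) and (sym[i], sym[i + 1]) == best:
                            out.append(sym[i] + sym[i + 1])
                            i += 2
                        else:
                            out.append(sym[i])
                            i += 1
                    new_vocab[tuple(out)] = freq
                vocab = new_vocab
            return merges

        # Frequent sequences collapse into single symbols after a few merges.
        print(train_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 5))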

    It's also not lossy compression at all, it's lossless compression if anything, unlike what some people have claimed here.

    Shocking comments here, what happened to HN? People are so clueless it reads like reddit wtf

    • alexchamberlain 2 days ago

      Thanks, that's really interesting. Do they correct for spelling mistakes or internationalised spellings? For example, does `colour` and `color` end up in the same token stream?

  • murkt 3 days ago

    You will need dictionaries with millions of tokens, which will make models much larger. Also, any word that has too low frequency to appear in the dictionary is now completely unknown to your model.

  • mhuffman 3 days ago

    Along with the other commenter's point, the reason the dictionary would get so big is that words sharing a stem would have all their variations as different tokens (cat, cats, sit, sitting, etc). Also, any out-of-dictionary words or combo words, e.g. "cat bed", couldn't be represented.

  • plaguuuuuu 2 days ago

    Presumably anyone tokenizing Chinese characters does, since they are basically entire words.

superconduct123 2 days ago

> more information compression (see paper) => shorter context windows, more efficiency

I'll ask the dumb question here

How is that possible? Wouldn't different sizes of text/fonts/rendering/spacing end up with way worse compression?

seydor 3 days ago

We're going to get closer and closer to removing all hand-engineered features of neural network architecture, letting a giant all-to-all fully connected network collapse on its own into the appropriate architecture for the data: a true black box.

  • justlikereddit 2 days ago

    Which is the logical conclusion.

    If the neural network can distill a model out of complex input data (especially when many models are frequently trained with data-augmentation practices that actively degrade the input to achieve generalization), then why are we stuck wearing silk-glove tokenizers?

daxfohl 2 days ago

I wouldn't think it would be good for coding assistants, or things where character precision is important.

OTOH maybe the information implied by syntax coloring could make syntax patterns easier to recognize and internalize? Once internalized, perhaps it'd retain and use that syntax understanding on plaintext too if you fine tune it by gradually removing the color coding. Similar approaches have worked for improving their innate (no "thinking", no tool use) arithmetic accuracy.

  • bigyikes 2 days ago

    It might be helpful for intuiting the structure of a program. Imagine if you had to read code all on a single line, with newlines represented with \n.

    I can get the feel of a piece of code just by looking at it. Even if you blurred the image, just the shape of the lines of code conveys a lot of information.

    • daxfohl 2 days ago

      True, but LLMs are already really good at that kind of thing. Even back in 2015, before transformers, here's a karpathy blog post showing how you could find specific neurons that tracked things like indent position, approx column location, long quotes, etc.

      https://karpathy.github.io/2015/05/21/rnn-effectiveness/

      That said, I do think algorithms and system designs are very visual. It's way harder to explain heaps and merge sorts and such from just text and code. Granted, it's 2025 now and modern LLMs seem to have internalized those types of concepts ~perfectly for a while now, so IDK if there's much to gain by changing approaches at that level anymore.

    • CamperBob2 2 days ago

      Another example might be the way people used to show off their Wordle scores on Twitter when the game first came out. Just posting the gray, green and yellow squares by themselves, sans text, communicates a surprising amount of information about the player's guesses.

cnxhk 3 days ago

The paper is quite interesting but efficiency on OCR tasks does not mean it could be plugged into a general llm directly without performance loss. If you train a tokenizer only on OCR text you might be able to get better compression already.

hunglee2 3 days ago

Really interesting analysis of the latest DeepSeek innovation. I'm tempted to connect it to the information density of logographic script, in which DeepSeek's engineers would all be natively fluent.

anon291 2 days ago

I made exactly this point at the inaugural Portland AI tinkerers meetup. I had been messing with large document understanding. Converting PDF to text and then sending to gpt was too expensive. It was cheaper to just upload the image and ask it questions directly. And about as accurate.

https://portland.aitinkerers.org/talks/rsvp_fGAlJQAvWUA

yalogin 2 days ago

I don't quite follow. The way I see it, what the LLM "reads" depends on the input modality. If the input is from a human, it will be in text form; it has to be. If the input is through a camera, then yes, even text will be camera frames and pixels, and that is how I'd expect the LLM to process it. So I would think a vision LLM is already doing this.

  • danans 2 days ago

    > if the input is a human it will be in text form, has to be.

    Why can't it be a sequence of audio waveforms from human speech?

bahmboo 2 days ago

Not criticizing per se, but I just watched this recent (and great!) interview where he extols how special written language is. That was my takeaway at least. Still trying to wrap my head around this vision-encoder approach. He's way smarter than me! https://youtu.be/lXUZvyajciY

nottorp 2 days ago

The text should be printed and a photo of the printed paper on a wooden table should be passed as input into the LLM.

  • taneq a day ago

    All questions must now be posed to the Oracle through interpretive dance.

bonoboTP 2 days ago

Sometimes you want to be Unicode-precise, such as when checking if domain names are legit.

pcwelder 3 days ago

There are many unicode characters that look alike. There are also those zero width characters.
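
For example (the homoglyph and zero-width codepoints here are just illustrative picks):

    latin = "paypal"
    homoglyph = "p\u0430yp\u0430l"  # Cyrillic 'а' (U+0430) in place of Latin 'a'
    hidden = "pay\u200bpal"         # zero-width space (U+200B) hiding inside

    for s in (latin, homoglyph, hidden):
        print(repr(s), len(s), [hex(ord(c)) for c in s])
    print(latin == homoglyph)  # False, despite rendering near-identically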

jimdavid 3 days ago

Did anyone check the token feature dimension? If we're talking about compression, "token length" is just one of the dimensions.

qarl 2 days ago

Hm.

When I think to myself, I hear words stream across my inner mind.

It's not pages of text. It's words.

yunwal 4 days ago

> The more interesting part for me (esp as a computer vision at heart who is temporarily masquerading as a natural language person) is whether pixels are better inputs to LLMs than text. Whether text tokens are wasteful and just terrible, at the input.

> Maybe it makes more sense that all inputs to LLMs should only ever be images.

So, what, every time I want to ask an LLM a question I paint a picture? I mean at that point why not just say "all input to LLMs should be embeddings"?

  • fspeech 3 days ago

    If you can read your input on your screen, your computer apparently knows how to convert your text to images.

  • smegma2 4 days ago

    No? He’s talking about rendered text

    • rhdunn 3 days ago

      From the post he's referring to text input as well:

      > Maybe it makes more sense that all inputs to LLMs should only ever be images. Even if you happen to have pure text input, maybe you'd prefer to render it and then feed that in:

      Italicized emphasis mine.

      So he's suggesting that/wondering if the vision model should be the only input to the LLM and have that read the text. So there would be a rasterization step on the text input to generate the image.

      Thus, you don't need to draw a picture but generate a raster of the text to feed it to the vision model.
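
      A minimal sketch of that rasterization step (using Pillow's built-in bitmap font; the 16-pixel patch size mirrors typical vision encoders but is otherwise an assumption):

          from PIL import Image, ImageDraw
          import numpy as np

          def rasterize(text, width=256, height=32):
              # Render black text on a white single-channel canvas.
              img = Image.new("L", (width, height), color=255)
              ImageDraw.Draw(img).text((2, 8), text, fill=0)
              return np.asarray(img, dtype=np.float32) / 255.0

          def to_patches(img, p=16):
              # Cut the bitmap into flattened p x p patches, ViT-style.
              h, w = img.shape
              return (img[: h - h % p, : w - w % p]
                      .reshape(h // p, p, w // p, p)
                      .swapaxes(1, 2)
                      .reshape(-1, p * p))

          patches = to_patches(rasterize("Even pure text becomes pixels first."))
          print(patches.shape)  # (32, 256): one row per 16x16 patch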

  • CuriouslyC 3 days ago

    All inputs being embeddings can work if you have embeddings like Matryoshka; the hard part is adaptively selecting the embedding size for a given datum.

  • awesome_dude 3 days ago

    I mean, text is, after all, highly stylised images

    It's trivial for text to be pasted in and converted to pixels (that's what my computer, and every computer on the planet, does when showing me text)

InkCanon 2 days ago

Could someone explain to me the difference? They both get turned to tensors of floats.

  • 0x264 2 days ago

    JavaScript code and Haskell code ultimately both get turned into instructions for a microprocessor, so there really isn't much of a difference between both.

rustyconover 2 days ago

Yet again Hollywood is prescient. This post reminds me of the language of the aliens in Arrival. It seems like the OP would see that as a reasonable input to an LLM.

ninetyninenine 2 days ago

eh, some part of the model will be translating those pixels into tokens. We're just moving the extra step into the blackbox.

bni 3 days ago

Of course PowerPoint is the best input to LLMs. They will come to that eventually.

  • brokencode 3 days ago

    I'd actually prefer to communicate to ChatGPT via Microsoft Paint. Much more efficient than typing.

    • saaaaaam 2 days ago

      Leading scientists claim interpretative dance is the AI breakthrough the world has been waiting for!

  • jtwaleson 3 days ago

    It's slides all the way down. Once models support this natively, it's a major threat to slides ai / gamma and the careers of product managers.

  • cat5e 3 days ago

    Yeah, I’ve seen great results with this approach.

  • amelius 2 days ago

    Clippy knew this all along.

dgfitz 3 days ago

[flagged]

  • scotty79 3 days ago

    It's kind of beautiful that they can actually do that.