Is this able to convert pdf flowcharts into yaml or json representations of them? I have been experimenting with Claude 3.5. It has been very good at readig / understanding/ converting into representations of flow charts.
So I am wondering if this is more capable. Will try definitely, but maybe someone can chime in.
This is incredibly exciting. I've been pondering/experimenting on a hobby project that makes reading papers and textbooks easier and more effective. Unfortunately the OCR and figure extraction technology just wasn't there yet. This is a game changer.
Specifically, this allows you to associate figure references with the actual figure, which would allow me to build a UI that solves the annoying problem of looking for a referenced figure on another page, which breaks up the flow of reading.
It also allows a clean conversion to HTML, so you can add cool functionality like clicking on unfamiliar words for definitions, or inserting LLM generated checkpoint questions to verify understanding. I would like to see if I can automatically integrate Andy Matuschak's Orbit[0] SRS into any PDF.
The output includes images from the input. You can see that on one of the examples where a logo is cropped out of the source and included in the result.
We're approaching the point where OCR becomes "solved" — very exciting! Any legacy vendors providing pure OCR are going to get steamrolled by these VLMs.
However IMO, there's still a large gap for businesses in going from raw OCR outputs —> document processing deployed in prod for mission-critical use cases. LLMs and VLMs aren't magic, and anyone who goes in expecting 100% automation is in for a surprise.
You still need to build and label datasets, orchestrate pipelines (classify -> split -> extract), detect uncertainty and correct with human-in-the-loop, fine-tune, and a lot more. You can certainly get close to full automation over time, but it's going to take time and effort. But the future is on the horizon!
Disclaimer: I started a LLM doc processing company to help companies solve problems in this space (https://extend.app/)
One problem I’ve encountered at my small startup in evaluating OCR technologies is precisely convincing stakeholders that the “human-in-the-loop” part is both unavoidable, and ultimately beneficial.
PMs want to hear that an OCR solution will be fully automated out-of-the-box. My gut says that anything offering that is snake-oil, and I try to convey that the OCR solution they want is possible, but if you are unwilling to pay the tuning cost, it’s going to flop out of the gate. At that point they lose interest and move on to other priorities.
Yup definitely, and this is exactly why I built my startup. I've heard this a bunch across startups & large enterprises that we work with. 100% automation is an impossible target, because even humans are not 100% perfect. So how we can expect LLMs to be?
But that doesn't mean you have to abandon the effort. You can still definitely achieve production-grade accuracy! It just requires having the right tooling in place, which reduces the upfront tuning cost. We typically see folks get there on the order of days or 1-2 weeks (it doesn't necessarily need to take months).
yeah that's a fun challenge — what we've seen work well is a system that forces the LLM to generate citations for all extracted data, map that back to the original OCR content, and then generate bounding boxes that way. Tons of edge cases for sure that we've built a suite of heuristics for over time, but overall works really well.
>> Any legacy vendors providing pure OCR are going to get steamrolled by these VLMs.
-OR- they can just use these APIs, and considering that they have a client base - which would prefer to not rewrite integrations to get the same result - they can get rid of most code base, replace it with llm api and increase margins by 90% and enjoy good life.
It's not about harder but about what error you can tolerate. Here if you have accuracy 99% for many applications it's enough. If you have 99% accuracy per trip of no crash during self driving then you gonna be dead within a year very likely.
For cars we need accuracy at least 99.99% and that's very hard.
I doubt most people have 99% accuracy. The threshold of tolerance for error is just much lower for any self-driving system (and with good reason, because we're not familiar with them yet).
I guess something like success rate for a trip (or mile) would be a more reasonable metric. Most people have a success rate far higher than 99% for averages trips.
Most people who commute daily are probably doing something like a 1000 car rides a year and have minor accidents every few years. 99% success rates would mean monthly accidents.
Great progress, but unfortunately, for our use case (converting medical textbooks from PDF to MD), the results are not as good as those by MinerU/PDF-Extract-Kit [1].
Also the collab link in the article is broken, found a functional one [2] in the docs.
This is cool! With that said for anyone looking to use this in RAG, the downside to specialized models instead of general VLMs is you can't easily tune it to your use specific case. So for example, we use Gemini to add very specific alt text to images in the extracted Markdown. It's also 2 - 3X the cost of Gemini Flash - hopefully the increased performance is significant.
Regardless excited to see more and more competition in the space.
gemini flash is notorious for hallucinating the output of the OCR, be careful with it. For straight forward, semi-structured, low page count (under 5) it should perform well, but the more the context window is stretched the more the output gets more unreliable
Dang. Super fast and significantly more accurate than google, Claude and others.
Pricing : $1/1000 pages, or per 2k pages if “batched”. I’m not sure what batching means in this case: multiple pdfs? Why not split them to halve the cost?
From my testing so far, it seems it's super fast and responded synchronously. But it decided that the entire page is an image and returned `` with coordinates in the metadata for the image, which is the entire page.
Our tool, doctly.ai is much slower and async, but much more accurate and gets you the content itself as an markdown.
Haha for sure. Naming isn't just the hardest problem in computer science, it's always hard. But at some point you just have to pick something and move forward.
May I ask as a layperson, how would you about using this to OCR multiple hundreds of pages?
I tried the chat but it pretty much stops after the 2nd page.
You can check the example code on the Mistral documentation, you would _only_ have to change the value of the variable `document_url` to the URL of your uploaded PDF... and you need to change the `MISTRAL_API_KEY` to the value of your specific key that you can get from the Le Platforme webpage.
Usually (With OpenAI, I haven't checked Mistral yet) it means an async api rather than a sync api.
e.g. you submit multiple requests (pdfs) in one call, and get back an id for the batch. You then can check on the status of that batch and get the results for everything when done.
It lets them use their available hardware to it's full capacity much better.
I noticed on the Arabic example they lost a space after the first letter on the third to last line, can any native speakers confirm? (I only know enough Arabic to ask dumb questions like this, curious to learn more.)
Edit: it looks like they also added a vowel mark not present in the input on the line immediately after.
Edit2: here's a picture of what I'm talking about, the before/after: https://ibb.co/v6xcPMHv
I am pretty sure it added a kasrah not present in the input on the 2nd to last line. (Not saying it's not super impressive, and also that almost certainly is the right word, but I think that still means not quite "perfect"?)
Related, does anyone know of an app that can read gauges from an image and log the number to influx? I have a solar power meter in my crawlspace, it is inconvenient to go down there. I want to point an old phone at it and log it so I can check it easily. The gauge is digital and looks like this:
You'll be happier finding a replacement meter that has an interface to monitor it directly or a second meter. An old phone and OCR will be very brittle.
Not OP, but it sounds like the kind of project I’d undertake.
Happiness for me is about exploring the problem within constraints and the satisfaction of building the solution. Brittleness is often of less concern than the fun factor.
And some kinds of brittleness can be managed/solved, which adds to the fun.
I would posit that learning how the device works, and how to integrate with a newer digital monitoring device would be just as interesting and less brittle.
Possibly! But I’ve recently wanted to dabble with computer vision, so I’d be looking at a project like this as a way to scratch a specific itch. Again, not OP so I don’t know what their priorities are, but just offering one angle for why one might choose a less “optimal” approach.
This[1] is something I've come across but not had a chance to play with, designed for reading non-smart meters that might work for you. I'm not sure if there's any way to run it on an old phone though.
...start digging around and you'll likely find something. HA has integrations which can support writing to InfluxDB (local for sure, and you can probably configure it for a remote influxdb).
You're looking at 1xRaspberry PI, 1xUSB Webcam, 1x"Power Management / humidity management / waterproof electrical box" to stuff it into, and then either YOLO and DIY to shoot over to your influxdb, or set up a Home Assistant and "attach" your frankenbox as some sort of "sensor" or "integration" which spits out metrics and yadayada...
4o transcribes it perfectly. You can usually root an old Android and write this app in ~2h with LLMs if unfamiliar. The hard part will be maintaining camera lens cleanliness and alignment etc.
The time cost is so low that you should give it a gander. You'll be surprised how fast you can do it. If you just take screenshots every minute it should suffice.
6 years ago I was working with a very large enterprise that was struggling to solve this problem, trying to scan millions of arbitrary forms and documents per month to
clearly understand key points like account numbers, names and addresses, policy numbers, phone numbers, embedded images or scribbled notes, and also draw relationships between these values on a given form, or even across forms.
I wasn't there to solve that specific problem but it was connected to what we were doing so it was fascinating to hear that team talk through all the things they'd tried, from brute-force training on templates (didn't scale as they had too many kinds of forms) to every vendor solution under the sun (none worked quite as advertised on their data)..
I have to imagine this is a problem shared by so many companies.
It’s interesting that none of the existing models can decode a Scrabble board screen shot and give an accurate grid of characters.
I realize it’s not a common business case, came across it testing how well LLMs can solve simple games. On a side note, if you bypass OCR and give models a text layout of a board standard LLMs cannot solve Scrabble boards but the thinking models usually can.
Nice demos but I wonder how well it does on longer files. I've been experimenting with passing some fairly neat PDFs to various LLMs for data extraction. They're created from Excel exports and some of the data is cut off or badly laid out, but it's all digitally extractable.
The challenge isn't so much the OCR part, but just the length. After one page the LLMs get "lazy" and just skip bits or stop entirely.
And page by page isn't trivial as header rows are repeated or missing etc.
So far my experience has definitely been that the last 2% of the content still takes the most time to accurately extract for large messy documents, and LLMs still don't seem to have a one-shot solve for that. Maybe this is it?
You will have to send one page at a time, most of this work has to be done via RAG. Adding a large context (like a whole PDF), still does not work that well in my experience.
We developers seem to really dislike PDFs, to a degree that we'll build LLMs and have them translate it into Markdown.
Jokes aside, PDFs really serve a good purpose, but getting data out of them is usually really hard. They should have something like an embedded Markdown version with a JSON structure describing the layout, so that machines can easily digest the data they contain.
I feel this is created for RAG. I tried a document [0] that I tested with OCR; it got all the table values correctly, but the page's footer was missing.
Headers and footers are a real pain with RAG applications, as they are not required, and most OCR or PDF parsers will return them, and there is extract work to do to remove them.
The new Mistral OCR release looks impressive - 94.89% overall accuracy and significantly better multilingual support than competitors. As someone who's built document processing systems at scale, I'm curious about the real-world implications.
Has anyone tried this on specialized domains like medical or legal documents? The benchmarks are promising, but OCR has always faced challenges with domain-specific terminology and formatting.
Also interesting to see the pricing model ($1/1000 pages) in a landscape where many expected this functionality to eventually be bundled into base LLM offerings. This feels like a trend where previously encapsulated capabilities are being unbundled into specialized APIs with separate pricing.
I wonder if this is the beginning of the componentization of AI infrastructure - breaking monolithic models into specialized services that each do one thing extremely well.
Excited to test this our on our side as well. We recently built an OCR benchmarking framework specifically for VLMs[1][2], so we'll do a test run today.
From our last benchmark run, some of these numbers from Mistral seem a little bit optimistic. Side by side of a few models:
model | omni | mistral |
gemini | 86% | 89% |
azure | 85% | 89% |
gpt-4o | 75% | 89% |
google | 68% | 83% |
Currently adding the Mistral API and we'll get results out today!
At my client we want to provide an AI that can retrieve relevant information from documentation (home building business, documents detail how to install a solar panel or a shower, etc) and we've set up an entire system with benchmarks, agents, etc, yet the bottleneck is OCR!
We have millions and millions of pages of documents and an off by 1 % error means it compounds with the AI's own error, which compounds with documentation itself being incorrect at times, which leads it all to be not production ready (and indeed the project has never been released), not even close.
We simply cannot afford to give our customers incorrect informatiin
We have set up a backoffice app that when users have questions, it sends it to our workers along the response given by our AI application and the person can review it, and ideally correct the ocr output.
Honestly after an year of working it feels like AI right now can only be useful when supervised all the time (such as when coding). Otherwise I just find LLMs still too unreliable besides basic bogus tasks.
As someone who has had a home built, and nearly all my friends and acquaintances report the same thing, having a 1% error on information in this business would mean not a 10x but a 50x improvement over the current practice in the field.
If nobody is supervising building documents all the time during the process, every house would be a pile of rubbish. And even when you do stuff stills creeps in and has to be redone, often more than once.
I have done OCR on leases. It’s hard. You have to be accurate and they all have bespoke formatting.
It would almost be easier to switch everyone to a common format and spell out important entities (names, numbers) multiple times similar to how cheques do.
The utility of the system really depends on the makeup of that last 5%. If problematic documents are consistently predictable, it’s possible to do a second pass with humans. But if they’re random, then you have to do every doc with humans and it doesn’t save you any time.
re: real world implications, LLMs and VLMs aren't magi, and anyone who goes in expecting 100% automation is in for a surprise (especially in domains like medical or legal).
IMO there's still a large gap for businesses in going from raw OCR outputs —> document processing deployed in prod for mission-critical use cases.
e.g. you still need to build and label datasets, orchestrate pipelines (classify -> split -> extract), detect uncertainty and correct with human-in-the-loop, fine-tune, and a lot more. You can certainly get close to full automation over time, but it's going to take time and effort.
But for RAG and other use cases where the error tolerance is higher, I do think these OCR models will get good enough to just solve that part of the problem.
Disclaimer: I started a LLM doc processing company to help companies solve problems in this space (https://extend.app/)
I'd love to try it for my domain (regulation), but $1/1000 pages is significantly more expensive than my current local Docling based setup that already does a great job of processing PDF's for my needs.
I think for regulated fields / high impact fields $1/1000 is well-worth the price; if the accuracy is close to 100% this is way better than using people, who are still error-prone
I actually think you're onto something there. The "MicroLLMs Architecture" could mirror how microservices revolutionized web architecture.
Instead of one massive model trying to do everything, you'd have specialized models for OCR, code generation, image understanding, etc. Then a "router LLM" would direct queries to the appropriate specialized model and synthesize responses.
The efficiency gains could be substantial - why run a 1T parameter model when your query just needs a lightweight OCR specialist? You could dynamically load only what you need.
The challenge would be in the communication protocol between models and managing the complexity. We'd need something like a "prompt bus" for inter-model communication with standardized inputs/outputs.
Has anyone here started building infrastructure for this kind of model orchestration yet? This feels like it could be the Kubernetes moment for AI systems.
I’m doing this personally for my own project - essentially building an agent graph that starts with the image output, orients and cleans, does a first pass with tesseract LSTM best models to create PDF/HOCR/Alto, then pass to other LLMs and models based on their strengths to further refine towards markdown and latex. My goal is less about RAG database population but about preserving in a non manually typeset form the structure and data and analysis, and there seems to be pretty limited tooling out there since the goal generally seems to be the obviously immediately commercial goal of producing RAG amenable forms that defer the “heavy” side of chart / graphic / tabular reproduction to a future time.
This is already done with agents. Some agents only have tools and the one model, some agents will orchestrate with other LLMs to handle more advanced use cases. It's pretty obvious solution when you think about how to get good performance out of a model on a complex task when useful context length is limited: just run multiple models with their own context and give them a supervisor model—just like how humans organize themselves in real life.
Making Transformers the same cost as CNN's (which are used in character-level ocr, as opposed to image-patch-level) is a good thing. The problem with CNN based character-level OCR is not the recognition models but the detection models. In a former life, I found a way to increase detection accuracy, and, therefore, overall OCR accuracy, and used that as an enhancement on top of Amazon and Google OCR. It worked really well. But the transformer approach is more powerful and if it can be done for $1 per 1000 pages, that is a game changer, IMO, at least of incumbents offering traditional character-level OCR.
I would like to see how it performs with massively warped and skewed scanned text images, basically a scanned image where the text lines are wavy as opposed as straight horizontal, where the letters are elongated. One where the line widths are different depending on the position on the scanned image. I once had to deal with such a task that somebody gave me with OCR software, Acrobat, and other tools could not decode the mess so I had to recreate the 30 pages myself, manually. Not a fun thing to do but that is a real use case.
I wonder how good it would be to convert sheet music to MusicXML. All the current tools more or less suck with this task, or maybe I’m just ignorant and don’t know what lego bricks to put together.
Is there a reliable handwriting OCR benchmark out there (updated, not a blog post)? Despite the gains claimed for printed text, I found (anecdotally) that trying to use Mistral OCR on my messy cursive handwriting to be much less accurate than GPT-4o, in the ballpark of 30% wrong vs closer to 5% wrong for GPT-4o.
Simon Willison linked to an impressive demo of Qwen2-VL in this area: I haven't found a version of it that I could run locally yet to corroborate. https://simonwillison.net/2024/Sep/4/qwen2-vl/
One of my hobby projects while in University was to do OCR on book scans. Doing character recognition was solved, but finding the relationship between characters was very difficult. I tried "primitive" neural nets, but edge cases would often break what I built. Super cool to me to see such an order of magnitude in improvement here.
Does it do hand written notes and annotations? What about meta information like highlighting? I am also curious if LLMs will get better because more access to information if it can be effectively extracted from PDFs.
Does this support Japanese? They list a table of language comparisons againat other approaches but I can't tell if it is exhaustive.
I'm hoping that something like this will be able to handle 3000-page Japanese car workshop manuals. Because traditional OCR really struggles with it. It has tables, graphics, text in graphics, the whole shebang.
It will be interesting to see how all the companies in the document processing space adapt as OCR becomes a commodity.
The best products will be defined by everything "non-AI", like UX, performance and reliability at scale, and human-in-the loop feedback for domain experts.
They will offer integrations into enterprise systems, just like they do today.
Lots of big companies don't like change. The existing document processing companies will just silently start using this sort of service to up their game, and keep their existing relationships.
I 100% agree with this, I think you can even extend this to any AI, in the end, IMO, as the llm is more commoditized, the surface of which the value is delivered will matter more
I was just watching a science-related video containing math equations. I wondered how soon will I be able to ask the video player "What am I looking at here, describe the equations" and it will OCR the frames, analyze them and explain them to me.
It's only a matter of time before "browsing" means navigating HTTP sites via LLM prompts. although, I think it is critical that LLM input should NOT be restricted to verbal cues. Not everyone is an extrovert that longs to hear the sound of their own voices. A lot of human communication is non-verbal.
Once we get over the privacy implications (and I do believe this can only be done by worldwide legislative efforts), I can imagine looking at a "website" or video, and my expressions, mannerisms and gestures will be considered prompts.
At least that is what I imagine the tech would evolve into in 5+ years.
> I wondered how soon will I be able to ask the video player "What am I looking at here, describe the equations" and it will OCR the frames, analyze them and explain them to me.
While those are concerns, my point was that having everything on the internet navigated to, digested and explained to me sounds unpleasant and overall a drain on my ability to think and reason for myself.
It is specifically how you describe using the tech that provokes a feeling of revulsion to me.
Then I think you misunderstand. The ML system would know when you want things digested to you or not. Right now companies are assuming this and forcing LLM interaction. But when properly done, the system would know based on your behavior or explicit prompts what you want and provide the service. If you're staring at a paragraph intently and confused, it might start highlighting common phrases or parts of the text/picture that might be hard to grasp and based on your reaction to that, it might start describing things via audio,tool tips,side pane,etc.. In other words, if you don't like how and when you're interacting with the LLM ecosystem, then that is an immature and failing ecosystem, in my vision this would be a largely solved problems, like how we interact with keyboards,mouse and touchscreens today.
Now? OK, you need to screencap and upload to LLM, but that's well established tech by now. (Where by "well established", I mean at least 9 months old ;)
Same goes for "navigating HTTP sites via LLM prompts". Most LLMs have web search integration, and the "Deep Research" variants do more complex navigation.
Video chat is there partially, as well. It doesn't really pay much attention to gestures & expressions, but I'd put the "earliest possible" threshold for that a good chunk closer than 5 years.
Yeah, all these things are possible today, but getting them well polished and integrated is another story. Imagine all this being supported by "HTML6" lol. When apple gets around to making this part of safari, then we know it's ready.
But kidding aside - I'm not sure people want this being supported by web standards. We could be a huge step closer to that future had we decided to actually take RDF/Dublin Core/Microdata seriously. (LLMs perform a lot better with well-annotated data)
The unanimous verdict across web publishers was "looks like a lot of work, let's not". That is, ultimately, why we need to jump through all the OCR hoops. Not only did the world not annotate the data, it then proceeded to remove as many traces of machine readability as possible.
So, the likely gating factor is probably not Apple & Safari & "HTML6" (shudder!)
If I venture my best bet what's preventing polished integration: It's really hard to do via foundational models only, and the number of people who want to have deep & well-informed conversations via a polished app enough that they're willing to pay for an app that does that is low enough that it's not the hot VC space. (Yet?)
Crystal ball: Some OSS project will probably get within spitting distance of something really useful, but also probably flub the UX. Somebody else will take up these ideas while it's hot and polish it in a startup. So, 18-36 months for an integrated experience from here?
It worked perfectly for me with a simple 2 page PDF that contained no graphics or formatting beyond headers and list items. Since it was so small I had the time to proof-read it and there were no errors. It added some formatting, such as bolding headers in list items and putting tics around file and function names. I won't complain.
I love mistral and what they do. I got really excited about this, but a little disappointed after my first few tests.
I tried a complex table that we use as a first test of any new model, and Mistral OCR decided the entire table should just be extracted as an 'image' and returned this markdown:
```

```
I'll keep testing, but so far, very disappointing :(
This document I try is the entire reason we created Doctly to begin with. We needed an OCR tool for regulatory documents we use and nothing could really give us the right data.
Doctly uses a judge, OCRs a document against multiple LLMs and decides which one to pick. It will continue to run the page until the judge scores above a certain score.
I would have loved to add this into the judge list, but might have to skip it.
Where did you test it? At the end of the post they say:
> Mistral OCR capabilities are free to try on le Chat
but when asked, Le Chat responds:
> can you do ocr?
> I don't have the capability to perform Optical Character Recognition (OCR) directly. However, if you have an image with text that you need to extract, you can describe the text or provide details, and I can help you with any information or analysis related to that text. If you need OCR functionality, you might need to use a specialized tool or service designed for that purpose.
Edit: Tried anyway by attaching an image; it said it could do OCR and then output... completely random text that had absolutely nothing to do with the text in the image!... Concerning.
Tried again with a better definition image, output only the first twenty words or so of the page.
Interestingly I’m currently going through and scanning the hundreds of journal papers my grandfather authored in medicine and thinking through what to do about graphs. I was expecting to do some form of multiphase agent based generation of LaTeX or SVG rather than a verbal summary of the graphs. At least in his generation of authorship his papers clearly explained the graphs already. I was pretty excited to see your post naturally but when I looked at the examples what I saw was, effectively, a more verbose form of
```  ```
I’m assuming this is partially because your use case is targeting RAG under various assumptions bur also partially because multimodal models aren’t near what I would need to be successful with?
We need to update the examples on the front page. Currently for things that are considered charts/graphs/figures we convert to a description. For things like logos or images we do an image tag. You can also choose to exclude them.
The difference with this is that it took the entire page as an image tag (it's just a table of text in my document). rather than being more selective.
I do like that they give you coordinates for the images though, we need to do something like that.
Give the actual tool a try. Would love to get your feedback for that use case. It gives you 100 free credits initially but if you email me (ali@doctly.ai), I can give you an extra 500 (goes for anyone else here also)
If you have a judge system, and Mistral performs well on other tests, wouldn't you want to include it so if it scores the highest by your judges ranking it would select the most accurate result? Or are you saying that mistral's image markdown would score higher on your judge score?
We'll definitely be doing more tests, but the results I got on the complex tests would result in a lower score and might not be worth the extra cost of the judgement itself.
In our current setup Gemini wins most often. We enter multiple generations from each model into the 'tournament', sometimes one generation from gemini could be at the top while another in the bottom, for the same tournament.
We've been getting great results with those aswell. But ofcourse there is always some chance of not getting it perfect, specially with different handwritings.
Give it a try, no credit cards needed to try it. If you email me (ali@doctly.ai) i can give you extra free credits for testing.
Great question. The language models are definitely beating the old tools. Take a look at Gemini for example.
Doctly runs a tournament style judge. It will run multiple generations across LLMs and pick the best one. Outperforming single generation and single model.
It's useful to have the plain text down the line for operations not involving a language model (e.g. search). Also if you have a bunch of prompts you want to run it's potentially cheaper, although perhaps less accurate, to run the OCR once and save yourself some tokens or even use a smaller model for subsequent prompts.
Tons of uses: Storage (text instead of images), search (user typing in a text box and you want instant retrieval from a dataset), etc. And costs: run on images once - then the rest of your queries will only need to run on text.
If you are working with PDF, I would suggest a hybrid process.
It is feasible to extract information with 100% accuracy from PDFs that were generated using the mappable acrofields approach. In many domains, you have a fixed set of forms you need to process and this can be leveraged to build a custom tool for extracting the data.
Only if the PDFs are unknown or were created by way of a cellphone camera, multifunction office device, etc should you need to reach for OCR.
The moment you need to use this kind of technology you are in a completely different regime of what the business will (should) tolerate.
> Only if the PDFs are unknown or were created by way of a cellphone camera, multifunction office device, etc should you need to reach for OCR.
It's always safer to OCR on every file. Sometimes you'll have a "clean" pdf that has a screenshot of an Excel table. Or a scanned image that has already been OCR'd by a lower quality tool (like the built in Adobe OCR). And if you rely on this you're going to get pretty unpredictable results.
It's way easier (and more standardized) to run OCR on every file, rather than trying to guess at the contents based on the metadata.
It's not guessing if the form is known and you can read the information directly.
This is a common scenario at many banks. You can expect nearly perfect metadata for anything pushed into their document storage system within the last decade.
Oh yea if the form is known and standardized everything is a lot easier.
But we work with banks on our side, and one of the most common scenarios is customers uploading financials/bills/statements from 1000's of different providers. In which case it's impossible to know every format in advance.
The hard ones are things like contracts, leases, and financial documents which 1) don’t have a common format 2) are filled with numbers proper nouns and addresses which it’s really important not to mess up 3) cannot be inferred from context.
Typical OCR pipeline would be to pass the doc through a character-level OCR system then correct errors with a statistical model like an LLM. An LLM can help correct “crodit card” to “credit card” but it cannot correct names or numbers. It’s really bad if it replaces a 7 with a 2.
Given the wide variety of pricing on all of these providers, I keep wondering how the economics work. Do they have fantastic margin on some of these products or is it a matter of subsidizing the costs, hoping to capture the market? Last I heard, OpenAI is still losing money.
Looks good but in the first hover/slider demo one can see how it could lead to confusion when handling side by side content.
Table 1 is referred to in section `2 Architectural details` but before `2.1 Multimodal Decoder`. In the generated markdown though it is below the latter section, as if it was in/part of that section.
Of course I am nitpicking here but just the first thing I noticed.
> Preserving historical and cultural heritage: Organizations and nonprofits that are custodians of heritage have been using Mistral OCR to digitize historical documents and artifacts, ensuring their preservation and making them accessible to a broader audience.
Bit unrelated but is there anything that can help with really low resolution text? My neighbor got hit and run the other day for example, and I've been trying every tool I can to make out some of the letters/numbers on the plate
Its in CA, looks like paper plates which follow a specific format and the last two seem to be the numbers '64'. Police should be able to search for temp tag with partial match and match the make/model. Was curious to see if any software could help though
If you know the font in advance (which you often do in these cases) you can do insane reconstructions. Also keep in mind that it doesn't have to be a perfect match, with the help of the color and other facts (such as likely location) about the car you can narrow it down significantly.
So, the only thing that stopped AI from learning from all our science and taking over the world was the difficulty of converting PDFs of academic papers to more computer readable formats.
Just tested with a multilingual (bidi) English/Hebrew document.
The Hebrew output had no correspondence to the text whatsoever (in context, there was an English translation, and the Hebrew produced was a back-translation of that).
Their benchmark results are impressive, don't get me wrong. But I'm a little disappointed. I often read multilingual document scans in the humanities. Multilingual (and esp. bidi) OCR is challenging, and I'm always looking for a better solution for a side-project I'm working on (fixpdfs.com).
Also, I thought OCR implied that you could get bounding boxes for text (and reconstruct a text layer on a scan, for example). Am I wrong, or is this term just overloaded, now?
Mathpix is ace. That’s the best results I got so far for scientific papers and reports. It understands the layout of complex documents very well, it’s quite impressive. Equations are perfect, figures extraction works well.
There are a few annoying issues, but overall I am very happy with it.
Curious to see how this performance against more real world usage of someone taking a photo of text (which the text then becomes slightly blurred) and performing OCR on it.
I can't exactly tell if the "Mistral 7B" image is an example of this exact scenario.
It's disappointing to see that the benchmark results are so opaque. I hope we see reproducible results soon, and hopefully from Mistral themselves.
1. We don't know what the evaluation setup is. It's very possible that the ranking would be different with a bit of prompt engineering.
2. We don't know how large each dataset is (or even how the metrics are calculated/aggregated). The metrics are all reported as XY.ZW%, but it's very possible that the .ZW% -- or even Y.ZW% -- is just noise.[1]
3. We don't know how the datasets were mined or filtered. Mistral could have (even accidentally!) filtered out particularly data points that their model struggled with. (E.g., imagine good-meaning engineer testing a document with Mistral OCR first, finding it doesn't work, and deducing that it's probably bad data and removing it.)
Given the fact that multi-modal LLMs are getting so good at OCR these days, is it a shame that we can't do local OCR with high accuracy in the near-term?
Great example of how information is sometimes compartmentalized arbitrarily in the brain: I imagine you have never been confused by sentences such as “I’m running at 10 km/h”.
Dollar signs go before the number, not after it like units. It needs to be 1000 pages/$1 to make sense, whereas 10km and 10h and 10/h all make sense so 10km/h does. I imagine you would be confused by km/h 10 but not $10.
or when + becomes 4 and isn't caught during review
I wonder if a superimposed copy of the document on the original (with coloring or other highlighting of the diff) would help to catch some important errors.. the model would have to include layout of the copy in addition to the text/images, which I think is a little beyond SOTA but attainable.
Perusing the web site, it's depressing how much behind Mistral is on basic "how can I make this a compelling hook for customers" for the page.
The notebook link? An ACL'd doc
The examples don't even include a small text-to-markdown sample.
The before/after slider is cute, but useless - SxS is a much better way to compare.
Trying it in "Le Chat" requires a login.
It's like an example of "how can we implement maximum loss across our entire funnel". (I have no doubt the underlying tech does well, but... damn, why do you make it so hard to actually see it, Mistral?)
If anybody tried it and has shareable examples - can you post a link? Also, anybody tried it with handwriting yet?
Pretty cool, would love to use this with paperless, but I just can't bring myself to send a photo of all my documents to a third party, especially legal and sensitive documents, which is what I use Paperless for.
Because of that I'm stuck with crappy vision on Ollama (Thanks to AMDs crappy ROCm support for Vllm)
LLM based OCR is a disaster, great potential for hallucinations and no estimate of confidence. Results might seem promising but you’ll always be wondering.
CNN-based OCR also have "hallucinations" and Transformers aren't that much different in that respect. This is a problem solved with domain specific post-processing.
Mistral is Europe based where invoices are more or less sent digitally in like 95% of all the cases anyway. Some are even digital invoices, which will at some point in the eu be mandatory. For orders there are proposals for that, too. And basically invoice data extraction is a different beast.
So an invoice attached to an email as a PDF is sent digitally ... those unfamiliar with PDF will think text and data extraction is trivial then, but this isn't true. You can have a fully digital, non-image PDF that is vector based and has what looks like text, but doesn't have a single piece of extractable text in it. It's all about how the PDF was generated. Tables can be formatted in a million ways, etc.
Your best bet is to always convert it to an image and OCR it to extract structured data.
One use-case is digitising receipts from business related travels for expenses that employees paid for out of their own pocket and which they are submitting pictures to the business for reimbursement.
Bus travels, meals including dinners and snacks, etc. for which the employee has receipts on paper.
Another good example would be contracts of any kind. Imagine photographing a contract (like a car loan) and on the spot getting an AI to read it, understand it, forecast scenarious, highlight red flags, and do some comparison shopping for you.
I wanted to apply OCR to my company's invoicing since they basically did purchasing for a bunch of other large companies, but the variability in the conversion was not tolerable. Even rounding something differently could catch an accountant's eye, let alone detecting a "8" as a "0" or worse.
To be fair: Reading the blog post, the main objective seems to have been to enable information extraction with high confidence for the academic sector (e.g. unlocking all these paper pdfs), and not necessarily to be another receipt scanner.
Agreed. In general I've had such bad performance for complex table based invoice parsing, that every few months I try the latest models to see if its better. It does say "96.12" on top-tier benchmark under the Table category.
Such a shame that PDF doesn’t just, like, include the semantic structure of the document by default. It is brilliant that we standardized on an archival document format that doesn’t include direct access to the document text or structure as a core intrinsic default feature.
I say this with great anger as someone who works in accessibility and has had PDF as a thorn in my side for
30 years.
I agree with this so much. I've tried to sometimes push friends and family to use text formats (at least I sent them something like Markdown), which is very easy to render in the browser anyways. But often you have to fall back to PDF, which I dislike very much. There's so much content like books and papers that are in PDF as well. Why did we pick a binary blob as shareable format again?
Even assuming you could get people to do the work (probably the real issue here) could a single schema syntax capture the semantics of the universe of documents that exist as PDFs? PDFs succeeded because they could reproduce anything.
It's a weird timing because I just launched https://dochq.io - ai document extraction where you can define what you need to get out your documents in plain English, I legitimately thought that this was going to be such a niche product but hell, there has been a very rapid rise for AI-based OCR lately, an article/tweet even went viral 2 weeks ago I think? About using Gemini to do OCR, fun times.
Is this able to convert pdf flowcharts into yaml or json representations of them? I have been experimenting with Claude 3.5. It has been very good at readig / understanding/ converting into representations of flow charts.
So I am wondering if this is more capable. Will try definitely, but maybe someone can chime in.
This is incredibly exciting. I've been pondering/experimenting on a hobby project that makes reading papers and textbooks easier and more effective. Unfortunately the OCR and figure extraction technology just wasn't there yet. This is a game changer.
Specifically, this allows you to associate figure references with the actual figure, which would allow me to build a UI that solves the annoying problem of looking for a referenced figure on another page, which breaks up the flow of reading.
It also allows a clean conversion to HTML, so you can add cool functionality like clicking on unfamiliar words for definitions, or inserting LLM generated checkpoint questions to verify understanding. I would like to see if I can automatically integrate Andy Matuschak's Orbit[0] SRS into any PDF.
Lots of potential here.
[0] https://docs.withorbit.com/
Wait does this deal with images?
The output includes images from the input. You can see that on one of the examples where a logo is cropped out of the source and included in the result.
We're approaching the point where OCR becomes "solved" — very exciting! Any legacy vendors providing pure OCR are going to get steamrolled by these VLMs.
However IMO, there's still a large gap for businesses in going from raw OCR outputs —> document processing deployed in prod for mission-critical use cases. LLMs and VLMs aren't magic, and anyone who goes in expecting 100% automation is in for a surprise.
You still need to build and label datasets, orchestrate pipelines (classify -> split -> extract), detect uncertainty and correct with human-in-the-loop, fine-tune, and a lot more. You can certainly get close to full automation over time, but it's going to take time and effort. But the future is on the horizon!
Disclaimer: I started a LLM doc processing company to help companies solve problems in this space (https://extend.app/)
One problem I’ve encountered at my small startup in evaluating OCR technologies is precisely convincing stakeholders that the “human-in-the-loop” part is both unavoidable, and ultimately beneficial.
PMs want to hear that an OCR solution will be fully automated out-of-the-box. My gut says that anything offering that is snake-oil, and I try to convey that the OCR solution they want is possible, but if you are unwilling to pay the tuning cost, it’s going to flop out of the gate. At that point they lose interest and move on to other priorities.
Yup definitely, and this is exactly why I built my startup. I've heard this a bunch across startups & large enterprises that we work with. 100% automation is an impossible target, because even humans are not 100% perfect. So how we can expect LLMs to be?
But that doesn't mean you have to abandon the effort. You can still definitely achieve production-grade accuracy! It just requires having the right tooling in place, which reduces the upfront tuning cost. We typically see folks get there on the order of days or 1-2 weeks (it doesn't necessarily need to take months).
The challenge I have is how to get bounding boxes for the OCR, for things like redaction/de-identification.
yeah that's a fun challenge — what we've seen work well is a system that forces the LLM to generate citations for all extracted data, map that back to the original OCR content, and then generate bounding boxes that way. Tons of edge cases for sure that we've built a suite of heuristics for over time, but overall works really well.
>> Any legacy vendors providing pure OCR are going to get steamrolled by these VLMs.
-OR- they can just use these APIs, and considering that they have a client base - which would prefer to not rewrite integrations to get the same result - they can get rid of most code base, replace it with llm api and increase margins by 90% and enjoy good life.
I never thought I'd see the day where technology finally advanced far enough that we can edit a PDF.
I never thought driving a car is harder than editing a pdf.
It's not about harder but about what error you can tolerate. Here if you have accuracy 99% for many applications it's enough. If you have 99% accuracy per trip of no crash during self driving then you gonna be dead within a year very likely.
For cars we need accuracy at least 99.99% and that's very hard.
I doubt most people have 99% accuracy. The threshold of tolerance for error is just much lower for any self-driving system (and with good reason, because we're not familiar with them yet).
How do you define 99% accuracy?
I guess something like success rate for a trip (or mile) would be a more reasonable metric. Most people have a success rate far higher than 99% for averages trips.
Most people who commute daily are probably doing something like a 1000 car rides a year and have minor accidents every few years. 99% success rates would mean monthly accidents.
I've been able to edit PDFs (95%+ of them) accurately for the past 10 years...
Foxit PDF exists...
Great progress, but unfortunately, for our use case (converting medical textbooks from PDF to MD), the results are not as good as those by MinerU/PDF-Extract-Kit [1].
Also the collab link in the article is broken, found a functional one [2] in the docs.
[1] https://github.com/opendatalab/MinerU [2] https://colab.research.google.com/github/mistralai/cookbook/...
I've been searching relentlessly for something like this! I wonder why it's been so hard to find... is it the Chinese?
In any case, thanks for sharing.
This is cool! With that said for anyone looking to use this in RAG, the downside to specialized models instead of general VLMs is you can't easily tune it to your use specific case. So for example, we use Gemini to add very specific alt text to images in the extracted Markdown. It's also 2 - 3X the cost of Gemini Flash - hopefully the increased performance is significant.
Regardless excited to see more and more competition in the space.
Wrote an article on it: https://www.sergey.fyi/articles/gemini-flash-2-tips
gemini flash is notorious for hallucinating the output of the OCR, be careful with it. For straight forward, semi-structured, low page count (under 5) it should perform well, but the more the context window is stretched the more the output gets more unreliable
Dang. Super fast and significantly more accurate than google, Claude and others.
Pricing : $1/1000 pages, or per 2k pages if “batched”. I’m not sure what batching means in this case: multiple pdfs? Why not split them to halve the cost?
Anyway this looks great at pdf to markdown.
Batched often means a higher latency option (minutes/hours instead of seconds), which providers can schedule more efficiently on their GPUs.
Batching likely means the response is not real-time. You set up a batch job and they send you the results later.
If only business people I work with would understand 100GB even transfer over the network is not going to return immediately results ;)
That makes sense. Idle time is nearly free after all.
From my testing so far, it seems it's super fast and responded synchronously. But it decided that the entire page is an image and returned `` with coordinates in the metadata for the image, which is the entire page.
Our tool, doctly.ai is much slower and async, but much more accurate and gets you the content itself as an markdown.
I thought we stopped -ly company names ~8 years ago?
Haha for sure. Naming isn't just the hardest problem in computer science, it's always hard. But at some point you just have to pick something and move forward.
if you talk to people gen-x and older, you still need .com domains
for all those people that aren't just clicking on a link on their social media feed, chat group, or targeted ad
May I ask as a layperson, how would you about using this to OCR multiple hundreds of pages? I tried the chat but it pretty much stops after the 2nd page.
You can check the example code on the Mistral documentation, you would _only_ have to change the value of the variable `document_url` to the URL of your uploaded PDF... and you need to change the `MISTRAL_API_KEY` to the value of your specific key that you can get from the Le Platforme webpage.
https://docs.mistral.ai/capabilities/document/#ocr-with-pdf
Thanks!
Submit the pages via the API.
This worked indeed. Although I had to cut my document into smaller chunks. 900 pages at once ended with a timeout.
Usually (With OpenAI, I haven't checked Mistral yet) it means an async api rather than a sync api.
e.g. you submit multiple requests (pdfs) in one call, and get back an id for the batch. You then can check on the status of that batch and get the results for everything when done.
It lets them use their available hardware to it's full capacity much better.
I would assume this is 1 request containing 2k pages vs N requests whose total pages add up to 1000.
I noticed on the Arabic example they lost a space after the first letter on the third to last line, can any native speakers confirm? (I only know enough Arabic to ask dumb questions like this, curious to learn more.)
Edit: it looks like they also added a vowel mark not present in the input on the line immediately after.
Edit2: here's a picture of what I'm talking about, the before/after: https://ibb.co/v6xcPMHv
Arabic speaker here. No, it's perfect.
I am pretty sure it added a kasrah not present in the input on the 2nd to last line. (Not saying it's not super impressive, and also that almost certainly is the right word, but I think that still means not quite "perfect"?)
Yes, it looks like it did add a kasrah to the word ظهري
He means the space between the wāw (و) and the word
I added a pic to the original comment, sorry for not being clear!
Nit: Please change the URL from
https://mistral.ai/fr/news/mistral-ocr
to
https://mistral.ai/news/mistral-ocr
The article is the same, but the site navigation is in English instead of French.
Unless it's a silent statement, of course.
For me, the second page redirects to the first. (And I don't live in France.)
Are there any open source projects with the same goal?
Related, does anyone know of an app that can read gauges from an image and log the number to influx? I have a solar power meter in my crawlspace, it is inconvenient to go down there. I want to point an old phone at it and log it so I can check it easily. The gauge is digital and looks like this:
https://www.pvh2o.com/solarShed/firstPower.jpg
You'll be happier finding a replacement meter that has an interface to monitor it directly or a second meter. An old phone and OCR will be very brittle.
Not OP, but it sounds like the kind of project I’d undertake.
Happiness for me is about exploring the problem within constraints and the satisfaction of building the solution. Brittleness is often of less concern than the fun factor.
And some kinds of brittleness can be managed/solved, which adds to the fun.
I would posit that learning how the device works, and how to integrate with a newer digital monitoring device would be just as interesting and less brittle.
Possibly! But I’ve recently wanted to dabble with computer vision, so I’d be looking at a project like this as a way to scratch a specific itch. Again, not OP so I don’t know what their priorities are, but just offering one angle for why one might choose a less “optimal” approach.
This[1] is something I've come across but not had a chance to play with, designed for reading non-smart meters that might work for you. I'm not sure if there's any way to run it on an old phone though.
[1] https://github.com/jomjol/AI-on-the-edge-device
https://www.home-assistant.io/integrations/seven_segments/
https://www.unix-ag.uni-kl.de/~auerswal/ssocr/
https://github.com/tesseract-ocr/tesseract
https://community.home-assistant.io/t/ocr-on-camera-image-fo...
https://www.google.com/search?q=home+assistant+ocr+integrati...
https://www.google.com/search?q=esphome+ocr+sensor
https://hackaday.com/2021/02/07/an-esp-will-read-your-meter-...
...start digging around and you'll likely find something. HA has integrations which can support writing to InfluxDB (local for sure, and you can probably configure it for a remote influxdb).
You're looking at 1xRaspberry PI, 1xUSB Webcam, 1x"Power Management / humidity management / waterproof electrical box" to stuff it into, and then either YOLO and DIY to shoot over to your influxdb, or set up a Home Assistant and "attach" your frankenbox as some sort of "sensor" or "integration" which spits out metrics and yadayada...
Gemini Free Tier would surely work
4o transcribes it perfectly. You can usually root an old Android and write this app in ~2h with LLMs if unfamiliar. The hard part will be maintaining camera lens cleanliness and alignment etc.
The time cost is so low that you should give it a gander. You'll be surprised how fast you can do it. If you just take screenshots every minute it should suffice.
6 years ago I was working with a very large enterprise that was struggling to solve this problem, trying to scan millions of arbitrary forms and documents per month to clearly understand key points like account numbers, names and addresses, policy numbers, phone numbers, embedded images or scribbled notes, and also draw relationships between these values on a given form, or even across forms.
I wasn't there to solve that specific problem but it was connected to what we were doing so it was fascinating to hear that team talk through all the things they'd tried, from brute-force training on templates (didn't scale as they had too many kinds of forms) to every vendor solution under the sun (none worked quite as advertised on their data)..
I have to imagine this is a problem shared by so many companies.
"World's best OCR model" - that is quite a statement. Are there any well-known benchmarks for OCR software?
We published this benchmark the other week. We'll can update and run with Mistral today!
https://github.com/getomni-ai/benchmark
Excellent. I am looking forward to it.
Came here to see if you all had run a benchmark on it yet :)
It’s interesting that none of the existing models can decode a Scrabble board screen shot and give an accurate grid of characters.
I realize it’s not a common business case, came across it testing how well LLMs can solve simple games. On a side note, if you bypass OCR and give models a text layout of a board standard LLMs cannot solve Scrabble boards but the thinking models usually can.
https://huggingface.co/spaces/echo840/ocrbench-leaderboard
Interesting. But no mistral on it yet?
Wow this basically "solves" DRM for books as well as opening up the door for digitizing old texts more accurately.
Nice demos but I wonder how well it does on longer files. I've been experimenting with passing some fairly neat PDFs to various LLMs for data extraction. They're created from Excel exports and some of the data is cut off or badly laid out, but it's all digitally extractable.
The challenge isn't so much the OCR part, but just the length. After one page the LLMs get "lazy" and just skip bits or stop entirely.
And page by page isn't trivial as header rows are repeated or missing etc.
So far my experience has definitely been that the last 2% of the content still takes the most time to accurately extract for large messy documents, and LLMs still don't seem to have a one-shot solve for that. Maybe this is it?
You will have to send one page at a time, most of this work has to be done via RAG. Adding a large context (like a whole PDF), still does not work that well in my experience.
We developers seem to really dislike PDFs, to a degree that we'll build LLMs and have them translate it into Markdown.
Jokes aside, PDFs really serve a good purpose, but getting data out of them is usually really hard. They should have something like an embedded Markdown version with a JSON structure describing the layout, so that machines can easily digest the data they contain.
I think you might be looking for PDF/A.
https://www.adobe.com/uk/acrobat/resources/document-files/pd...
For example, if you print a word doc to PDF, you get the raw text in PDF form, not an image of the text.
I feel this is created for RAG. I tried a document [0] that I tested with OCR; it got all the table values correctly, but the page's footer was missing.
Headers and footers are a real pain with RAG applications, as they are not required, and most OCR or PDF parsers will return them, and there is extract work to do to remove them.
[0] https://github.com/orasik/parsevision/blob/main/example/Mult...
The new Mistral OCR release looks impressive - 94.89% overall accuracy and significantly better multilingual support than competitors. As someone who's built document processing systems at scale, I'm curious about the real-world implications.
Has anyone tried this on specialized domains like medical or legal documents? The benchmarks are promising, but OCR has always faced challenges with domain-specific terminology and formatting.
Also interesting to see the pricing model ($1/1000 pages) in a landscape where many expected this functionality to eventually be bundled into base LLM offerings. This feels like a trend where previously encapsulated capabilities are being unbundled into specialized APIs with separate pricing.
I wonder if this is the beginning of the componentization of AI infrastructure - breaking monolithic models into specialized services that each do one thing extremely well.
Excited to test this our on our side as well. We recently built an OCR benchmarking framework specifically for VLMs[1][2], so we'll do a test run today.
From our last benchmark run, some of these numbers from Mistral seem a little bit optimistic. Side by side of a few models:
model | omni | mistral |
gemini | 86% | 89% |
azure | 85% | 89% |
gpt-4o | 75% | 89% |
google | 68% | 83% |
Currently adding the Mistral API and we'll get results out today!
[1] https://github.com/getomni-ai/benchmark
[2] https://huggingface.co/datasets/getomni-ai/ocr-benchmark
At my client we want to provide an AI that can retrieve relevant information from documentation (home building business, documents detail how to install a solar panel or a shower, etc) and we've set up an entire system with benchmarks, agents, etc, yet the bottleneck is OCR!
We have millions and millions of pages of documents and an off by 1 % error means it compounds with the AI's own error, which compounds with documentation itself being incorrect at times, which leads it all to be not production ready (and indeed the project has never been released), not even close.
We simply cannot afford to give our customers incorrect informatiin
We have set up a backoffice app that when users have questions, it sends it to our workers along the response given by our AI application and the person can review it, and ideally correct the ocr output.
Honestly after an year of working it feels like AI right now can only be useful when supervised all the time (such as when coding). Otherwise I just find LLMs still too unreliable besides basic bogus tasks.
As someone who has had a home built, and nearly all my friends and acquaintances report the same thing, having a 1% error on information in this business would mean not a 10x but a 50x improvement over the current practice in the field.
If nobody is supervising building documents all the time during the process, every house would be a pile of rubbish. And even when you do stuff stills creeps in and has to be redone, often more than once.
I have done OCR on leases. It’s hard. You have to be accurate and they all have bespoke formatting.
It would almost be easier to switch everyone to a common format and spell out important entities (names, numbers) multiple times similar to how cheques do.
The utility of the system really depends on the makeup of that last 5%. If problematic documents are consistently predictable, it’s possible to do a second pass with humans. But if they’re random, then you have to do every doc with humans and it doesn’t save you any time.
re: real world implications, LLMs and VLMs aren't magi, and anyone who goes in expecting 100% automation is in for a surprise (especially in domains like medical or legal).
IMO there's still a large gap for businesses in going from raw OCR outputs —> document processing deployed in prod for mission-critical use cases.
e.g. you still need to build and label datasets, orchestrate pipelines (classify -> split -> extract), detect uncertainty and correct with human-in-the-loop, fine-tune, and a lot more. You can certainly get close to full automation over time, but it's going to take time and effort.
But for RAG and other use cases where the error tolerance is higher, I do think these OCR models will get good enough to just solve that part of the problem.
Disclaimer: I started a LLM doc processing company to help companies solve problems in this space (https://extend.app/)
I'd love to try it for my domain (regulation), but $1/1000 pages is significantly more expensive than my current local Docling based setup that already does a great job of processing PDF's for my needs.
I think for regulated fields / high impact fields $1/1000 is well-worth the price; if the accuracy is close to 100% this is way better than using people, who are still error-prone
> Has anyone tried this on specialized domains like medical or legal documents?
I’ll try it on a whole bunch of scientific papers ASAP. Quite excited about this.
$1 for 1000 pages seems high to me. Doing a google search
Rent and Reserve NVIDIA A100 GPU 80GB - Pricing Starts from $1.35/hour
I just don't know if in 1 hour and with a A100 I can process more than a 1000 pages. I'm guessing yes.
Is the model Open Source/Weight? Else the cost is for the model, not GPU.
Also interesting to see that parts of the training infrastructure to create frontier models is itself being monetized.
What do you mean by "free"? Using the OpenAI vision API, for example, for OCR is quite a bit more expensive than $1/1k pages.
We’ll just stick LLM Gateway LLM in front of all the specialized LLMs. MicroLLMs Architecture.
I actually think you're onto something there. The "MicroLLMs Architecture" could mirror how microservices revolutionized web architecture.
Instead of one massive model trying to do everything, you'd have specialized models for OCR, code generation, image understanding, etc. Then a "router LLM" would direct queries to the appropriate specialized model and synthesize responses.
The efficiency gains could be substantial - why run a 1T parameter model when your query just needs a lightweight OCR specialist? You could dynamically load only what you need.
The challenge would be in the communication protocol between models and managing the complexity. We'd need something like a "prompt bus" for inter-model communication with standardized inputs/outputs.
Has anyone here started building infrastructure for this kind of model orchestration yet? This feels like it could be the Kubernetes moment for AI systems.
I’m doing this personally for my own project - essentially building an agent graph that starts with the image output, orients and cleans, does a first pass with tesseract LSTM best models to create PDF/HOCR/Alto, then pass to other LLMs and models based on their strengths to further refine towards markdown and latex. My goal is less about RAG database population but about preserving in a non manually typeset form the structure and data and analysis, and there seems to be pretty limited tooling out there since the goal generally seems to be the obviously immediately commercial goal of producing RAG amenable forms that defer the “heavy” side of chart / graphic / tabular reproduction to a future time.
This is already done with agents. Some agents only have tools and the one model, some agents will orchestrate with other LLMs to handle more advanced use cases. It's pretty obvious solution when you think about how to get good performance out of a model on a complex task when useful context length is limited: just run multiple models with their own context and give them a supervisor model—just like how humans organize themselves in real life.
Making Transformers the same cost as CNN's (which are used in character-level ocr, as opposed to image-patch-level) is a good thing. The problem with CNN based character-level OCR is not the recognition models but the detection models. In a former life, I found a way to increase detection accuracy, and, therefore, overall OCR accuracy, and used that as an enhancement on top of Amazon and Google OCR. It worked really well. But the transformer approach is more powerful and if it can be done for $1 per 1000 pages, that is a game changer, IMO, at least of incumbents offering traditional character-level OCR.
It certainly isn't the same cost if expressed as a non-subsidized $$$ one needs for the Transformers compute aka infra.
CNNs trained specifically for OCR can run in real time on as small compute as a mobile device is.
I would like to see how it performs with massively warped and skewed scanned text images, basically a scanned image where the text lines are wavy as opposed as straight horizontal, where the letters are elongated. One where the line widths are different depending on the position on the scanned image. I once had to deal with such a task that somebody gave me with OCR software, Acrobat, and other tools could not decode the mess so I had to recreate the 30 pages myself, manually. Not a fun thing to do but that is a real use case.
Garbage in, garbage out?
"Yes" but if a human could do it "AI" should be able to do it too.
I wonder how good it would be to convert sheet music to MusicXML. All the current tools more or less suck with this task, or maybe I’m just ignorant and don’t know what lego bricks to put together.
Try our machine-learning powered sheet music scanning engine at Soundslice:
https://www.soundslice.com/sheet-music-scanner/
Definitely doesn't suck.
Is there a reliable handwriting OCR benchmark out there (updated, not a blog post)? Despite the gains claimed for printed text, I found (anecdotally) that trying to use Mistral OCR on my messy cursive handwriting to be much less accurate than GPT-4o, in the ballpark of 30% wrong vs closer to 5% wrong for GPT-4o.
Edit: answered in another post: https://huggingface.co/spaces/echo840/ocrbench-leaderboard
Simon Willison linked to an impressive demo of Qwen2-VL in this area: I haven't found a version of it that I could run locally yet to corroborate. https://simonwillison.net/2024/Sep/4/qwen2-vl/
One of my hobby projects while in University was to do OCR on book scans. Doing character recognition was solved, but finding the relationship between characters was very difficult. I tried "primitive" neural nets, but edge cases would often break what I built. Super cool to me to see such an order of magnitude in improvement here.
Does it do hand written notes and annotations? What about meta information like highlighting? I am also curious if LLMs will get better because more access to information if it can be effectively extracted from PDFs.
* Character recognition on monolingual text in a narrow domain is solved
Does this support Japanese? They list a table of language comparisons againat other approaches but I can't tell if it is exhaustive.
I'm hoping that something like this will be able to handle 3000-page Japanese car workshop manuals. Because traditional OCR really struggles with it. It has tables, graphics, text in graphics, the whole shebang.
It will be interesting to see how all the companies in the document processing space adapt as OCR becomes a commodity.
The best products will be defined by everything "non-AI", like UX, performance and reliability at scale, and human-in-the loop feedback for domain experts.
They will offer integrations into enterprise systems, just like they do today.
Lots of big companies don't like change. The existing document processing companies will just silently start using this sort of service to up their game, and keep their existing relationships.
I 100% agree with this, I think you can even extend this to any AI, in the end, IMO, as the llm is more commoditized, the surface of which the value is delivered will matter more
Someone working there has good taste to include a Nizar Qabbani poem.
I was just watching a science-related video containing math equations. I wondered how soon will I be able to ask the video player "What am I looking at here, describe the equations" and it will OCR the frames, analyze them and explain them to me.
It's only a matter of time before "browsing" means navigating HTTP sites via LLM prompts. although, I think it is critical that LLM input should NOT be restricted to verbal cues. Not everyone is an extrovert that longs to hear the sound of their own voices. A lot of human communication is non-verbal.
Once we get over the privacy implications (and I do believe this can only be done by worldwide legislative efforts), I can imagine looking at a "website" or video, and my expressions, mannerisms and gestures will be considered prompts.
At least that is what I imagine the tech would evolve into in 5+ years.
> I wondered how soon will I be able to ask the video player "What am I looking at here, describe the equations" and it will OCR the frames, analyze them and explain them to me.
Seems like https://aiscreenshot.app might fit the bill.
Good lord, I dearly hope not. That sounds like a coddled hellscape world, something you'd see made fun of in Disney's Wall-E.
hence my comment about privacy and need for legislation :)
It isn't the tech that's the problem but the people that will abuse it.
While those are concerns, my point was that having everything on the internet navigated to, digested and explained to me sounds unpleasant and overall a drain on my ability to think and reason for myself.
It is specifically how you describe using the tech that provokes a feeling of revulsion to me.
Then I think you misunderstand. The ML system would know when you want things digested to you or not. Right now companies are assuming this and forcing LLM interaction. But when properly done, the system would know based on your behavior or explicit prompts what you want and provide the service. If you're staring at a paragraph intently and confused, it might start highlighting common phrases or parts of the text/picture that might be hard to grasp and based on your reaction to that, it might start describing things via audio,tool tips,side pane,etc.. In other words, if you don't like how and when you're interacting with the LLM ecosystem, then that is an immature and failing ecosystem, in my vision this would be a largely solved problems, like how we interact with keyboards,mouse and touchscreens today.
Now? OK, you need to screencap and upload to LLM, but that's well established tech by now. (Where by "well established", I mean at least 9 months old ;)
Same goes for "navigating HTTP sites via LLM prompts". Most LLMs have web search integration, and the "Deep Research" variants do more complex navigation.
Video chat is there partially, as well. It doesn't really pay much attention to gestures & expressions, but I'd put the "earliest possible" threshold for that a good chunk closer than 5 years.
Yeah, all these things are possible today, but getting them well polished and integrated is another story. Imagine all this being supported by "HTML6" lol. When apple gets around to making this part of safari, then we know it's ready.
That's a great upper-bound estimator ;)
But kidding aside - I'm not sure people want this being supported by web standards. We could be a huge step closer to that future had we decided to actually take RDF/Dublin Core/Microdata seriously. (LLMs perform a lot better with well-annotated data)
The unanimous verdict across web publishers was "looks like a lot of work, let's not". That is, ultimately, why we need to jump through all the OCR hoops. Not only did the world not annotate the data, it then proceeded to remove as many traces of machine readability as possible.
So, the likely gating factor is probably not Apple & Safari & "HTML6" (shudder!)
If I venture my best bet what's preventing polished integration: It's really hard to do via foundational models only, and the number of people who want to have deep & well-informed conversations via a polished app enough that they're willing to pay for an app that does that is low enough that it's not the hot VC space. (Yet?)
Crystal ball: Some OSS project will probably get within spitting distance of something really useful, but also probably flub the UX. Somebody else will take up these ideas while it's hot and polish it in a startup. So, 18-36 months for an integrated experience from here?
I feel like i can't create an agent with their OCR model yet ? Is it something planned or it's only API?
What do you mean by agent?
La Plateforme agent builder - https://console.mistral.ai/build/agents/new
Dupe of an hour previous post https://news.ycombinator.com/item?id=43282489
I tried with both PDFs and PNGs in Le Chat and the results were the worst I've ever seen when compared to any other model (Claude, ChatGPT, Gemini).
So bad that I think I need to enable the OCR function somehow, but couldn't find it.
It worked perfectly for me with a simple 2 page PDF that contained no graphics or formatting beyond headers and list items. Since it was so small I had the time to proof-read it and there were no errors. It added some formatting, such as bolding headers in list items and putting tics around file and function names. I won't complain.
I'm experiencing the same. Maybe the sentence "Mistral OCR capabilities are free to try on le Chat." was a hallucination.
Co-founder of doctly.ai here (OCR tool)
I love mistral and what they do. I got really excited about this, but a little disappointed after my first few tests.
I tried a complex table that we use as a first test of any new model, and Mistral OCR decided the entire table should just be extracted as an 'image' and returned this markdown:
```  ```
I'll keep testing, but so far, very disappointing :(
This document I try is the entire reason we created Doctly to begin with. We needed an OCR tool for regulatory documents we use and nothing could really give us the right data.
Doctly uses a judge, OCRs a document against multiple LLMs and decides which one to pick. It will continue to run the page until the judge scores above a certain score.
I would have loved to add this into the judge list, but might have to skip it.
Where did you test it? At the end of the post they say:
> Mistral OCR capabilities are free to try on le Chat
but when asked, Le Chat responds:
> can you do ocr?
> I don't have the capability to perform Optical Character Recognition (OCR) directly. However, if you have an image with text that you need to extract, you can describe the text or provide details, and I can help you with any information or analysis related to that text. If you need OCR functionality, you might need to use a specialized tool or service designed for that purpose.
Edit: Tried anyway by attaching an image; it said it could do OCR and then output... completely random text that had absolutely nothing to do with the text in the image!... Concerning.
Tried again with a better definition image, output only the first twenty words or so of the page.
Did you try using the API?
Yes I used the API. They have examples here:
https://docs.mistral.ai/capabilities/document/
I used base64 encoding of the image of the pdf page. The output was an object that has the markdown, and coordinates for the images:
[OCRPageObject(index=0, markdown='', images=[OCRImageObject(id='img-0.jpeg', top_left_x=140, top_left_y=65, bottom_right_x=2136, bottom_right_y=1635, image_base64=None)], dimensions=OCRPageDimensions(dpi=200, height=1778, width=2300))] model='mistral-ocr-2503-completion' usage_info=OCRUsageInfo(pages_processed=1, doc_size_bytes=634209)
Interestingly I’m currently going through and scanning the hundreds of journal papers my grandfather authored in medicine and thinking through what to do about graphs. I was expecting to do some form of multiphase agent based generation of LaTeX or SVG rather than a verbal summary of the graphs. At least in his generation of authorship his papers clearly explained the graphs already. I was pretty excited to see your post naturally but when I looked at the examples what I saw was, effectively, a more verbose form of
```  ```
I’m assuming this is partially because your use case is targeting RAG under various assumptions bur also partially because multimodal models aren’t near what I would need to be successful with?
We need to update the examples on the front page. Currently for things that are considered charts/graphs/figures we convert to a description. For things like logos or images we do an image tag. You can also choose to exclude them.
The difference with this is that it took the entire page as an image tag (it's just a table of text in my document). rather than being more selective.
I do like that they give you coordinates for the images though, we need to do something like that.
Give the actual tool a try. Would love to get your feedback for that use case. It gives you 100 free credits initially but if you email me (ali@doctly.ai), I can give you an extra 500 (goes for anyone else here also)
If you have a judge system, and Mistral performs well on other tests, wouldn't you want to include it so if it scores the highest by your judges ranking it would select the most accurate result? Or are you saying that mistral's image markdown would score higher on your judge score?
We'll definitely be doing more tests, but the results I got on the complex tests would result in a lower score and might not be worth the extra cost of the judgement itself.
In our current setup Gemini wins most often. We enter multiple generations from each model into the 'tournament', sometimes one generation from gemini could be at the top while another in the bottom, for the same tournament.
Does doctly do handwritten forms like dates?
I have a lot of "This document filed and registered in the county of ______ on ______ of _____ 2023" sort of thing.
We've been getting great results with those aswell. But ofcourse there is always some chance of not getting it perfect, specially with different handwritings.
Give it a try, no credit cards needed to try it. If you email me (ali@doctly.ai) i can give you extra free credits for testing.
Just tried it. Got all the dates correct and even extracted signatures really well.
Now to figure out how many millions of pages I have.
Why pay more for doctly than an AWS Textract?
Great question. The language models are definitely beating the old tools. Take a look at Gemini for example.
Doctly runs a tournament style judge. It will run multiple generations across LLMs and pick the best one. Outperforming single generation and single model.
Would love to see the test file.
would be glad to see benchmarking results
This is a good idea. We should publish a benchmark results/comparison.
But what's the need exactly for OCR when you have multimodal LLMs that can read the same info and directly answer any questions about it ?
For a VLLM, my understanding is that OCR corresponds to a sub-field of questions, of the type 'read exactly what's written in this document'.
Getting PDFs into #$@ Confluence apparently. Just had to do this and Mistral saved me a ton of hassle compared to this: https://community.atlassian.com/forums/Confluence-questions/...
The biggest risk of vision LLMs for OCR is that they might accidentally follow instructions is the text that they are meant to be processing.
(I asked Mistral if their OCR system was vulnerable to this and they said "should be robust, but curious to see if you find any fun examples" - https://twitter.com/simonw/status/1897713755741368434 and https://twitter.com/sophiamyang/status/1897719199595720722 )
It's useful to have the plain text down the line for operations not involving a language model (e.g. search). Also if you have a bunch of prompts you want to run it's potentially cheaper, although perhaps less accurate, to run the OCR once and save yourself some tokens or even use a smaller model for subsequent prompts.
Tons of uses: Storage (text instead of images), search (user typing in a text box and you want instant retrieval from a dataset), etc. And costs: run on images once - then the rest of your queries will only need to run on text.
> It takes images and PDFs as input
If you are working with PDF, I would suggest a hybrid process.
It is feasible to extract information with 100% accuracy from PDFs that were generated using the mappable acrofields approach. In many domains, you have a fixed set of forms you need to process and this can be leveraged to build a custom tool for extracting the data.
Only if the PDFs are unknown or were created by way of a cellphone camera, multifunction office device, etc should you need to reach for OCR.
The moment you need to use this kind of technology you are in a completely different regime of what the business will (should) tolerate.
> Only if the PDFs are unknown or were created by way of a cellphone camera, multifunction office device, etc should you need to reach for OCR.
It's always safer to OCR on every file. Sometimes you'll have a "clean" pdf that has a screenshot of an Excel table. Or a scanned image that has already been OCR'd by a lower quality tool (like the built in Adobe OCR). And if you rely on this you're going to get pretty unpredictable results.
It's way easier (and more standardized) to run OCR on every file, rather than trying to guess at the contents based on the metadata.
It's not guessing if the form is known and you can read the information directly.
This is a common scenario at many banks. You can expect nearly perfect metadata for anything pushed into their document storage system within the last decade.
Oh yea if the form is known and standardized everything is a lot easier.
But we work with banks on our side, and one of the most common scenarios is customers uploading financials/bills/statements from 1000's of different providers. In which case it's impossible to know every format in advance.
The hard ones are things like contracts, leases, and financial documents which 1) don’t have a common format 2) are filled with numbers proper nouns and addresses which it’s really important not to mess up 3) cannot be inferred from context.
Typical OCR pipeline would be to pass the doc through a character-level OCR system then correct errors with a statistical model like an LLM. An LLM can help correct “crodit card” to “credit card” but it cannot correct names or numbers. It’s really bad if it replaces a 7 with a 2.
I wonder how it compares to USPS workers at deciphering illegible handwriting.
What's the general time for something like this to hit openrouter? I really hate having accounts everywhere when I'm trying to test new things.
Is this free in LeChat? I uploaded a handwritten text and it stopped after the 4th word.
This is $1 per 1000 pages. For comparison, Azure Document Intelligence is $1.5/1000 pages for general OCR and $30/1000 pages for “custom extraction”.
Given the wide variety of pricing on all of these providers, I keep wondering how the economics work. Do they have fantastic margin on some of these products or is it a matter of subsidizing the costs, hoping to capture the market? Last I heard, OpenAI is still losing money.
Looks good but in the first hover/slider demo one can see how it could lead to confusion when handling side by side content.
Table 1 is referred to in section `2 Architectural details` but before `2.1 Multimodal Decoder`. In the generated markdown though it is below the latter section, as if it was in/part of that section.
Of course I am nitpicking here but just the first thing I noticed.
Does anything handle dual columns well? Despite being the academic standard, it seemingly throws off every generic tool.
Its funny how Gemini consistently beats googles dedicated document API.
I'm not surprised honestly - it's just the newer better things vs their older offering
How does one use it to identify bounding rectangles of images/diagrams in the PDF?
Tried with a few historical handwritten German documents, accuracy was abysmal.
HTR ( Handwritten Text Recognition ) is a completely different space than OCR. What were you expecting exactly?
It fits the "use cases" mentioned in the article
> Preserving historical and cultural heritage: Organizations and nonprofits that are custodians of heritage have been using Mistral OCR to digitize historical documents and artifacts, ensuring their preservation and making them accessible to a broader audience.
There is a difference between historical document and "my doctor prescription".
Someone coming here and saying it does not work with my old german hanwriting doesn't say much.
Probably they are overfitting the benchmarks, since other users also complain of the low accuracy
Optical Character Recognition (OCR) and Handwritten Text Recognition (HTR) are different tasks
Le chat doesn’t seem to know about this change despite the blog post stating it. Can anyone explain how to use it in Le Chat?
I asked LeChat this question:
If I upload a small PDF to you are you able to convert it to markdown?
LeChat said yes and away we went.
Looks to be API only for now. Documentation here: https://docs.mistral.ai/capabilities/document/
Is this model open source?
No (nor is it open-weights).
Bit unrelated but is there anything that can help with really low resolution text? My neighbor got hit and run the other day for example, and I've been trying every tool I can to make out some of the letters/numbers on the plate
https://ibb.co/mr8QSYnj
Finding the right subreddit and asking there is probably a better approach if you want to maximize the chances of getting the plate 'decrypted'.
If it’s a video, sharing a few frames can help as well
To even get started on this you'd also need to share some contextual information like continent, country etc. I'd say.
Its in CA, looks like paper plates which follow a specific format and the last two seem to be the numbers '64'. Police should be able to search for temp tag with partial match and match the make/model. Was curious to see if any software could help though
Looks like a paper temp tag. Other than that, I'm not sure much can be had from it.
There are photo enhancers online. But your picture is way too pixelated to get any useful info from it.
If you know the font in advance (which you often do in these cases) you can do insane reconstructions. Also keep in mind that it doesn't have to be a perfect match, with the help of the color and other facts (such as likely location) about the car you can narrow it down significantly.
Maybe if you had multiple frames, and used something very clever?
So, the only thing that stopped AI from learning from all our science and taking over the world was the difficulty of converting PDFs of academic papers to more computer readable formats.
Not anymore.
Just tested with a multilingual (bidi) English/Hebrew document.
The Hebrew output had no correspondence to the text whatsoever (in context, there was an English translation, and the Hebrew produced was a back-translation of that).
Their benchmark results are impressive, don't get me wrong. But I'm a little disappointed. I often read multilingual document scans in the humanities. Multilingual (and esp. bidi) OCR is challenging, and I'm always looking for a better solution for a side-project I'm working on (fixpdfs.com).
Also, I thought OCR implied that you could get bounding boxes for text (and reconstruct a text layer on a scan, for example). Am I wrong, or is this term just overloaded, now?
You can get bounding boxes from our pdf api at Mathpix.com
Disclaimer, I’m the founder
Mathpix is ace. That’s the best results I got so far for scientific papers and reports. It understands the layout of complex documents very well, it’s quite impressive. Equations are perfect, figures extraction works well.
There are a few annoying issues, but overall I am very happy with it.
Curious to see how this performance against more real world usage of someone taking a photo of text (which the text then becomes slightly blurred) and performing OCR on it.
I can't exactly tell if the "Mistral 7B" image is an example of this exact scenario.
A similar but different product that was discussed on HN is OlmOCR from AI2, which is open source:
https://news.ycombinator.com/item?id=43174298
I don't need AGI just give me superhuman OCR so we can turn all existing pdfs into text* and cheaply host it.
Feels like we are almost there.
*: https://annas-archive.org/blog/critical-window.html
It's disappointing to see that the benchmark results are so opaque. I hope we see reproducible results soon, and hopefully from Mistral themselves.
1. We don't know what the evaluation setup is. It's very possible that the ranking would be different with a bit of prompt engineering.
2. We don't know how large each dataset is (or even how the metrics are calculated/aggregated). The metrics are all reported as XY.ZW%, but it's very possible that the .ZW% -- or even Y.ZW% -- is just noise.[1]
3. We don't know how the datasets were mined or filtered. Mistral could have (even accidentally!) filtered out particularly data points that their model struggled with. (E.g., imagine good-meaning engineer testing a document with Mistral OCR first, finding it doesn't work, and deducing that it's probably bad data and removing it.)
[1] https://medium.com/towards-data-science/digit-significance-i...
Oh - on premise solution - awesome!
Given the fact that multi-modal LLMs are getting so good at OCR these days, is it a shame that we can't do local OCR with high accuracy in the near-term?
They say: "releasing the API mistral-ocr-latest at 1000 pages / $"
I had to reread that a few times. I assume this means 1000pg/$1 but I'm still not sure about it.
Great example of how information is sometimes compartmentalized arbitrarily in the brain: I imagine you have never been confused by sentences such as “I’m running at 10 km/h”.
Dollar signs go before the number, not after it like units. It needs to be 1000 pages/$1 to make sense, whereas 10km and 10h and 10/h all make sense so 10km/h does. I imagine you would be confused by km/h 10 but not $10.
Yeah you can read it as "pages per dollar" or as a unit "pages/$", it all comes out the same meaning.
Hmm, can it read small print? ;)
Ya, presumably it is missing the number `1.00`.
Not really. When you go 60 mph (or km/h) you don't specify the 1.00 for the hours either. pages/$ is the unit, 1000 is the value.
I'm using gemini to solve textual CAPTCHA with some good results (better than untrained OCR).
I will give this a shot
Wonder how it does with table data in pdfs / page-long tabular data?
Is this burying the lede? OCR is a solved problem, but structuring document data from scans isn't.
Ohhh. Gonna test it out with some 100+ year old scribbles :)
1. There’s no simple page / sandbox to upload images and try it. Fine, I’ll code it up.
2. “Explore the Mistral AI APIs” (https://docs.mistral.ai) links to all apis except OCR.
3. The docs on the api params refer to document chunking and image chunking but no details on how their chunking works?
So much unnecessary friction smh.
There is an OCR page on the link you provided. It includes a very, very simple curl command (like most of their docs).
I think the friction here exists outside of Mistral's control.
> There is an OCR page on the link you provided.
I don’t see it either. There might be some caching issue.
> "Fastest in its category"
Not one mention of the company that they have partnered with and that is Cerebras AI and that is the reason they have fast inference [0]
Literally no-one here is talking about them and they are about to IPO.
[0] https://cerebras.ai/blog/mistral-le-chat
A great question for people wanting to use OCR in business is... Which digits in monetary amounts can you tolerate being incorrect?
or when + becomes 4 and isn't caught during review
I wonder if a superimposed copy of the document on the original (with coloring or other highlighting of the diff) would help to catch some important errors.. the model would have to include layout of the copy in addition to the text/images, which I think is a little beyond SOTA but attainable.
Document processing is where b2b SAAS is at.
I'm happy to see this development after being underwhelmed with Chatgpt OCR!
Perusing the web site, it's depressing how much behind Mistral is on basic "how can I make this a compelling hook for customers" for the page.
The notebook link? An ACL'd doc
The examples don't even include a small text-to-markdown sample.
The before/after slider is cute, but useless - SxS is a much better way to compare.
Trying it in "Le Chat" requires a login.
It's like an example of "how can we implement maximum loss across our entire funnel". (I have no doubt the underlying tech does well, but... damn, why do you make it so hard to actually see it, Mistral?)
If anybody tried it and has shareable examples - can you post a link? Also, anybody tried it with handwriting yet?
Pretty cool, would love to use this with paperless, but I just can't bring myself to send a photo of all my documents to a third party, especially legal and sensitive documents, which is what I use Paperless for.
Because of that I'm stuck with crappy vision on Ollama (Thanks to AMDs crappy ROCm support for Vllm)
Really cool, thanks Mistral!
For general use this will be good.
But I bet that simple ML will lead to better OCRs when you are doing anything specialized, such as, medical documents, invoices etc.
LLM based OCR is a disaster, great potential for hallucinations and no estimate of confidence. Results might seem promising but you’ll always be wondering.
CNN-based OCR also have "hallucinations" and Transformers aren't that much different in that respect. This is a problem solved with domain specific post-processing.
well already in 2013 ocr systems used in xerox scanners (turned on by default!) randomly altered numbers, so its not an issue only occuring in llms.
Congrats to Mistral for yet again releasing another closed source thing that costs more than running an open source equivalent:
https://github.com/DS4SD/docling
Back in my days Mistral used to torrent models.
I am all for open source, but where do you see benchmarks that conclude that it's just equivalent?
It's shocking how much our industry fails to see past its own nose.
Not a single example on that page is a Purchase Order, Invoice etc. Not a single example shown is relevant to industry at scale.
Mistral is Europe based where invoices are more or less sent digitally in like 95% of all the cases anyway. Some are even digital invoices, which will at some point in the eu be mandatory. For orders there are proposals for that, too. And basically invoice data extraction is a different beast.
So an invoice attached to an email as a PDF is sent digitally ... those unfamiliar with PDF will think text and data extraction is trivial then, but this isn't true. You can have a fully digital, non-image PDF that is vector based and has what looks like text, but doesn't have a single piece of extractable text in it. It's all about how the PDF was generated. Tables can be formatted in a million ways, etc.
Your best bet is to always convert it to an image and OCR it to extract structured data.
One use-case is digitising receipts from business related travels for expenses that employees paid for out of their own pocket and which they are submitting pictures to the business for reimbursement.
Bus travels, meals including dinners and snacks, etc. for which the employee has receipts on paper.
Can confirm, in Italy electronic invoicing is mandatory since 2019
even in Europe this is still a thing, I know of systems which still are unable to read items having more than one line (costing s sh*tload of money)
Another good example would be contracts of any kind. Imagine photographing a contract (like a car loan) and on the spot getting an AI to read it, understand it, forecast scenarious, highlight red flags, and do some comparison shopping for you.
I wanted to apply OCR to my company's invoicing since they basically did purchasing for a bunch of other large companies, but the variability in the conversion was not tolerable. Even rounding something differently could catch an accountant's eye, let alone detecting a "8" as a "0" or worse.
Fwiw, they have an example of a parking receipt in a cookbook: https://colab.research.google.com/github/mistralai/cookbook/...
To be fair: Reading the blog post, the main objective seems to have been to enable information extraction with high confidence for the academic sector (e.g. unlocking all these paper pdfs), and not necessarily to be another receipt scanner.
Businesses at scale use EDI to handle purchase orders and invoices, no OCR needed.
Thats simply not a factual statement.
Scaled businesses do USE edi, but they still receive hundreds of thousands of PDF documents a month
source: built a saas product that handles pdfs for a specific industry
Agreed. In general I've had such bad performance for complex table based invoice parsing, that every few months I try the latest models to see if its better. It does say "96.12" on top-tier benchmark under the Table category.
We find CV models to be better (higher midpoint on an ROC curve) for the types of docs you mention.
Such a shame that PDF doesn’t just, like, include the semantic structure of the document by default. It is brilliant that we standardized on an archival document format that doesn’t include direct access to the document text or structure as a core intrinsic default feature.
I say this with great anger as someone who works in accessibility and has had PDF as a thorn in my side for 30 years.
I agree with this so much. I've tried to sometimes push friends and family to use text formats (at least I sent them something like Markdown), which is very easy to render in the browser anyways. But often you have to fall back to PDF, which I dislike very much. There's so much content like books and papers that are in PDF as well. Why did we pick a binary blob as shareable format again?
Even assuming you could get people to do the work (probably the real issue here) could a single schema syntax capture the semantics of the universe of documents that exist as PDFs? PDFs succeeded because they could reproduce anything.
html
Tables? I regularly run into PDFs where even the body text is mangled!
PDF is pretty strictly modeled on printed documents and their mainstream typography at the time of invention of Postscript and so on.
Printed documents do not have any structure beyond the paper and placement of ink on them.
[dead]
[flagged]
No comments there yet - this at the top of the home page, let’s use this one.
Don't need french /fr in the url. That is the one.
[dead]
It's a weird timing because I just launched https://dochq.io - ai document extraction where you can define what you need to get out your documents in plain English, I legitimately thought that this was going to be such a niche product but hell, there has been a very rapid rise for AI-based OCR lately, an article/tweet even went viral 2 weeks ago I think? About using Gemini to do OCR, fun times.