Show HN: Zerox – Document OCR with GPT-mini

github.com

246 points by themanmaran 2 months ago

This started out as a weekend hack with gpt-4o-mini, using the very basic strategy of "just ask the AI to OCR the document".

But it turned out to perform better than our current Unstructured/Textract implementation, at pretty much the same cost.

I've tested almost every variant of document OCR over the past year, especially for things like table and chart extraction, and I've found that rules-based extraction has always been lacking. Documents are meant to be a visual representation, after all, with weird layouts, tables, charts, etc. Using a vision model just makes sense!

In general, I'd categorize this solution as slow, expensive, and non-deterministic. But 6 months ago it was impossible. And 6 months from now it'll be fast, cheap, and probably more reliable!

serjester a month ago

It should be noted that, for some reason, OpenAI prices GPT-4o-mini image requests at the same rate as GPT-4o [1]. I have a similar library [2], and we found OpenAI has subtle OCR inconsistencies with tables (numbers will be inaccurate). Gemini Flash, for all its faults, seems to do really well as a replacement while being significantly cheaper.

Here’s our pricing comparison:

*Gemini Pro* - $0.66 per 1k image inputs (batch) - $1.88 per text output (batch API, 1k tokens) - 395 pages per dollar

*Gemini Flash* - $0.066 per 1k images (batch) - $0.53 per text output (batch API, 1k tokens) - 1693 pages per dollar

*GPT-4o* - $1.91 per 1k images (batch) - $3.75 per text output (batch API, 1k tokens) - 177 pages per dollar

*GPT-4o-mini* - $1.91 per 1k images (batch) - $0.30 per text output (batch API, 1k tokens) - 452 pages per dollar

[1] https://community.openai.com/t/super-high-token-usage-with-g...

[2] https://github.com/Filimoa/open-parse

  • themanmaran a month ago

    Interesting. It didn't seem like gpt-4o-mini was priced the same as gpt-4o during our testing. We're relying on the OpenAI usage page, of course, which doesn't give request-by-request pricing. But we didn't see any huge usage spike after testing all weekend.

    For our testing we ran a 1000-page document set, all treated as images. We got to about 25M input / 0.4M output tokens for the 1000 pages, which would be a pretty noticeable difference based on the listed token prices:

    gpt-4o-mini => (24M/1M * $0.15) + (0.4M/1M * $0.60) = $3.84

    gpt-4o => (24M/1M * $5.00) + (0.4M/1M * $15.00) = $126.00

    • serjester a month ago

      The pricing is strange because the same images will use up 30X more tokens with mini. They even show this in the pricing calculator.

      [1] https://openai.com/api/pricing/

      • elvennn a month ago

        Indeed it does. But the output tokens of the OCR are also cheaper, so in total it's still much cheaper with gpt-4o-mini.

  • raffraffraff a month ago

    That price compares favourably with AWS Textract. Has anyone compared their performance? A recent post about OCR had Textract at or near the top in terms of quality.

    • ianhawes a month ago

      Can you locate that post? In my own experience, Google Document AI has superior quality but I'm looking for something a bit more objective and scientific.

    • aman2k4 a month ago

      I'm using AWS Textract for scanning grocery receipts and I find it does this very well and fast. Can you say which performance metric you have in mind?

8organicbits 2 months ago

I'm surprised by the name choice, there's a large company with an almost identical name that has products that do this. May be worth changing it sooner rather than later.

https://duckduckgo.com/?q=xerox+ocr+software&t=fpas&ia=web

  • ot a month ago

    > there's a large company with an almost identical name

    Are you suggesting that this wasn't intentional? The name is clearly a play on "zero shot" + "xerox".

    • UncleOxidant a month ago

      I think they're suggesting that Xerox will likely sue them so might as well get ahead of that and change the name now.

      • 8organicbits a month ago

        Even if they don't sue, do you really want to deal with people getting confused and thinking you mean one of the many pre-existing OCR tools that Xerox produces? A search for "Zerox OCR" will lead to Xerox products, for example. Not worth the headache.

        https://duckduckgo.com/?q=Zerox+OCR

    • themanmaran a month ago

      Yup definitely a play on the name. Also the idea of photocopying a page, since we do pdf => image => markdown.

      We're not planning to name a company after it or anything, just the open-source tool. And if Xerox sues I'm sure we could rename the repo lol.

      • ssl-3 a month ago

        I was involved in a somewhat similar trademark issue once.

        I actually had a leg to stand on (my use was not infringing at all when I started using it), and I came out of it somewhat cash-positive, but I absolutely never want to go through anything like that ever again.

        > Yup definitely a play on the name. Also the idea of photocopying a page,

        But you? My God, man.

        With these words you have already doomed yourself.

        Best wishes.

        • neilv a month ago

          > With these words you have already doomed yourself.

          At least they didn't say "xeroxing a page".

      • wewtyflakes a month ago

        It still seems reasonable someone may be confused, especially since the one letter of the company name that was changed has identical pronunciation (x --> z). It is like offering "Phacebook" or "Netfliks" competitors, but even less obviously different.

        • qingcharles a month ago

          Surprisingly, http://phacebook.com/ is for sale.

          • austinjp a month ago

            From personal experience, I'd wager that anyone buying that domain will receive a letter from a Facebook lawyer pretty quickly.

      • haswell a month ago

        If they sue, this comment will be used to make their case.

        I guess I just don’t understand - how are you proceeding as if this is an acceptable starting point?

        With all respect, I don’t think you’re taking this seriously, and it reflects poorly on the team building the tool. It looks like this is also a way to raise awareness for Omni AI? If so, I’ve gotta be honest - this makes me want to steer clear.

        Bottom line, it’s a bad idea/decision. And when bad ideas are this prominent, it makes me question the rest of the decisions underlying the product and whether I want to be trusting those decision makers in the many other ways trust is required to choose a vendor.

        Not trying to throw shade; just sharing how this hits me as someone who has built products and has been the person making decisions about which products to bring in. Start taking this seriously for your own sake.

      • ned_at_codomain a month ago

        I would happily contribute to the legal defense fund.

  • blacksmith_tb a month ago

    If imitation is the sincerest form of flattery, I'd have gone with "Xorex" myself.

    • kevin_thibedeau a month ago

      We'll see what the new name is when the C&D is delivered.

  • HumblyTossed a month ago

    I'm sure that was on purpose.

    Edit: Reading the comments below, yes, it was.

    Very disrespectful behavior.

  • 627467 a month ago

    the commercial service is called OmniAI. zerox is just the name of a component (github repo, library) in a possible software stack.

    Am I the only one who finds these sorts of takes silly in a globalized world with instant communications? There are so many things to be named, everything named is instantly visible around the world, and there are so many jurisdictions to cover, not all providing the same levels of protection to "trademarks".

    Are we really suggesting this issue is worth defending and spending resources on?

    What is the ground for confusion here? That a developer stumbles on here and thinks zerox is developed/maintained by Xerox? This developer gets confused but won't simply check who owns the repository? What if there's a variable called zerox?

    I mean, I get it: the whole point of IP at this point is really just to create revenue streams for the legal/admin industry, so we should all be scared and spend unproductive time naming a software dependency.

    • 8organicbits a month ago

      > Are we really suggesting this issue is worth defending and spending resources on?

      Absolutely.

      Sure, sometimes non-competing products have the same name. Or products sold exclusively in one country use the same name as a competitor in a different country. There are also companies that don't trademark or protect their names. Often no one even notices the shared name.

      That's not what's happening here. Xerox is famously litigious about their trademark; it's often used as a case study. The product competes with Xerox OCR products in the same countries.

      It's a strange thing to be cavalier about and to openly document intent to use a sound-alike name. Besides, do you really want people searching for "Zerox OCR" to land on a Xerox page? There's no shortage of other names.

    • HumblyTossed a month ago

      > so we should all be scared and spend unproductive time naming a software dependency

      All 5 minutes it would take to name it something else?

  • pkaye 2 months ago

    Maybe call it ZeroPDF?

  • froh 2 months ago

    gpterox

hugodutka a month ago

I used this approach extensively over the past couple of months with GPT-4 and GPT-4o while building https://hotseatai.com. Two things that helped me:

1. Prompt with examples. I included an example image with an example transcription as part of the prompt. This made GPT make fewer mistakes and improved output accuracy.

2. Confidence score. I extracted the embedded text from the PDF and compared the frequency of character triples in the source text and GPT's output. If there was a significant difference (less than 90% overlap) I would log a warning. This helped detect cases when GPT omitted entire paragraphs of text. A rough sketch follows below.
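
A minimal sketch of that confidence check, using sets of trigrams rather than full frequency counts for brevity (`pdf_text` and `gpt_text` are placeholders):

    import re

    def trigrams(text: str) -> set[str]:
        # normalize: lowercase, strip non-alphanumerics, then take character triples
        normalized = re.sub(r"[^a-z0-9]", "", text.lower())
        return {normalized[i:i + 3] for i in range(len(normalized) - 2)}

    def overlap(source: str, output: str) -> float:
        src = trigrams(source)
        return len(src & trigrams(output)) / len(src) if src else 1.0

    if overlap(pdf_text, gpt_text) < 0.9:  # significant difference
        print("warning: model output may have dropped or altered text")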

  • themanmaran a month ago

    One option we've been testing is the `maintainFormat` mode. This tries to return the markdown in a consistent format by passing the output of a prior page in as additional context for the next page. Especially useful if you've got tables that span pages. The flow is pretty much the following (sketched in code after the list):

    - Request #1 => page_1_image

    - Request #2 => page_1_markdown + page_2_image

    - Request #3 => page_2_markdown + page_3_image
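
    A Python sketch of that loop (`transcribe_page`, the prompt wording, and `page_image_urls` are illustrative stand-ins, not the library's exact internals):

        from openai import OpenAI

        client = OpenAI()

        def transcribe_page(image_url, prior_markdown=None):
            messages = [{"role": "system", "content": "Convert the following PDF page to markdown."}]
            if prior_markdown:
                messages.append({
                    "role": "system",
                    "content": f'Markdown must maintain consistent formatting with the following page:\n\n"""{prior_markdown}"""',
                })
            messages.append({
                "role": "user",
                "content": [{"type": "image_url", "image_url": {"url": image_url}}],
            })
            response = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
            return response.choices[0].message.content

        prior = None
        pages = []
        for url in page_image_urls:  # one image per page, in order
            prior = transcribe_page(url, prior)
            pages.append(prior)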

  • sidmitra a month ago

    > frequency of character triples

    What are character triples? Are they trigrams?

    • hugodutka a month ago

      I think so. I'd normalize the text first: lowercase it and remove all non-alphanumeric characters. E.g., for the phrase "What now?" I'd create these trigrams: wha, hat, atn, tno, now.

  • nbbaier a month ago

    > I extracted the embedded text from the PDF

    What did you use to extract the embedded text during this step, other than some other OCR tech?

    • hugodutka a month ago

      PyMuPDF, a PDF library for Python.
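
      A minimal sketch of that step (PyMuPDF imports as `fitz`; the filename is a placeholder):

          import fitz  # PyMuPDF

          with fitz.open("document.pdf") as doc:
              embedded_text = "\n".join(page.get_text() for page in doc)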

      • jimmySixDOF a month ago

        A different approach from vanilla OCR/parsing seems to be ColPali, which integrates a purpose-built small vision model with ColBERT-style indexing for retrieval. So, if search is the intended use case, it can skip the whole OCR step entirely.

        [1] https://huggingface.co/blog/manu/colpali

jerrygenser a month ago

I would categorize Azure Document AI accuracy as high, not "mid", including handwriting. However, the $1.50/1,000-pages tier doesn't include layout detection.

The $10/1000 pages model includes layout detection (headers, etc.) as well as key-value pairs and checkbox detection.

I have continued to do proofs of concept with Gemini and GPT, and in general any new multimodal model that comes out, but have found they are not on par with Azure's checkbox detection.

In fact, the results from Gemini/GPT-4 aren't even good enough to use as a teacher for distillation of a "small" multimodal model specializing in layout/checkbox detection.

I would also like to shout out Surya OCR, which is up and coming. It's source-available and free under a certain funding or revenue milestone - I think $5M. It doesn't have word-level detection yet, but it's one of the more promising OCR tools I'm aware of outside the hyperscalers and heavy commercial vendors.

  • ianhawes a month ago

    Surya OCR is great in my test use cases! Hoping to try it out in production soon.

ndr_ a month ago

Prompts in the background:

  const systemPrompt = `
    Convert the following PDF page to markdown. 
    Return only the markdown with no explanation text. 
    Do not exclude any content from the page.
  `;
For each subsequent page:

  messages.push({
    role: "system",
    content: `Markdown must maintain consistent formatting with the following page: \n\n """${priorPage}"""`,
  });

Could be handy for general-purpose frontend tools.

  • markous a month ago

    so this is just a wrapper around gpt-4o mini?

beklein 2 months ago

Very interesting project, thank you for sharing.

Are you supporting the Batch API from OpenAI? This would lower costs by 50%. Many OCR tasks are not time-sensitive, so this might be a very good tradeoff.

  • themanmaran a month ago

    That's definitely the plan. Using batch requests would move this closer to the $2/1,000-pages mark, which is effectively the AWS pricing.

surfingdino a month ago

Xerox tried it a while ago. It didn't end well https://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres...

  • merb a month ago

    > This is not an OCR problem (as we switched off OCR on purpose)

    • yjftsjthsd-h a month ago

      It also says

      > This is not an OCR problem, but of course, I can't have a look into the software itself, maybe OCR is still fiddling with the data even though we switched it off.

      But the point stands either way; LLMs are prone to hallucinations already, so I would not trust them to not make a mistake in OCR because they thought the page would probably say something different than it does.

      • mlyle a month ago

        > It also says...

        It was a problem with employing the JBIG2 compression codec, which cuts and pastes things from different parts of the page to save space.

        > But the point stands either way; LLMs are prone to hallucinations already, so I would not trust them to not make a mistake in OCR because they thought the page would probably say something different than it does.

        Anyone trying to solve for the contents of a page uses context clues. Even humans reading.

        You can OCR raw characters (performance is poor); use letter frequency information; use a dictionary; use word frequencies; or use even more context to know what content is more likely. More context is going to result in many fewer errors (of course, it may result in a bigger proportion of the remaining errors seeming to have significant meaning changes).

        A small LLM is just a good way to encode this kind of "how likely are these given alternatives" knowledge.

        • tensor a month ago

          Traditional OCR neural networks like Tesseract crucially have strong measures of their accuracy levels, including when they employ dictionaries or the like to help with accuracy. LLMs, on the other hand, give you zero guarantees and have some pretty insane edge cases.

          With a traditional OCR architecture maybe you'll get a symbol or two wrong, but an LLM can give you entirely new words or numbers not in the document, or even omit sections of the document. I'd never use an LLM for OCR like this.

          • mlyle a month ago

            If you use an LLM stupidly, sure. But you can get pseudo-probabilities of the next symbol from the LLM and use e.g. Bayes' rule to combine them with how well each alternative matches the page. You can also report the total uncertainty at the end.

            Done properly, this should strictly improve the results.
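
            A toy sketch of that combination (the per-character scores are made-up stand-ins for real OCR and LLM outputs):

                import math

                def posterior(candidates, ocr_loglik, lm_logprob):
                    # log P(char | image, context) = log P(image | char) + log P(char | context) + const
                    scores = {c: ocr_loglik[c] + lm_logprob[c] for c in candidates}
                    norm = math.log(sum(math.exp(s) for s in scores.values()))
                    return {c: math.exp(s - norm) for c, s in scores.items()}

                # OCR alone slightly favors "0"; the language-model prior flips it to "o"
                print(posterior(["o", "0"],
                                ocr_loglik={"o": -1.2, "0": -1.0},
                                lm_logprob={"o": -0.1, "0": -3.0}))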

        • surfingdino a month ago

          It's all fun and games until you need to prove something in court or to the tax office. I don't think that throwing an LLM into this mix helps.

          • wmf a month ago

            Generally when OCRing documents you should keep the original scans so you can refer back to them in case of any questions or disputes.

      • qingcharles a month ago

        It depends what your use-case is. At a low enough cost this would work for a project I'm doing where I really just need to be able to mostly search large documents. Less-than-100% accuracy, with a lost or hallucinated paragraph here and there, wouldn't be a deal-killer, especially if the original page image is available to the user too.

        Additionally, this might also work if you are feeding the output to a bunch of humans to proof.

  • ctm92 a month ago

    That was also what first came to my mind, I guess Zerox might be a reference to this

bearjaws a month ago

I did this for images using Tesseract for OCR + Ollama for AI.

Check it out, https://cluttr.ai

Runs entirely in browser, using OPFS + WASM.

binalpatel a month ago

You can do some really cool things now with these models, like ask them to extract not just the text but figures/graphs as nodes/edges, and it works very well. Back when GPT-4 with vision came out I tried this with a simple prompt plus dumping in a pydantic schema of what I wanted, and it was spot on; pretty much this (before JSON mode was supported):

    You are an expert in PDFs. You are helping a user extract text from a PDF.

    Extract the text from the image as a structured json output.

    Extract the data using the following schema:

    {Page.model_json_schema()}

    Example:
    {{
      "title": "Title",
      "page_number": 1,
      "sections": [
        ...
      ],
      "figures": [
        ...
      ]
    }}

https://binal.pub/2023/12/structured-ocr-with-gpt-vision/
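
A `Page` model along those lines might look like the following (the nested field names here are illustrative guesses):

    from pydantic import BaseModel

    class Section(BaseModel):
        heading: str
        text: str

    class Figure(BaseModel):
        caption: str
        description: str

    class Page(BaseModel):
        title: str
        page_number: int
        sections: list[Section]
        figures: list[Figure]

    schema = Page.model_json_schema()  # this is what gets dumped into the prompt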

amluto a month ago

My intuition is that the best solution here would be a division of labor: have the big multimodal model identify tables, paragraphs, etc., and output a mapping between segments of the document and textual output. Then a much simpler model that doesn't try to hold entire conversations can process those segments into their contents.

This will perform worse in cases where whatever understanding the large model has of the contents is needed to recognize indistinct symbols. But it will avoid cases where that very same understanding causes contents to be understood incorrectly due to the model’s assumptions of what the contents should be.

At least in my limited experiments with Claude, it’s easy for models to lose track of where they’re looking on the page and to omit things entirely. But if segmentation of the page is explicit, one can enforce that all contents end up in exactly one segment.

aman2k4 a month ago

I am using AWS Textract + LLM (OpenAI/Claude) to read grocery receipts for <https://www.5outapp.com>

So far, I have collected over 500 receipts from around 10 countries with 30 different supermarkets in 5 different languages.

What has worked for me so far is having control over OCR and processing (for formatting/structuring) separately. I don't have the figures to provide a cost structure, but I'm looking for other solutions to improve both speed and accuracy. Also, I need to figure out a way to put a metric around accuracy. I will definitely give this a shot. Thanks a lot.

  • sleno a month ago

    Cool design. FYI the "Try now" card looks like it didn't render right; I'm just seeing a blank box around the button.

    • aman2k4 a month ago

      You mean in the web version? It is supposed to look like a blank box in the rectangular grocery-bill shape, but I suppose the design can be a bit better there. Thanks for the feedback.

      • sumedh a month ago

        The current design with that box feels broken

        • aman2k4 a month ago

          Ok, thanks for the feedback. Will think of something else

refulgentis 2 months ago

FWIW, I have it on good sourcing that OpenAI supplies Tesseract output to the LLM, so you're in a great place; best of all worlds.

lootsauce a month ago

In my own experiments I have had major failures where much of the text was fabricated by the LLM, to the point where I just find it hard to trust even with great prompt engineering. What I have been very impressed with is its ability to take medium-quality OCR from Acrobat, with poor formatting, lots of errors, and punctuation problems, and render 100% accurate and properly formatted output when simply asked to correct the OCR output. This approach, using traditional cheap OCR for grounding, might be a really robust and cheap option.
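
A minimal sketch of that correction pass, assuming `ocr_text` holds the messy OCR output (the prompt wording is illustrative):

    from openai import OpenAI

    client = OpenAI()

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Correct the OCR errors, punctuation, and formatting "
                                          "in the following text. Do not add or remove content."},
            {"role": "user", "content": ocr_text},
        ],
    )
    corrected = response.choices[0].message.content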

jimmyechan a month ago

Congrats! Cool project! I’d been curious about whether GPT would be good for this task. Looks like this answers it!

Why did you choose markdown? Did you try other output formats and see if you get better results?

Also, I wonder how HTML performs. It would be a way to handle tables with groupings/merged cells.

  • themanmaran a month ago

    I think that I'll add an optional configuration for HTML vs Markdown. Which at the end of the day will just prompt the model differently.

    I've not seen a meaningful difference between either, except when it comes to tables. It seems like HTML tends to outperform markdown tables, especially when you have a lot of complexity (i.e. tables within tables, lots of subheaders).

josefritzishere a month ago

Xerox might want to have a word with you about that name.

ReD_CoDE a month ago

It seems that there's a need for a benchmark comparing all the solutions available in the market on quality and price

The majority of comments here are about prices and quality

Also, is there any movement on product detection? These days I'm looking for solutions that can recognize goods with high accuracy and show [brand][product_name][variant]

samuell a month ago

One problem I've not found any OCR solution to handle well is complex column-based layouts in magazines. Part of the problem is that images often span anything from one to all columns, so the text sometimes flows in funny ways. But in this day and age, this must be possible for the best AI-based tools to handle?

jagermo a month ago

Ohh, this could finally be a great way to get my TTRPG books readable on Kindle. I'll give it a try, thanks for that.

8organicbits 2 months ago

> And 6 months from now it'll be fast, cheap, and probably more reliable!

I like the optimism.

I've needed to include human review when using previous-generation OCR software when I needed the results to be accurate. It's painstaking, but the OCR offered a speedup over fully-manual transcription. Have you given any thought to human-in-the-loop processes?

  • themanmaran a month ago

    I've been surprised so far by LLMs' capability, so I hope it continues.

    On the human-in-the-loop side, it's really use-case specific. A lot of my company's work is focused on getting trends from large sets of documents.

    Ex: "categorize building permits by municipality". If the OCR was wrong on a few documents, it's still going to capture the general trend. If the use case was "pull bank account info from wire forms" I would want a lot more double-checking. But that said, humans also have a tendency to transpose digits.

    • raisedbyninjas a month ago

      Our human in the loop process with traditional OCR uses confidence scores from regions of interest and the page coordinates to speed-up the review process. I wish the LLM could provide that, but both seem far off on the horizon.

    • 8organicbits a month ago

      Hmm, sounds like different goals. I don't work on that project any longer but it was a very small set of documents and they needed to be transcribed perfectly. Every typo in the original needed to be preserved.

      That said, there's huge value in lossy transcription elsewhere, as long as you can account for the errors they introduce.

  • throwthrowuknow a month ago

    Have you tried using the GraphRAG approach of just rerunning the same prompts multiple times and then giving the results to the model along with a prompt telling it to extract the true text and fix any mistakes? With mini this seems like a very workable solution. You could even incorporate one or more attempts from whatever OCR you were using previously.

    I think that is one of the key findings of the GraphRAG paper: the GPT can replace the human in the loop.
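
    Something like this, presumably (`transcribe` and `ask_model` are illustrative wrappers around the vision and text calls):

        drafts = [transcribe(page_image) for _ in range(3)]  # rerun the same OCR prompt
        reconcile = (
            "Here are several independent transcriptions of the same page. "
            "Extract the true text and fix any mistakes:\n\n"
            + "\n\n---\n\n".join(drafts)
        )
        final_text = ask_model(reconcile)  # one more call to merge the attempts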

downrightmike 2 months ago

Does it also produce a confidence number?

  • ndr_ a month ago

    The only thing close is the "logprobs": https://cookbook.openai.com/examples/using_logprobs

    However, commenters around here have noted that these have likely not been fine-tuned to correlate with accuracy for plain-text LLM uses. Would be interested in hearing findings for MLLM use-cases!
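
    Requesting them looks like this with the OpenAI SDK (whether they track OCR accuracy for image inputs is exactly the open question):

        from openai import OpenAI

        client = OpenAI()
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=messages,  # your OCR prompt plus the page image
            logprobs=True,
            top_logprobs=2,
        )
        for tok in response.choices[0].logprobs.content:
            print(tok.token, tok.logprob)  # per-token log probability, a rough confidence proxy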

  • tensor a month ago

    No, there is no vision LLM that produces confidence numbers to my knowledge.

  • wildzzz a month ago

    The AI says it's 100% confident that its hallucinations are correct.

  • ravetcofx a month ago

    I don't think OpenAI's API for gpt-4o-mini has any such mechanism.

ravetcofx a month ago

I'd be more curious to see the performance of local models like LLaVA etc.

ipkstef a month ago

I think I'm missing something... why would I pay to OCR the images when I can do it locally for free? Tesseract runs pretty well on just a CPU; you wouldn't even need something crazy powerful.

  • daemonologist a month ago

    Tesseract works great for pure label-the-characters OCR, which is sufficient for books and other sources with straightforward layouts, but it doesn't handle weird layouts (tables, columns, tables with columns in each cell, etc.). People will do absolutely depraved stuff with Word and PDF documents, and you often need semantic understanding to decipher it.

    That said, sometimes no amount of understanding will improve the OCR output because a structure in a document cannot be converted to a one-dimensional string (short of using HTML/CSS or something). Maybe we'll get image -> HTML models eventually.

  • gregolo a month ago

    And OpenAI uses Tesseract in the background, as it sometimes answers me that the Hungarian language is not installed for Tesseract

    • s5ma6n a month ago

      I would be extremely surprised if that's the case. There are "open-source" multimodal LLMs that can extract text from images, as proof that the idea works.

      Probably the model is hallucinating and adding "Hungarian language is not installed for Tesseract" to the response.

cmpaul 2 months ago

Great example of how LLMs are eliminating/simplifying giant swathes of complex tech.

I would love to use this in a project if it could also caption embedded images to produce something for RAG...

  • hpen 2 months ago

    Yay! Now we can use more RAM, Network, Energy, etc to do the same thing! I just love hot phones!

    • hpen a month ago

      Oops guess I'm not sippin' the koolaid huh?

throwthrowuknow a month ago

Have you compared the results to special-purpose OCR-free models that do image-to-text with layout? My intuition is mini should be just as good, if not better.

jdthedisciple a month ago

Very nice, seems to work pretty well!

Just

    maintainFormat: true
did not seem to have any effect in my testing.

fudged71 a month ago

Llama 3.1 now has image support, right? Could this be adapted there as well, maybe with Groq for speed?

  • daemonologist a month ago

    Meta trained a vision encoder (page 54 of the Llama 3.1 paper) but has not released it as far as I can tell.

  • themanmaran a month ago

    Yup! I want to evaluate a couple of different model options over time, which should be pretty simple!

    The main thing we're doing is converting documents to a series of images and then aggregating the responses, so we should be model-agnostic pretty soon.
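
    The document-to-images step is the model-agnostic part. In Python it would look roughly like this (the repo itself is JS; PyMuPDF is used here just for illustration):

        import fitz  # PyMuPDF

        doc = fitz.open("document.pdf")
        page_images = []
        for page in doc:
            pix = page.get_pixmap(dpi=150)          # rasterize the page
            page_images.append(pix.tobytes("png"))  # bytes ready for any vision model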

daft_pink a month ago

I would really love something like this that could be run locally.

murmansk a month ago

Man, this is just an awesome hack! Keep it up!

  • murmansk a month ago

    Or not a man, sorry for putting your identity into a bucket.