rcfox 3 years ago

I made a PDF diff tool for myself for one of my internships >10 years ago.

One of my responsibilities was to get sign-off on any changes engineers made to schematics. (It was a really bad internship! I ended up automating a bunch of stuff for them because my actual job was mind-numbingly boring.)

After the first few times comparing PDFs of schematics by eye, I said screw it and wrote a tool to export to PNGs and then compare them pixel-by-pixel. It saved me a ton of time and my sanity.

  • bob1029 3 years ago

    Rasterization prior to comparison is a really clever idea. Doing this with PDF spec code would be an absolute nightmare. Checking 2 bitmaps for equality is just a few lines of code.

    Many times, you aren't really concerned with the specific differences being described natively. You just want to know if a human would perceive differences between 2 copies of the "same" form.
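
    A minimal sketch of the "few lines of code" equality check, assuming each page has already been rasterized into a list of rows of (R, G, B) tuples. The function name `bitmaps_equal` and the tiny sample pages are hypothetical, not taken from any tool mentioned in this thread.

```python
def bitmaps_equal(a, b):
    """Return True when two rasterized pages have the same size and pixels."""
    # Each bitmap is a list of rows; each row is a list of (R, G, B) tuples.
    if len(a) != len(b):
        return False
    # List equality also catches rows of different widths.
    return all(row_a == row_b for row_a, row_b in zip(a, b))


WHITE, BLACK = (255, 255, 255), (0, 0, 0)
page_v1 = [[WHITE, BLACK], [WHITE, WHITE]]
page_v2 = [[WHITE, BLACK], [WHITE, BLACK]]  # one pixel changed

print(bitmaps_equal(page_v1, page_v1))  # True
print(bitmaps_equal(page_v1, page_v2))  # False
```

    This only answers "same or different"; it says nothing about where the change is.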

    • markrages 3 years ago

      I wrote a tool to do this for circuit board artwork changes. It used gerbv to rasterize the vector (gerber) files, then used XOR on the pixels to make unchanged areas disappear. For such a simple tool it is quite useful.
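
      The XOR step could look roughly like the sketch below. It assumes both rasters are 8-bit grayscale with identical dimensions, stored as lists of rows of ints; `xor_bitmaps` and `count_changed` are hypothetical names, and this is not the original tool's code.

```python
def xor_bitmaps(a, b):
    """XOR two same-sized 8-bit grayscale bitmaps pixel by pixel."""
    # Identical pixels XOR to 0, so unchanged areas come out black.
    return [[pa ^ pb for pa, pb in zip(row_a, row_b)]
            for row_a, row_b in zip(a, b)]


def count_changed(xored):
    """Count the pixels that survived the XOR (any non-zero value)."""
    return sum(1 for row in xored for px in row if px != 0)


old_art = [[0, 255, 255],
           [0, 0, 255]]
new_art = [[0, 255, 0],
           [0, 0, 255]]

result = xor_bitmaps(old_art, new_art)
print(result)                 # [[0, 0, 255], [0, 0, 0]]
print(count_changed(result))  # 1
```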

  • crecker 3 years ago

    > After the first few times comparing PDFs of schematics by eye, I said screw it and wrote a tool to export to PNGs and then compare them pixel-by-pixel. It saved me a ton of time and my sanity.

    It saved mine! I usually work with books, and it's really difficult to ask 10 people to check every single page for differences from a previous PDF I sent. They will never find all the differences because we're humans, not bots. We make mistakes.

btown 3 years ago

https://draftable.com is my go-to tool in this space. It understands text flow and the difference between structural and formatting differences, and is amazing when given PDFs of legal docs that you need to retroactively get a redline for. Incredible tool, free but not open source.

pea 3 years ago

Nice work! We had to solve this problem a few years ago and iterated through a bunch of technical solutions and rabbit holes.

Funnily enough, the best performing solution for users was having the two PDFs rendered as images with a slider to "switch" between them, and relying on the human eye to spot the differences.

  • crecker 3 years ago

    Thank you! :) You would not believe how many issues I spotted in older books I formatted. Paragraphs randomly moved to the top, images stretched as I built more PDFs... all human errors that can, at least, be spotted.

elmimmo 3 years ago

Kasyan Servetsky created a script[1] to compare two documents using a similar technique within InDesign, no intermediate export necessary. It overlays one on top of the other using the Difference blend mode. Quite resource intensive, so only useful depending on the particular document and machine you do this with.

[1]: http://kasyan.ho.ua/indesign/all/compare_two_documents.html

  • crecker 3 years ago

    Really similar to one I have already seen for Photoshop. My goal was to compare the PDFs without any Adobe software; my editors do not have a license for that.

JakeAl 3 years ago

I just use Beyond Compare from Scooter Software. Best program ever. Been using it for 20 years now. I swear by it as an evangelist user.

Diff images, Excel documents, PDFs, whatever.

  • mandeepj 3 years ago

    > Diff images, Excel documents, PDF

    I'm also a big fan and user of Beyond Compare, but I had no idea that it can also compare those files. Those file options don't appear in the app either; I was looking for PDF in particular.

christkv 3 years ago

Would stitching all the images together into a single continuous image before comparison be better? I mean you would trade off memory but would you not avoid page break errors?
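
A rough sketch of the stitching idea, assuming each page is an 8-bit grayscale raster stored as a list of rows of ints. The name `stitch_pages` and the white padding for narrower pages are assumptions made for illustration.

```python
def stitch_pages(pages, background=255):
    """Concatenate page bitmaps top to bottom into one tall bitmap."""
    # Pad narrower rows with the background so every row has the same width.
    width = max(len(row) for page in pages for row in page)
    stitched = []
    for page in pages:
        for row in page:
            stitched.append(list(row) + [background] * (width - len(row)))
    return stitched


pages = [
    [[0, 0], [255, 255]],  # page 1: two rows, two pixels wide
    [[0]],                 # page 2: one row, one pixel wide
]
print(stitch_pages(pages))  # [[0, 0], [255, 255], [0, 255]]
```

The memory trade-off is visible here: the stitched bitmap holds every page of the document at once. Stitching alone also does not realign content that shifted, so a plain pixel-by-pixel comparison of two stitched images still marks everything after an insertion as changed.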

smartmic 3 years ago

On Windows I was quite happy with DiffPDF[1]. As far as I remember, there used to be an older open source version as well, but I could not find it at first glance.

[1] https://www.qtrac.eu/diffpdf.html

dloss 3 years ago

i-net PDFC is my favorite tool in this space (after having tried a dozen others). Awesome features, multi-platform. Free trial version and an online demo available. https://www.inetsoftware.de/products/pdf-content-comparer

Not affiliated in any way, just a happy user.

  • crecker 3 years ago

    Thanks for the comment! I think it's not too difficult to build something similar (even with a web server so you can make comparisons on the fly).

    I'm playing with alpha compositing, and I'm going to commit some small changes to enhance the experience a little bit.

    • dloss 3 years ago

      Most important for me is the ability to ignore page boundaries. Removing the first half of the first page of one source document shouldn't produce differences on all following pages.

      • crecker 3 years ago

        Thanks for the feedback. Should the user specify the page boundaries, or does pdf-diff automatically check where page boundaries are?

        • crecker 3 years ago

          > It's automatic. As if there were no page boundaries (like in text diffs). But for a page-based diff tool it would already be quite useful if the user could interactively specify pages to ignore, so that the comparison can start again where both docs are the same (e.g. at the start of the next chapter).

          Gotcha. I might transform this project from a command-line-only tool into a visual one. I could implement that, thanks for the idea. I can actually have some sections of the PDF ignored.

        • dloss 3 years ago

          It's automatic. As if there were no page boundaries (like in text diffs). But for a page-based diff tool it would already be quite useful if the user could interactively specify pages to ignore, so that the comparison can start again where both docs are the same (e.g. at the start of the next chapter).
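
          One way to get the "like in text diffs" behavior on rasters is to treat every pixel row of the stitched document as a "line" and run an ordinary sequence diff over the rows. The sketch below uses Python's standard `difflib.SequenceMatcher`; the function name `diff_rows` and the toy rows are hypothetical, and nothing here is claimed to be how i-net PDFC or pdf-diff works.

```python
from difflib import SequenceMatcher


def diff_rows(rows_a, rows_b):
    """Diff two stitched bitmaps row by row, like a text diff over lines."""
    # Rows must be hashable, so convert each one to a tuple.
    a = [tuple(row) for row in rows_a]
    b = [tuple(row) for row in rows_b]
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    # Keep only the opcodes that describe an actual change.
    return [op for op in matcher.get_opcodes() if op[0] != "equal"]


A, B, C, D = (0, 0), (0, 255), (255, 0), (255, 255)
X = (7, 7)  # a new row inserted near the top of the second document

old_doc = [A, B, C, D]
new_doc = [A, X, B, C, D]

print(diff_rows(old_doc, new_doc))  # [('insert', 1, 1, 1, 2)]
```

          Only the inserted row is reported; the rows that merely moved down are matched as equal. Real pages contain many identical blank rows, so a practical version would probably need to diff larger blocks or text lines rather than single pixel rows.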

iamandras 3 years ago

Did you try it with more complicated PDFs?

  • crecker 3 years ago

    Thanks for your comment! By complicated PDFs, what do you mean exactly? A lot of pages? With forms? I've tried with standard technical books I have bought.

    The pdf-diff tool simply captures a sort of "screenshot" of a PDF page and then compares it against the corresponding page from the second PDF file you'd like to compare. Since it works by comparing raw pixels, I guess it'll work with complicated PDFs.

    Now, "working" does not mean the result is good for the eye, and this is why I asked for help regarding some issues.
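
    A simplified illustration of that page-by-page approach, written in Python for brevity rather than in the Go of the actual tool. It assumes pages are already rasterized to equal-sized lists of rows of (R, G, B) tuples; `highlight_diff` and `diff_documents` are hypothetical names and this is not the pdf-diff source.

```python
RED = (255, 0, 0)


def highlight_diff(page_a, page_b):
    """Return (overlay, changed): page_b with differing pixels painted red."""
    overlay, changed = [], 0
    for row_a, row_b in zip(page_a, page_b):
        out_row = []
        for px_a, px_b in zip(row_a, row_b):
            if px_a == px_b:
                out_row.append(px_b)
            else:
                out_row.append(RED)
                changed += 1
        overlay.append(out_row)
    return overlay, changed


def diff_documents(doc_a, doc_b):
    """Compare page i of doc_a with page i of doc_b; return changed-pixel counts."""
    return [highlight_diff(a, b)[1] for a, b in zip(doc_a, doc_b)]


WHITE, BLACK = (255, 255, 255), (0, 0, 0)
doc_a = [[[WHITE, BLACK]], [[WHITE, WHITE]]]  # two one-row pages
doc_b = [[[WHITE, BLACK]], [[WHITE, BLACK]]]  # second page changed

print(diff_documents(doc_a, doc_b))            # [0, 1]
print(highlight_diff(doc_a[1], doc_b[1])[0])   # [[(255, 255, 255), (255, 0, 0)]]
```

    Because pages are paired strictly by index, any content that shifts onto the next page makes every later page differ.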

    • mr_mitm 3 years ago

      ImageMagick's `compare` can do this as well, on a pixel-level.

      The issue is that in some cases, for example when the change is that a long sentence has been inserted in the middle of the document, the following headline and some of the first paragraph of the following section move to the next page. Then everything that follows is just marked red and the diff becomes useless.

      • crecker 3 years ago

        > ImageMagick's `compare` can do this as well, on a pixel-level.

        Yes, thanks for the comment. I wanted to build something with golang :)

    • kzrdude 3 years ago

      Aha. If one paragraph is added on page 1, it might show a diff on every following page in the same chapter after that, if it causes things to shift?

      It's not easy to decide when such a change needs review and when it doesn't

      • crecker 3 years ago

        Yes, if it causes things to shift, the tool shows every page red. This tool is meant to be run in the final phase of typesetting, the moment when you want to focus attention on details (e.g. spacing).

avipars 3 years ago

Do you compare via OCR?