I made a PDF diff tool for myself for one of my internships >10 years ago.
One of my responsibilities was to get sign-off on any changes engineers made to schematics. (It was a really bad internship! I ended up automating a bunch of stuff for them because my actual job was mind-numbingly boring.)
After the first few times comparing PDFs of schematics by eye, I said screw it and wrote a tool to export to PNGs and then compare them pixel-by-pixel. It saved me a ton of time and my sanity.
Rasterization prior to comparison is a really clever idea. Doing this with PDF spec code would be an absolute nightmare. Checking 2 bitmaps for equality is just a few lines of code.
Many times, you aren't really concerned with the specific differences being described natively. You just want to know if a human would perceive differences between 2 copies of the "same" form.
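To make the "few lines of code" point concrete, here is a minimal sketch in plain Python. It assumes both pages have already been rasterized to same-sized grayscale bitmaps stored as nested lists; that `Bitmap` representation and the `diff_bbox` helper are made up for illustration and are not taken from any tool mentioned in this thread.

```python
from typing import List, Optional, Tuple

# Hypothetical in-memory format: rows of grayscale pixel values (0-255).
Bitmap = List[List[int]]


def diff_bbox(a: Bitmap, b: Bitmap) -> Optional[Tuple[int, int, int, int]]:
    """Return (left, top, right, bottom) of the changed region, or None if identical."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("bitmaps must have the same dimensions")
    xs, ys = [], []
    for y, (row_a, row_b) in enumerate(zip(a, b)):
        for x, (pa, pb) in enumerate(zip(row_a, row_b)):
            if pa != pb:
                xs.append(x)
                ys.append(y)
    if not xs:
        return None
    return min(xs), min(ys), max(xs), max(ys)


old = [[255, 255, 255], [255, 0, 255], [255, 255, 255]]
new = [[255, 255, 255], [255, 0, 0], [255, 255, 255]]
print(old == new)           # False: plain equality already answers "did anything change?"
print(diff_bbox(old, new))  # (2, 1, 2, 1): where a reviewer should look
```

Plain equality is enough for a yes/no answer; the bounding box is just one cheap way to point a human at the changed area.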
I wrote a tool to do this for circuit board artwork changes. It used gerbv to rasterize the vector (gerber) files, then used XOR on the pixels to make unchanged areas disappear. For such a simple tool it is quite useful.
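A rough sketch of the XOR step, assuming the rasterizer (gerbv in the comment above) has already produced two 1-bit bitmaps of the same size. The nested-list format and the `xor_bitmaps` name are assumptions for illustration, not the original tool's code.

```python
from typing import List

# Assumed 1-bit raster: 1 = copper/ink, 0 = background.
Bitmap = List[List[int]]


def xor_bitmaps(a: Bitmap, b: Bitmap) -> Bitmap:
    """XOR two same-sized bitmaps: unchanged pixels become 0, changed pixels become 1."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("bitmaps must have the same dimensions")
    return [[pa ^ pb for pa, pb in zip(ra, rb)] for ra, rb in zip(a, b)]


old_layer = [[0, 1, 1, 0],
             [0, 1, 1, 0]]
new_layer = [[0, 1, 1, 0],
             [0, 1, 0, 1]]

changed = xor_bitmaps(old_layer, new_layer)
for row in changed:
    print("".join("#" if p else "." for p in row))
# ....
# ..##
```

Everything that is identical in both versions cancels out, so only removed and added artwork remains visible.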
> After the first few times comparing PDFs of schematics by eye, I said screw it and wrote a tool to export to PNGs and then compare them pixel-by-pixel. It saved me a ton of time and my sanity.
It saved mine! I usually work with books, and sometimes it's really difficult to ask 10 people to check every single page for differences from a previous PDF I sent. They will never be able to find all the differences because we're humans, not bots. We make mistakes.
https://draftable.com is my go-to tool in this space. It understands text flow and the difference between structural and formatting differences, and is amazing when given PDFs of legal docs that you need to retroactively get a redline for. Incredible tool, free but not open source.
Nice work! We had to solve this problem a few years ago and iterated through a bunch of technical solutions and rabbit holes.
Funnily enough, the best performing solution for users was having the two PDFs rendered as images with a slider to "switch" between them, and relying on the human eye to spot the differences.
Thank you! :) You wouldn't believe how many issues I spotted in older books I formatted. Paragraphs randomly moved to the top, images stretched as I built more PDFs... all human errors that can, at least, be spotted.
Kasyan Servetsky created a script[1] to compare two documents using a similar technique within InDesign, no intermediate export necessary. It overlays one on top of the other using the Difference blend mode. Quite resource intensive, so only useful depending on the particular document and machine you do this with.
[1]: http://kasyan.ho.ua/indesign/all/compare_two_documents.html
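For anyone curious what a Difference blend does to the pixels, here is a small sketch outside of InDesign. It assumes both pages are available as same-sized grids of RGB tuples; the types and the `difference_blend` function are hypothetical and have nothing to do with the linked script's internals.

```python
from typing import List, Tuple

Pixel = Tuple[int, ...]   # (R, G, B), each 0-255
Image = List[List[Pixel]]


def difference_blend(top: Image, bottom: Image) -> Image:
    """Per-channel absolute difference: identical pixels turn black, changes stay visible."""
    return [
        [tuple(abs(t - b) for t, b in zip(pt, pb)) for pt, pb in zip(row_t, row_b)]
        for row_t, row_b in zip(top, bottom)
    ]


WHITE, BLACK = (255, 255, 255), (0, 0, 0)
page_v1 = [[WHITE, BLACK], [(200, 30, 30), WHITE]]
page_v2 = [[WHITE, WHITE], [(180, 30, 40), WHITE]]

print(difference_blend(page_v1, page_v2))
# [[(0, 0, 0), (255, 255, 255)], [(20, 0, 10), (0, 0, 0)]]
```

Unlike a strict equality check, the result keeps the magnitude of each change, so small color shifts show up as dark, nearly black pixels.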
Really similar to one I have already seen for Photoshop. My goal was to compare the PDFs without any Adobe software; my editors do not have a license for it.
I just use Beyond Compare from Scooter Software. Best program ever. Been using it for 20 years now. I swear by it as an evangelist user.
Diff images, Excel documents, PDFs, whatever.
> Diff images, Excel documents, PDF
I'm also a big fan and user of Beyond Compare, but I had no idea that it could also compare those files. I don't see those file-type options in the app either; I was looking for PDF in particular.
See also diff-pdf https://vslavik.github.io/diff-pdf/ that I usually use for this. (Seems to be relatively popular; 1.4k stars on GitHub: https://github.com/vslavik/diff-pdf)
I took inspiration from that repository too! :)
Would stitching all the images together into a single continuous image before comparison be better? I mean you would trade off memory but would you not avoid page break errors?
Probably. It's really similar to the approach that munificentbob used in his tool (http://journal.stuffwithstuff.com/2021/07/29/640-pages-in-15...). It uses mainly a Photoshop action.
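A quick sketch of the stitching idea, assuming every page is a list of equal-width pixel rows. The `stitch` and `differing_rows` helpers and the toy row values are invented for illustration.

```python
from itertools import zip_longest
from typing import List

Row = List[int]
Page = List[Row]


def stitch(pages: List[Page]) -> List[Row]:
    """Concatenate pages top to bottom into one tall image."""
    widths = {len(row) for page in pages for row in page}
    if len(widths) > 1:
        raise ValueError("all rows must have the same width")
    return [row for page in pages for row in page]


def differing_rows(a: List[Row], b: List[Row]) -> List[int]:
    """Indices where a naive row-by-row comparison sees a difference."""
    return [i for i, (ra, rb) in enumerate(zip_longest(a, b)) if ra != rb]


r1, r2, r3, r4, new, blank = [1, 1], [2, 2], [3, 3], [4, 4], [9, 9], [0, 0]
doc_a = [[r1, r2], [r3, r4]]                 # two pages
doc_b = [[r1, new], [r2, r3], [r4, blank]]   # one row inserted, content reflowed

print(differing_rows(stitch(doc_a), stitch(doc_b)))  # [1, 2, 3, 4, 5]
```

In this toy case stitching removes the page breaks, but a naive positional comparison still flags every row after the insertion, so some kind of alignment step would probably be needed on top.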
This approach would be pretty cool for a bunch of file formats as long as you can rasterize them
On Windows I was quite happy with DiffPDF[1]. As far as I remember, there used to be an older open source version as well, but I could not find it at first glance.
[1] https://www.qtrac.eu/diffpdf.html
That would be http://www.qtrac.eu/diffpdf-foss.html
There is a fork here: https://gitlab.com/eang/diffpdf
diffpdf doesn't look very pretty but it sure is useful. I use it for all my legal paperwork. Glad OP is contributing to this space.
https://screenshots.debian.net/package/diffpdf
This reminds me of this: https://news.ycombinator.com/item?id=28191343
Subtracting PDFs with ImageMagick to test whether a change in code produces the same output PDFs
i-net PDFC is my favorite tool in this space (after having tried a dozen others). Awesome features, multi-platform. Free trial version and an online demo available. https://www.inetsoftware.de/products/pdf-content-comparer
Not affiliated in any way, just a happy user.
And none of the other tools I've tried was significantly better than the old GPL-licensed version of DiffPDF: http://www.qtrac.eu/diffpdf-foss.html
Thanks for the comment! I think it's not too difficult to build something similar (even with a web server, so you can make comparisons on the fly).
I'm playing with alpha compositing, and I'm going to commit some small changes to enhance the experience a little bit.
Most important for me is the ability to ignore page boundaries. Removing the first half of the first page of one source document shouldn't produce differences on all following pages.
Thanks for the feedback. Should the user specify the page boundaries, or should pdf-diff automatically check where the page boundaries are?
> It's automatic, as if there were no page boundaries (like in text diffs). But for a page-based diff tool it would already be quite useful if the user could interactively specify pages to ignore, so that the comparison can start again where both docs are the same (e.g. at the start of the next chapter).
Gotcha. I might transform this project from a command-line only to a visual tool. I could implement that, thanks for the idea. I can actually have some sections of the PDF ignored.
It's automatic, as if there were no page boundaries (like in text diffs). But for a page-based diff tool it would already be quite useful if the user could interactively specify pages to ignore, so that the comparison can start again where both docs are the same (e.g. at the start of the next chapter).
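One way to get the "like in text diffs" behavior is to treat each pixel row of the whole document as a line of text and let a sequence matcher align them. The sketch below uses Python's standard `difflib` for that; the row data and the `changed_row_ranges` helper are hypothetical, and real rasterized pages (anti-aliasing, long runs of identical blank rows) may need hashing or tolerance on top of this.

```python
from difflib import SequenceMatcher
from typing import List, Sequence, Tuple


def changed_row_ranges(
    a: Sequence[Sequence[int]], b: Sequence[Sequence[int]]
) -> List[Tuple[str, int, int, int, int]]:
    """Align two tall images row by row and return only the non-equal opcodes."""
    # Rows must be hashable for SequenceMatcher, so convert each one to a tuple.
    rows_a = [tuple(r) for r in a]
    rows_b = [tuple(r) for r in b]
    matcher = SequenceMatcher(None, rows_a, rows_b, autojunk=False)
    return [op for op in matcher.get_opcodes() if op[0] != "equal"]


r1, r2, r3, r4, new, blank = [1, 1], [2, 2], [3, 3], [4, 4], [9, 9], [0, 0]
rows_old = [r1, r2, r3, r4]
rows_new = [r1, new, r2, r3, r4, blank]  # one row inserted, one trailing blank row

print(changed_row_ranges(rows_old, rows_new))
# [('insert', 1, 1, 1, 2), ('insert', 4, 4, 5, 6)]
```

Only the inserted rows are reported; the rows that merely shifted down are matched up and stay out of the diff.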
https://www.parepdf.com/ is pretty nice.
It looks like I need to watch it with 3d glasses, but I like it!
Did you try it with more complicated PDFs?
Thanks for your comment! By complicated PDFs, what do you mean exactly? A lot of pages? With forms? I've tried with standard technical books I have bought.
The pdf-diff tool simply captures a sort of "screenshot" of each PDF page and then compares it against the corresponding page from the second PDF file. Since it works by comparing raw pixels, I guess it'll work with complicated PDFs.
Now "working" does not mean it's a good result for the eye and this is why I asked for help regarding some issues.
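For readers who want a feel for that page-by-page approach, here is a rough sketch in Python (the actual tool is written in Go, and this is not its code). It assumes each page is already rasterized into a same-sized grid of RGB tuples; `highlight` and `compare_documents` are made-up names.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Page = List[List[Pixel]]

RED: Pixel = (255, 0, 0)


def highlight(page_a: Page, page_b: Page) -> Tuple[Page, int]:
    """Return a copy of page_b with differing pixels painted red, plus their count."""
    overlay: Page = []
    changed = 0
    for row_a, row_b in zip(page_a, page_b):
        out_row = []
        for pa, pb in zip(row_a, row_b):
            if pa == pb:
                out_row.append(pb)
            else:
                out_row.append(RED)
                changed += 1
        overlay.append(out_row)
    return overlay, changed


def compare_documents(doc_a: List[Page], doc_b: List[Page]) -> List[Tuple[int, int]]:
    """Return (1-based page number, changed pixel count) for each page pair."""
    return [
        (n, highlight(a, b)[1])
        for n, (a, b) in enumerate(zip(doc_a, doc_b), start=1)
    ]


W, K = (255, 255, 255), (0, 0, 0)
doc_a = [[[W, K], [W, W]], [[W, W], [K, W]]]
doc_b = [[[W, K], [W, W]], [[W, W], [W, K]]]

print(compare_documents(doc_a, doc_b))  # [(1, 0), (2, 2)]
```

Because pages are paired strictly by position, this sketch does nothing about reflowed content; it only reports which pages differ and by how many pixels.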
ImageMagick's `compare` can do this as well, on a pixel-level.
The issue is that in some cases, for example when the change is that a long sentence has been inserted in the middle of the document, the following headline and some of the first paragraph of the following section move to the next page. Then everything that follows is just marked red and the diff becomes useless.
> ImageMagick's `compare` can do this as well, on a pixel-level.
Yes, thanks for the comment. I wanted to build something with golang :)
Aha. If one paragraph is added on page 1, it might show a diff on every following page in the same chapter after that, if it causes things to shift?
It's not easy to decide when such a change needs review and when it doesn't
Yes, if it causes things to shift, the tool shows every page red. This tool is meant to be run in the final phase of typesetting, the moment when you want to focus attention on details (e.g. spacing).
Do you compare via OCR?