Ask HN: Indic languages OCR and searchable pdfs
I am looking for applications that can perform OCR on scanned images containing Indic scripts (Devanagari, Tamil, etc.) and produce a searchable PDF as output. Several applications can extract the text from images, but is there one that can create a searchable PDF?
I saw Hindi listed in this free app for making PDFs searchable:
https://products.aspose.app/pdf/searchable
So I think it is possible to extend this to Devanagari on your local machine with Tesseract and Aspose.Pdf, using a C# snippet like this:
CallBackGetHocr recognizeText = (System.Drawing.Image img) =>
{
    string tmpFile = Path.Combine(outputFolder, Path.GetFileName(Path.GetTempFileName()));
    using System.Drawing.Bitmap bmp = new System.Drawing.Bitmap(img);
    bmp.Save(tmpFile);
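For comparison, Tesseract on its own can write a searchable PDF when given its `pdf` output config, so a local setup without Aspose may be enough for single images. Below is a minimal Python sketch of that route, not the Aspose approach above. It assumes the `tesseract` binary is on your PATH and that the `hin` and `tam` trained-data files are installed. The helper names `build_tesseract_cmd` and `make_searchable_pdf` are made up for illustration.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_cmd(image_path, output_base, langs):
    """Build the argument list for: tesseract IMAGE OUTPUTBASE -l LANGS pdf."""
    if not langs:
        raise ValueError("at least one language code is required")
    # Tesseract accepts several languages joined with '+', e.g. "hin+tam".
    return [
        "tesseract",
        str(image_path),
        str(output_base),
        "-l",
        "+".join(langs),
        "pdf",  # output config: image with an invisible text layer
    ]


def make_searchable_pdf(image_path, output_base, langs=("hin",)):
    """Run Tesseract and return the path of the PDF it writes."""
    # Assumption: the tesseract CLI is installed and on PATH.
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    subprocess.run(build_tesseract_cmd(image_path, output_base, langs), check=True)
    # Tesseract appends ".pdf" to the output base name.
    return Path(str(output_base) + ".pdf")
```

Calling `make_searchable_pdf("scan.png", "scan_ocr", langs=("hin", "tam"))` (file names are placeholders) should leave a `scan_ocr.pdf` next to your script. Recognition quality will depend on the scan and on the trained data you have installed.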
Thanks, will give it a shot.
https://readcoop.eu/transkribus/
Other things that may help:
https://github.com/dh-tech/awesome-digital-humanities
https://ithaca.deepmind.com/
https://papyri.info/docs/resources
https://glam-workbench.net/
https://programminghistorian.org/
https://github.com/paperless-ngx/paperless-ngx
Check out mathpix.com
(disclaimer: I'm a founder)