Ask HN: Indic languages OCR and searchable pdfs

31 points by the-mitr 14 days ago

I am looking for applications that can perform OCR on scanned images containing Indic scripts (Devanagari, Tamil, etc.) and create a searchable PDF as output. There are several applications that can extract the text from images, but is there any application that can create a searchable PDF?

epirogov 14 days ago

I saw Hindi listed in this free "make PDF searchable" app:

https://products.aspose.app/pdf/searchable

So I think it should be possible to extend this to Devanagari on your local machine with Tesseract and Aspose.Pdf, using a C# snippet like this:

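    // lang is a Tesseract language code such as "hin"; outputFolder is a
    // writable directory. Both must be defined by the caller.
    CallBackGetHocr recognizeText = (System.Drawing.Image img) =>
    {
        // Save the page image to a temporary file for Tesseract to read.
        string tmpFile = Path.Combine(outputFolder, Path.GetFileName(Path.GetTempFileName()));
        using System.Drawing.Bitmap bmp = new System.Drawing.Bitmap(img);
        bmp.Save(tmpFile);

        // Reusing the input path as the output base makes Tesseract write "<tmpFile>.hocr".
        string pathTempFile = $"\"{tmpFile}\"";
        string arguments = $"{pathTempFile} {pathTempFile} --oem 1 -l {lang} hocr";

        System.Diagnostics.ProcessStartInfo psi =
            new System.Diagnostics.ProcessStartInfo("tesseract", arguments);

        using (System.Diagnostics.Process p = new System.Diagnostics.Process())
        {
            p.StartInfo = psi;
            p.Start();
            p.WaitForExit();
        }

        // Return the hOCR markup for this page.
        return File.ReadAllText($"{tmpFile}.hocr");
    };

    var doc = new Aspose.Pdf.Document("my_Devanagari_scan.pdf");
    doc.Convert(recognizeText);
    // Write the result to disk; the output file name here is only an example.
    doc.Save("my_Devanagari_scan_searchable.pdf");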
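
If hOCR is only a stepping stone, a shorter route may be worth trying first. The Tesseract CLI that the snippet above shells out to can also write a searchable PDF directly when the trailing config is `pdf` instead of `hocr`. The sketch below is a rough illustration, not a tested pipeline. The file names `scan.png` and `scan_searchable`, the helper `searchable_pdf_cmd`, and the choice of the `hin` language pack are all assumptions for the example. Tesseract reads images rather than PDFs, so a scanned PDF would need to be rasterized to page images first.

```shell
#!/bin/sh
# Sketch: ask Tesseract itself for a searchable PDF instead of hOCR.
# Assumptions (not from the thread): the input file "scan.png", the output
# base "scan_searchable", and an installed "hin" language pack.

# Hypothetical helper: print the Tesseract command line for a given image,
# output base, and language code. The trailing "pdf" selects Tesseract's PDF
# output, which keeps the page image and adds an invisible text layer.
searchable_pdf_cmd() {
    printf 'tesseract "%s" "%s" --oem 1 -l %s pdf\n' "$1" "$2" "$3"
}

input="scan.png"
output_base="scan_searchable"
lang="hin"

# Show the command that would be run.
searchable_pdf_cmd "$input" "$output_base" "$lang"

# Run it only when Tesseract and the input image are actually present.
# On success Tesseract writes "${output_base}.pdf".
if command -v tesseract >/dev/null 2>&1 && [ -f "$input" ]; then
    tesseract "$input" "$output_base" --oem 1 -l "$lang" pdf \
        || echo "tesseract failed (is the language pack installed?)" >&2
fi
```

For Tamil, the language code would be `tam`. Whether the recognized text is good enough to search on will depend on scan quality and the language data in use.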
  • the-mitr 13 days ago

    Thanks, will give it a shot.

nicodjimenez 13 days ago

Check out mathpix.com

(disclaimer: I'm a founder)