PDF Editor for repairing book scan OCR?

zabadoh@lemmy.ml · edit-2 11 months ago

sibloure@beehaw.org · 11 months ago

I have had good results with Tesseract. I had to export the PDF to individual JPEGs, batch OCR them with Tesseract, then merge the individual pages back into a single PDF. If you don’t want to use the command line and are okay with it not being open source, PDF24.org does a good job and is free.
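For anyone who wants to script the three steps above (split, OCR, merge), here is a rough sketch rather than the exact commands I ran. It assumes `pdftoppm` and `pdfunite` from poppler-utils and the `tesseract` CLI are installed and on your PATH; those two poppler tools are my assumption for the split and merge steps, and the function names, the `page` prefix, and the 300 DPI default are all just for illustration.

```python
import subprocess
from pathlib import Path


def split_cmd(pdf: Path, prefix: Path, dpi: int = 300) -> list[str]:
    # pdftoppm (poppler-utils) renders one JPEG per page, named <prefix>-<page>.jpg
    return ["pdftoppm", "-jpeg", "-r", str(dpi), str(pdf), str(prefix)]


def ocr_cmd(image: Path, lang: str = "eng") -> list[str]:
    # "tesseract IMAGE OUTBASE -l LANG pdf" writes OUTBASE.pdf with a text layer
    return ["tesseract", str(image), str(image.with_suffix("")), "-l", lang, "pdf"]


def merge_cmd(pages: list[Path], out: Path) -> list[str]:
    # pdfunite (poppler-utils) concatenates the PDFs in the order given
    return ["pdfunite", *map(str, pages), str(out)]


def page_number(image: Path) -> int:
    # Sort on the numeric suffix so page 10 never lands before page 2
    return int(image.stem.rsplit("-", 1)[1])


def ocr_book(pdf: Path, workdir: Path, out: Path) -> None:
    workdir.mkdir(parents=True, exist_ok=True)
    subprocess.run(split_cmd(pdf, workdir / "page"), check=True)
    images = sorted(workdir.glob("page-*.jpg"), key=page_number)
    for image in images:
        subprocess.run(ocr_cmd(image), check=True)
    pages = [image.with_suffix(".pdf") for image in images]
    subprocess.run(merge_cmd(pages, out), check=True)
```

You would call it as something like `ocr_book(Path("scan.pdf"), Path("work"), Path("scan_ocr.pdf"))`. The command builders are kept separate from the runner so you can print and check the commands before letting it loose on a whole book.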

ChickenBoo@lemmy.jnks.xyz · edit-2 11 months ago

If you want to host it locally, Stirling PDF can be run in Docker, and it relies on a library built on Tesseract. It has a bunch of other handy PDF operations, too. I keep it around for the two times a year I need to merge, split, or decrypt PDFs.

https://github.com/Frooodle/Stirling-PDF/blob/main/HowToUseOCR.md

It can run OCR straight from the PDF, and on multiple files at a time.

sibloure@beehaw.org · 10 months ago

This is amazing. I did not realize it existed. Thank you for sharing.