I have a book scan that came back from a book scanning company.
The scan images were fine, but the OCR text in the PDF is wacky, due to eccentric fonts, dirt, etc.
So I’m going to have to go through this by hand and tidy up.
I have tried a lot of FOSS PDF editors on this particular PDF, but none of them work as well as an old copy of Foxit PhantomPDF (an old version of the product currently named Foxit PDF Editor) that I have on a dying laptop.
I’ve tried the following commonly recommended FOSS PDF editors without much success:
LibreOffice Draw - Many text fields in wrong layer order. Page images not visible.
PDFEdit - Loads the file as blank.
Scribus - Won’t load the file.
Firefox - Only allows annotation changes.
Inkscape - It sort of works, but it’s not oriented towards text editing, so viewing and editing text is cumbersome.
I have had good results with Tesseract. I had to export the PDF to individual JPEGs, batch OCR them with Tesseract, then merge the individual pages back into a single PDF. If you don’t want to use the command line and are okay with it not being open source, PDF24.org does a good job and does not charge.
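For anyone who wants to script the export, OCR, and merge steps, here is a rough sketch in Python. The post doesn’t say which tools did the export and merge, so pdftoppm and pdfunite (both from poppler-utils) are my assumption, as are the function names, the 300 DPI setting, and the "eng" language. It needs pdftoppm, tesseract, and pdfunite on your PATH.

```python
import shutil
import subprocess
import sys
import tempfile
from pathlib import Path


def export_cmd(pdf: Path, prefix: Path, dpi: int = 300) -> list[str]:
    # pdftoppm writes one JPEG per page, named <prefix>-<page number>.jpg
    return ["pdftoppm", "-jpeg", "-r", str(dpi), str(pdf), str(prefix)]


def ocr_cmd(image: Path, lang: str = "eng") -> list[str]:
    # The trailing "pdf" config makes tesseract write <output base>.pdf,
    # a single-page PDF with a searchable text layer over the image.
    return ["tesseract", str(image), str(image.with_suffix("")), "-l", lang, "pdf"]


def merge_cmd(pages: list[Path], out: Path) -> list[str]:
    # pdfunite concatenates the inputs in the order given.
    return ["pdfunite", *[str(p) for p in pages], str(out)]


def run_pipeline(pdf: Path, out: Path, workdir: Path, lang: str = "eng") -> None:
    for tool in ("pdftoppm", "tesseract", "pdfunite"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} not found on PATH")
    subprocess.run(export_cmd(pdf, workdir / "page"), check=True)
    # pdftoppm zero-pads page numbers, so a plain sort keeps page order.
    images = sorted(workdir.glob("page-*.jpg"))
    if not images:
        raise RuntimeError("no page images were exported")
    for image in images:
        subprocess.run(ocr_cmd(image, lang), check=True)
    pages = [image.with_suffix(".pdf") for image in images]
    subprocess.run(merge_cmd(pages, out), check=True)


if __name__ == "__main__":
    if len(sys.argv) != 3:
        sys.exit("usage: ocr_pages.py INPUT.pdf OUTPUT.pdf")
    with tempfile.TemporaryDirectory() as tmp:
        run_pipeline(Path(sys.argv[1]), Path(sys.argv[2]), Path(tmp))
```

Saved as ocr_pages.py (a made-up name), it would run as: python ocr_pages.py scan.pdf scan-ocr.pdf. The output is rebuilt from the exported JPEGs, so expect the file size and image quality to differ from the original PDF.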
If you want to host it locally, Stirling PDF can be run in Docker, and uses a library that uses Tesseract. It has a bunch of other handy PDF operations, too. I keep it around for the two times a year I need to merge, split, or decrypt PDFs.
https://github.com/Frooodle/Stirling-PDF/blob/main/HowToUseOCR.md
It can OCR straight from a PDF and handle multiple files at a time.
This is amazing. I did not realize it existed. Thank you for sharing.