I was looking for a good way to make scanned documents searchable. Most newer scanning software already has some OCR built in, but what about all the old documents? Using pdfsandwich and Tesseract, we recover the text from each page of a PDF and place it behind the page image as an invisible layer. That way, we can search the PDF with a normal PDF reader or upload it to Google Translate to get a translated version. To get a text-only version, pdftotext can be used.
First, install the missing packages (tested on Ubuntu 12.04):
# we use tesseract-ocr-deu for German
apt-get install tesseract-ocr tesseract-ocr-deu poppler-utils
apt-get install exactimage imagemagick ghostscript
Second, we download and install pdfsandwich:
wget http://downloads.sourceforge.net/project/pdfsandwich/pdfsandwich%200.0.7/\
pdfsandwich_0.0.7_amd64.deb
dpkg -i pdfsandwich_0.0.7_amd64.deb
Finally, we run pdfsandwich and pdftotext on a PDF:
pdfsandwich -resolution 240x240 -rgb -lang deu german_document.pdf
# creates german_document_ocr.pdf with colors and 240dpi
pdftotext german_document_ocr.pdf
# gives german_document_ocr.txt
To process all PDFs in the current directory, find can be used:
find . -name "*.pdf" -exec pdfsandwich -resolution 240x240 -rgb -lang deu {} \;
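One caveat with the find call: if it is run a second time, it will also match the *_ocr.pdf files from the first run. A small shell function can skip those and call pdftotext right away. This is only a sketch: ocr_all_pdfs is a name I made up, and it assumes that pdfsandwich writes its output next to the input file as <name>_ocr.pdf, as in the single-file example.

```shell
# Sketch: OCR every PDF in one directory and extract the text.
# ocr_all_pdfs is a hypothetical helper name, not part of pdfsandwich.
# Assumption: pdfsandwich writes <name>_ocr.pdf next to <name>.pdf.
ocr_all_pdfs() {
    dir="${1:-.}"
    for pdf in "$dir"/*.pdf; do
        # The glob stays literal when no PDF exists; skip that case.
        [ -e "$pdf" ] || continue

        # Skip files that are themselves the output of an earlier run.
        case "$pdf" in
            *_ocr.pdf) continue ;;
        esac

        ocr="${pdf%.pdf}_ocr.pdf"

        # Skip inputs that already have an OCR version.
        if [ -e "$ocr" ]; then
            continue
        fi

        # Same options as in the single-file example (German, 240 dpi, color).
        pdfsandwich -resolution 240x240 -rgb -lang deu "$pdf" || continue

        # Writes <name>_ocr.txt next to the OCR'd PDF.
        pdftotext "$ocr"
    done
}
```

Call it as ocr_all_pdfs . for the current directory. Unlike find, it does not descend into subdirectories.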