Make PDFs searchable

I was looking for a good way to make scanned documents searchable. Most newer scanning software already has OCR built in, but what about all the old documents? With pdfsandwich and Tesseract, we can recognize the text on each page of a PDF and place it behind the page image as an invisible layer. That way, we can search the PDF in any normal PDF reader or upload it to Google Translate to get a translated version. To get a text-only version, we can use pdftotext.

First, install the missing packages (tested on Ubuntu 12.04):


# run as root (or prefix with sudo)
# tesseract-ocr-deu is the language data for German
apt-get install tesseract-ocr tesseract-ocr-deu poppler-utils
apt-get install exactimage imagemagick ghostscript

Second, we download and install pdfsandwich:


wget http://downloads.sourceforge.net/project/pdfsandwich/pdfsandwich%200.0.7/\
pdfsandwich_0.0.7_amd64.deb
dpkg -i pdfsandwich_0.0.7_amd64.deb
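
Before running anything, it can be worth checking that all the tools are actually on the PATH. The following is only a small sketch: missing_tools is a helper name I made up, and the list of command names (convert from imagemagick, gs from ghostscript, hocr2pdf from exactimage) is my assumption about what the packages above provide.

```shell
#!/bin/sh
# missing_tools (hypothetical helper): print every given command name
# that cannot be found on the PATH, one per line.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Assumed command names for the packages installed above.
missing=$(missing_tools tesseract convert gs hocr2pdf pdftotext pdfsandwich)

if [ -n "$missing" ]; then
    echo "Not installed yet:"
    echo "$missing"
else
    echo "All tools found."
fi
```

If the list is not empty, install the corresponding package first.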

Finally, we run pdfsandwich and pdftotext on a PDF:


pdfsandwich -resolution 240x240 -rgb -lang deu german_document.pdf
# creates german_document_ocr.pdf with colors and 240dpi

pdftotext german_document_ocr.pdf
# gives german_document_ocr.txt
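
A quick way to see whether the OCR worked is to search the extracted text for a word you know is in the document. This is a rough sketch: count_hits is a hypothetical helper, and "Rechnung" is just an example search term. It relies on pdftotext writing to standard output when the output file is given as "-".

```shell
#!/bin/sh
# count_hits (hypothetical helper): read text from stdin and print the
# number of lines that contain the search term, ignoring case.
count_hits() {
    grep -i -c -- "$1"
}

# Example usage (file name and search term are placeholders):
#   pdftotext german_document_ocr.pdf - | count_hits "Rechnung"
```

A result of 0 usually means the text layer is missing or the wrong language was used for recognition.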

To process all PDFs in the current directory, find can be used:


# skip files that are already OCR results from an earlier run
find . -name "*.pdf" ! -name "*_ocr.pdf" -exec pdfsandwich -resolution 240x240 -rgb -lang deu {} \;
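
If a directory is processed more than once, it helps to skip files that already have an OCR result and to extract the text in the same pass. Here is a minimal sketch of that idea. The function names ocr_name, needs_ocr and ocr_all are made up for this example, and it assumes pdfsandwich writes NAME_ocr.pdf next to NAME.pdf, as in the single-file example.

```shell
#!/bin/sh
# ocr_name (hypothetical helper): map foo.pdf to foo_ocr.pdf
ocr_name() {
    printf '%s_ocr.pdf\n' "${1%.pdf}"
}

# needs_ocr (hypothetical helper): succeed only if the file is not itself
# an OCR result and no OCR result exists for it yet.
needs_ocr() {
    case "$1" in
        *_ocr.pdf) return 1 ;;
    esac
    [ ! -e "$(ocr_name "$1")" ]
}

# ocr_all (hypothetical helper): OCR every unprocessed PDF directly inside
# the given directory (not subdirectories), then extract the text.
ocr_all() {
    dir=${1:-.}
    for pdf in "$dir"/*.pdf; do
        [ -e "$pdf" ] || continue      # the glob matched nothing
        needs_ocr "$pdf" || continue   # already processed, or an OCR result
        pdfsandwich -resolution 240x240 -rgb -lang deu "$pdf" &&
            pdftotext "$(ocr_name "$pdf")"
    done
}

# Only does work when a directory is given, e.g.: sh ocr_all.sh .
if [ "$#" -gt 0 ]; then
    ocr_all "$1"
fi
```

Unlike the find command, this version does not descend into subdirectories, which keeps the loop simple.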
