pdfgrep - shell script to search through PDF documents

tags: programming (August 26, 2008)

This little shell script searches for the occurrence of a text string (pattern) in PDF documents. Example calls are

$ pdfgrep "cauliflower" recipes/*.pdf
$ pdfgrep "speed of light" `find publications/ -name "Einstein_*.pdf"`

The first one searches every PDF in recipes/ for "cauliflower"; the second one recursively searches all publications by Einstein for "speed of light".
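
One caveat with the second call: the shell splits the output of find at whitespace, so a file name containing a space reaches pdfgrep as two separate arguments. The sketch below illustrates the difference with a throwaway directory. The file names and the count_args helper are made up for the demonstration, and it assumes the temporary directory path itself contains no whitespace.

```shell
# Hypothetical demo files; one name contains a space on purpose.
demo=$(mktemp -d)
touch "$demo/Einstein_1905.pdf" "$demo/Einstein_general relativity.pdf" "$demo/Bohr_1913.pdf"

# count_args stands in for pdfgrep: it only reports how many arguments it got.
count_args() { echo "$#"; }

# Command substitution: the name with a space is split into two words.
split_count=$(count_args $(find "$demo" -name 'Einstein_*.pdf'))

# find -exec ... {} + hands every file name over as exactly one argument.
exact_count=$(find "$demo" -name 'Einstein_*.pdf' -exec sh -c 'echo "$#"' _ {} +)

echo "word-split: $split_count, find -exec: $exact_count"
rm -r "$demo"
```

With the two matching files above, the first count should come out as 3 and the second as 2. Applied to the real search, and assuming pdfgrep is on the PATH, the same idea would read: find publications/ -name "Einstein_*.pdf" -exec pdfgrep "speed of light" {} +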

The script prints the file names of all documents that contain the pattern, each preceded by the number of matching lines (grep -c counts lines, not individual occurrences). The search is case-insensitive.

#! /bin/bash

# usage: pdfgrep pattern files

pattern="$1"
shift 1

echo -e "# count\tdocument"
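# "$@" keeps file names that contain spaces intact
for pdf in "$@"; do
    # -c counts matching lines, -i ignores case, -- guards patterns starting with "-"
    count=$(pdftotext "$pdf" - | grep -c -i -- "$pattern")
    # numeric comparison; [[ $count > 0 ]] would compare strings
    (( count > 0 )) && echo -e "$count\t$pdf"
done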
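
Because grep -c counts matching lines, a line that mentions the pattern twice is counted only once. If the number of individual occurrences is wanted instead, one possible variation is to let grep -o print every match on its own line and count those lines. The sketch below compares the two on a made-up text sample, so it runs without any PDF; the function names count_lines and count_occurrences are illustrative and not part of the script above.

```shell
# Number of input lines that contain the pattern (what grep -c reports).
count_lines() {
    grep -c -i -- "$1"
}

# Number of individual matches: -o prints each match on a separate line.
count_occurrences() {
    grep -o -i -- "$1" | wc -l | tr -d ' '
}

# Made-up sample text standing in for the output of pdftotext.
sample='Cauliflower soup with roasted cauliflower
Steamed broccoli
cauliflower gratin'

printf '%s\n' "$sample" | count_lines "cauliflower"        # prints 2
printf '%s\n' "$sample" | count_occurrences "cauliflower"  # prints 3
```

In the script, the same change would amount to replacing grep -c -i with grep -o -i piped into wc -l. The tr call only strips the padding that some wc implementations put in front of the number.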