It turns out that pdfedit is pretty good at extracting text from pdf files. Here is a script I wrote to do that in batch mode.
#!/bin/sh
# Print the text from a pdf document on stdout
# Copyright: (c) 2006-2010 PDFedit team <http://sourceforge.net/projects/pdfedit>
# Copyright: (c) 2010, David Bremner <david@tethera.net>
# Licensed under version 2 or later of the GNU GPL
set -e
if [ $# -lt 1 ]; then
    echo "usage: $0 file [pageSep]"
    exit 1
fi
/usr/bin/pdfedit -console -eval '
function onConsoleStart() {
    var inName = takeParameter();
    var pageSep = takeParameter();
    var doc = loadPdf(inName, false);
    var pages = doc.getPageCount();
    for (var i = 1; i <= pages; i++) {
        var pg = doc.getPage(i);
        var text = pg.getText();
        print(text);
        print("\n");
        print(pageSep);
    }
}
' "$@"
Yeah, I wish #!/usr/bin/pdfedit worked too. Thanks to Aaron M Ucko for pointing out that -eval could replace the use of a temporary file.
Oh, and pdfedit will be even better when the authors release a new version that fixes the truncation of wide text.
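
For a whole directory of PDFs, a small wrapper along these lines might do. This is only a sketch: the function name extract_all and the script name pdf2text are made up for illustration, and it assumes the extractor takes the PDF path as its first argument and writes the text to stdout, as the script above does.

```shell
#!/bin/sh
# Sketch: run an extractor command on every *.pdf in a directory,
# writing the output next to each file as NAME.txt.
# extract_all is a hypothetical helper, not part of pdfedit.
extract_all() {
    extractor=$1    # command that prints a PDF's text on stdout
    dir=$2          # directory to scan (not recursive)
    for pdf in "$dir"/*.pdf; do
        # If nothing matches, the glob stays literal; skip it.
        [ -e "$pdf" ] || continue
        "$extractor" "$pdf" > "${pdf%.pdf}.txt"
    done
}
```

If the script above were saved as ./pdf2text (again, a made-up name), then extract_all ./pdf2text ~/papers would leave one .txt file beside each PDF. Passing the extractor as an argument keeps the loop independent of pdfedit, so it is easy to try with another tool.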