Saturday, October 15, 2011

OCR with Tesseract

Links: Script which appends each JPG to a larger file :: training information

Note: for extracting text from PDFs, pdftotext is a go-to CLI tool. It can be configured for better line breaks, etc., too.
This page is based on tesseract, but other OCR programs are available. Here's a quick scanning summary. Time: scan to binary JPGs at 200x200 lines. This makes output files of about 600K each, but who cares, since they will be deleted. It takes about 15 seconds per JPG to OCR them. Accuracy: appears to be about 95%.
  1. Scan at 200x200 binary
  2. $ tesseract input.jpeg output
To append a series of JPGs into a longer text file, there is a script (see Links above), or one can...
  1. Scan at 200x200 binary
  2. $ tesseract input.jpeg output
  3. Concatenate the text files, if in numbered order, with...
    $ cat *.txt > onebigfile.txt
From there, one can create a LaTeX file, etc.
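The scan, OCR, and concatenate steps above can be rolled into one small shell function. This is only a sketch, not the script mentioned in the Links line: the names ocr_dir and OCR_CMD are made up for illustration, and it assumes the page images are named so that they sort in reading order (page01.jpg, page02.jpg, and so on).

```shell
# Sketch only: ocr_dir and OCR_CMD are hypothetical names, not part of tesseract.
# OCR_CMD can name another program that takes "input outputbase" arguments.
OCR_CMD=${OCR_CMD:-tesseract}

# ocr_dir DIR OUTFILE
# OCR every .jpg/.jpeg in DIR (in glob order) and collect the text in OUTFILE.
ocr_dir() {
    dir=$1
    out=$2
    : > "$out"                      # start with an empty output file
    for img in "$dir"/*.jpg "$dir"/*.jpeg; do
        [ -e "$img" ] || continue   # skip a pattern that matched nothing
        base=${img%.*}              # tesseract appends .txt to this base name
        if "$OCR_CMD" "$img" "$base" >/dev/null 2>&1; then
            cat "$base.txt" >> "$out"
        else
            echo "OCR failed for $img" >&2
        fi
    done
}
```

Something like ocr_dir ./scans onebigfile.txt would then leave one text file per page in ./scans plus the combined file in the current directory.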

tesseract installation

ARCH:
# pacman -S tesseract-data-eng
This will pull in the English data files and the entire tesseract program.
SOURCE: Tesseract source was at Google Code (thanks, guys), and was at version 3.0 as I wrote this. Per usual, I didn't bother with reading the ReadMe or checking dependencies, just gave it a shot.

Standard configure, make, and # make install seemed to go well, but I found on first run that it couldn't find its libraries. I ran strace and saw it searching the wrong directories, so I re-ran configure as...
$ ./configure --prefix=/usr
...and all was fine. Or so I thought. I attempted to run the program and it couldn't open the JPEG file I was using as input. Time to return to Google Code and actually go through the ReadMe, apparently.

tesseract installation pt 2

The ReadMe indicated that two steps beyond the above would be required: 1) installing "Leptonica", if I wanted tesseract to OCR image files other than TIFFs (e.g., JPEGs), and 2) selecting language(s) to place into /usr/share/tessdata following the build and installation.

leptonica installation

The main site for leptonica is another Google Code site, but there are also many sites for users getting deeply into it as a physics analysis tool and so on. For my purposes, I just downloaded the source (version 1.68 as of this writing) and performed a standard configure, make, and # make install, with the small configure modification of
$ ./configure --prefix=/usr

tesseract installation pt 3

With leptonica apparently in without a problem, I built tesseract again so that the build process would recognize that leptonica was now in place. [Again, leptonica was installed to give tesseract the capacity to extract text from file types other than TIFFs (e.g., JPEGs).] Tesseract installed smoothly again with a standard configure, make, and # make install, again using the slight modification
$ ./configure --prefix=/usr

Tesseract language files were available at the same Google Code site where we initially retrieved tesseract itself. I selected English and Spanish (in order to demonstrate to students), but there appear to be roughly 50 languages available there. They all come as .gz files; simply unzip them and then add or remove them as desired from /usr/share/tessdata (following installation of tesseract). Most of us will only need to leave the English file in there: eng.traineddata.
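The unpack-and-move step might look like the sketch below. It assumes each download is a plain gzip of a single data file (for example a file named spa.traineddata.gz, which is an illustrative name); install_lang is a hypothetical helper, not something shipped with tesseract.

```shell
# Sketch only: install_lang is a hypothetical helper name.
# install_lang FILE.gz [TESSDATA_DIR]
# Unpack one gzipped language file into the tessdata directory.
install_lang() {
    archive=$1
    dest=${2:-/usr/share/tessdata}
    name=$(basename "$archive" .gz)     # strip the .gz suffix from the file name
    mkdir -p "$dest" || return 1
    # gunzip -c writes to stdout and leaves the downloaded archive in place
    gunzip -c "$archive" > "$dest/$name"
}
```

Writing into /usr/share/tessdata requires root, so the function would have to be run from a root shell; removing a language is just deleting its file from that directory.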

tesseract summary

The program read text at 100% accuracy, from pages scanned at 300 lines in b/w, and took about 8 seconds (on my old system) per image to convert. Even adding scanning time, this should be significantly more efficient than typing speeds for most people, though maybe those in the 90+ wpm category can just retype, not sure. The installation steps in summary:
  1. Unless only using TIFF inputs (eg faxes), verify installation of leptonica before building tesseract
  2. Download tesseract source, compile and install
  3. Download language files, unpack, and move (requires root) to installed /usr/share/tessdata directory
If converting an image into English text, one needn't specify the language. It's simply...
$ tesseract inputfile.jpeg outputname
...which produces outputname.txt. For iterations, a simple script can roll through image after image.
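For a language other than English, tesseract takes a -l option with the language code that matches the data file in tessdata (spa for spa.traineddata). A minimal wrapper, with the hypothetical names ocr_page and OCR_CMD, could pass the option only when a language is given:

```shell
# Sketch only: ocr_page and OCR_CMD are hypothetical names.
OCR_CMD=${OCR_CMD:-tesseract}

# ocr_page IMAGE OUTPUTBASE [LANG]
# Runs the OCR program; LANG (e.g. spa) is passed with -l only when given.
ocr_page() {
    if [ -n "${3:-}" ]; then
        "$OCR_CMD" "$1" "$2" -l "$3"
    else
        "$OCR_CMD" "$1" "$2"
    fi
}
```

So ocr_page carta.jpeg carta spa would write carta.txt using the Spanish data, while ocr_page letter.jpeg letter falls back to the default language.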

still to come

Time permitting, I will install the OCRFeeder GUI frontend and see if that adds any pleasing advantages.

forcing lib location during configure

More times than I can count, I've compiled and installed an application without errors, only to find that it can't locate its libraries at run time. I know the dependencies are there, so why can't the package find the libraries it installed itself? Annoying as hell.

So I run strace <app>, find where the app is looking, run find to locate the libraries, and then create soft links. This solves the problem, but sux when I might have to create 10 or 20 softlinks.
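The strace step can be narrowed down with a small filter. The sketch below is an assumption about how one might do it, not a standard tool: missing_libs is a made-up name, and it only recognizes files whose names start with lib and contain .so.

```shell
# Sketch only: missing_libs is a hypothetical helper name.
# Reads strace output on stdin and prints, once each, every quoted path
# named lib*.so* that appears on a line reporting ENOENT (file not found).
missing_libs() {
    grep 'ENOENT' \
        | grep -o '"[^"]*/lib[^/"]*\.so[^"]*"' \
        | tr -d '"' \
        | sort -u
}
```

A possible invocation is strace -f -e trace=file tesseract input.jpeg output 2>&1 | missing_libs, since strace writes its trace to stderr. The paths printed are the places the app looked and came up empty.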

I'd much rather avoid softlinks entirely and use "configure" options to force libs to go where they will be found, but there's an apparent Catch-22: I don't know in which directories the application will look for its libraries until after it is installed. So, although I'd like to force "configure" to install libs to those directories, how do I know where the application will look until I install it and attempt to run it? Additionally, is the answer within make or configure?


The answer, it would seem, is to force make to compile the app so that it both looks for its libs where I tell it to look, AND installs its libs into that location. Can it be done?

configure or make?

Theoretically, it should be possible to change the installation directories either through manipulation of make or through configure. In make it would presumably be through a configuration file, make.config or some such; in configure, by forcing the prefix each time, e.g.
$ ./configure --prefix=/usr
The easier route appears to be to change it in configure. By default, configure sets the installation prefix to /usr/local, and make install follows it. This means, for example, that libs will install into /usr/local/lib. The easiest way to repair this is via the configure "prefix" option as noted above. Using the command above, libs would be installed in /usr/lib instead of /usr/local/lib. It also means, however, that the bin file will install into /usr/bin instead of /usr/local/bin.
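As a sketch of that routine, the function below wraps the configure and make steps with the prefix forced. The name build_with_prefix is hypothetical; the conventional spelling of the option is --prefix, and whether libraries under /usr/local/lib are found at run time depends on how the system's dynamic linker is configured, which is why forcing /usr sidesteps the problem.

```shell
# Sketch only: build_with_prefix is a hypothetical helper name.
# build_with_prefix [PREFIX]
# Run from the top of an unpacked source tree that ships a configure script.
build_with_prefix() {
    prefix=${1:-/usr}
    # By default libs go to $prefix/lib and binaries to $prefix/bin
    ./configure --prefix="$prefix" || return 1
    "${MAKE:-make}" || return 1          # MAKE may name another make program
    echo "Built for $prefix; now run 'make install' as root."
}
```

The install step is left out on purpose, since it needs root and is the point of no return for where the files land.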


Although there's very likely something like a make.config file to change its settings, I was unable to locate such a file with some cursory searching. The fast solution appears to be forcing the issue in configure using, e.g.
$ ./configure --prefix=/usr