Saturday, October 15, 2011

OCR with Tesseract

Links: Script which appends each JPG to a larger file :: training information

Note: for extracting text from PDFs, pdftotext is a go-to CLI tool. It can also be configured for better line breaks, etc.
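As a rough illustration of that configuration, here is a minimal sketch of a wrapper around pdftotext (shipped with poppler-utils or Xpdf). The function name and the file names in the example are made up for this post; the three flags are standard pdftotext options, but which combination gives the best line breaks will depend on the PDF.

```shell
#!/bin/sh
# Sketch only: pdf_to_text is a hypothetical helper name, not part of pdftotext.
pdf_to_text() {
    # $1 = input PDF, $2 = output text file
    # -layout   keep the physical layout of the page as far as possible
    # -nopgbrk  do not insert form-feed characters between pages
    # -eol unix use Unix line endings
    pdftotext -layout -nopgbrk -eol unix "$1" "$2"
}

# Example (hypothetical file names):
#   pdf_to_text report.pdf report.txt
```

Dropping -layout gives plain reading-order text instead, which is sometimes easier to clean up afterwards.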
This page is based on tesseract, but other OCR programs are available. Here's a quick scanning summary. Time: scan to binary JPGs at 200x200 lines. This makes output files of about 600K each, but who cares, since they will be deleted. It takes about 15 seconds per JPG to OCR them. Accuracy: appears to be about 95%.
  1. Scan at 200x200 binary
  2. $ tesseract input.jpeg output
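To combine a series of JPGs into one longer text file, there is a script, or one can...
  1. Scan each page at 200x200 binary
  2. $ tesseract input.jpeg output (once per page, with a different output name each time)
  3. Concatenate the resulting text files, if their names sort in page order (zero-padded numbers such as 01, 02, ... 10, since the shell sorts names alphabetically), with...
    $ cat *.txt > onebigfile.txt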
From there, one can create a LaTeX file, etc.
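If the page files are not zero-padded, the concatenation step can be sketched as below. This is only an illustration: the function name and the "page" file prefix are assumptions, and it relies on GNU sort's -V (version sort) option and on file names without newlines.

```shell
#!/bin/sh
# Sketch: join per-page OCR output (page1.txt, page2.txt, ... page10.txt)
# into one file in page order. The names used here are hypothetical.
join_pages() {
    dir=$1   # directory holding the per-page .txt files
    out=$2   # combined output file; keep it outside the page*.txt pattern
    # sort -V puts page2 before page10, which a plain glob would not do
    printf '%s\n' "$dir"/page*.txt | sort -V |
    while IFS= read -r f; do
        cat "$f"
    done > "$out"
}

# Example:
#   join_pages ./scans onebigfile.txt
```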

tesseract installation

ARCH:
# pacman -S tesseract-data-eng
This will pull in the English data files and the entire tesseract program.
SOURCE: Tesseract source was at Google Code (thanks, guys), and was at version 3.0 as I wrote this. Per usual, I didn't bother with reading the documentation or checking dependencies, just gave it a shot.

Standard configure, make, and # make install seemed to go well, but I found on first run that the program couldn't find its libraries. I strace'd it and saw it looking in the wrong directories. I re-ran configure as...
$ ./configure --prefix=/usr
...and all was fine. Or so I thought. I attempted to run the program and it couldn't open the JPEG file I was using as input. Time to return to Google Code and actually go through the ReadMe, apparently.

tesseract installation pt 2

The ReadMe indicated that two steps beyond the above would be required: 1) installing "Leptonica", if I wanted tesseract to OCR image files other than TIFFs (e.g., JPEGs), and 2) selecting language(s) to place into /usr/share/tessdata following the build and installation.

leptonica installation

The main site for leptonica is another Google Code site, but there are also many sites for users getting deeply into it as a physics analysis tool and so on. For my purposes, I just downloaded the source (version 1.68 as of this writing) and performed a standard configure, make, and # make install, with the small configure modification of
$ ./configure --prefix=/usr

tesseract installation pt 3

With leptonica apparently in without a problem, I built tesseract again, so that the build process would recognize that leptonica was now in place. [Again, leptonica was installed to give tesseract the capacity to extract text from file types other than TIFFs (e.g., JPEGs).] Tesseract installed smoothly again with a standard configure, make, and # make install, again using the slight modification
$ ./configure --prefix=/usr
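The build order that finally worked (leptonica first, then tesseract, both with the /usr prefix) can be sketched as a small helper. The function name and the directory names in the example are placeholders, and using sudo for the install step is an assumption; running make install from a root shell does the same job.

```shell
#!/bin/sh
# Sketch: configure, build and install one unpacked source tree under /usr.
# build_into_usr is a hypothetical name used only for this illustration.
build_into_usr() {
    # $1 = unpacked source directory
    (
        cd "$1" || exit 1
        ./configure --prefix=/usr &&
        make &&
        sudo make install   # writing to /usr needs root
    )
}

# Example (directory names are placeholders):
#   build_into_usr ./leptonica-src   # first, so tesseract can find it
#   build_into_usr ./tesseract-src   # then tesseract itself
```

The subshell keeps the cd from changing the caller's working directory, and the && chain stops at the first step that fails.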

Tesseract language files were available at the same Google Code site where I initially retrieved tesseract itself. I selected English and Spanish (in order to demonstrate to students), but there appear to be roughly 50 languages available there. They all come as .gz files; simply unzip them and then add or remove them as desired from /usr/share/tessdata (following installation of tesseract). Most of us will only need to leave the English file in there: eng.traineddata.
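A minimal sketch of the unpack-and-place step follows. The function name is invented for this post, and the archive name in the example (spa.traineddata.gz) is an assumption about how the download is named; adjust it to whatever file you actually retrieved.

```shell
#!/bin/sh
# Sketch: unpack one downloaded language archive into the tessdata directory.
# install_traineddata is a hypothetical helper name.
install_traineddata() {
    archive=$1                           # e.g. spa.traineddata.gz (assumed name)
    tessdata=${2:-/usr/share/tessdata}   # default is the path used in this post
    # gunzip -c writes the unpacked data to stdout and leaves the archive intact
    gunzip -c "$archive" > "$tessdata/$(basename "$archive" .gz)"
}

# Example (run as root, since /usr/share/tessdata is not user-writable):
#   install_traineddata spa.traineddata.gz
```

To remove a language again, delete its .traineddata file from the same directory.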

tesseract summary

The program read text at 100% accuracy, from pages scanned at 300 lines in b/w, and took about 8 seconds (on my old system) per image to convert. Even adding scanning time, this should be significantly more efficient than typing speeds for most people, though maybe those in the 90+ wpm category can just retype, not sure. The installation steps in summary:
  1. Unless only using TIFF inputs (e.g., faxes), install leptonica and verify it before building tesseract
  2. Download the tesseract source, compile, and install
  3. Download the language files, unpack them, and move them (requires root) to the installed /usr/share/tessdata directory
If converting an image into English text, one needn't specify the language. It's simply...
$ tesseract inputfile.jpeg outputname
...which produces outputname.txt. For iterations, a simple script can roll through image after image.
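Such a script might look like the sketch below. The function name is made up, and passing the language with -l on every call is a choice made for illustration (for English it can be left off, as noted above); spa is the code for the Spanish data file.

```shell
#!/bin/sh
# Sketch: OCR every JPEG in a directory, one text file per image.
# ocr_all is a hypothetical helper name.
ocr_all() {
    dir=$1            # directory of scanned images
    lang=${2:-eng}    # language code; must match a file in tessdata
    for img in "$dir"/*.jpeg "$dir"/*.jpg; do
        [ -e "$img" ] || continue   # skip a pattern that matched nothing
        base=${img%.*}              # page1.jpg -> page1
        tesseract "$img" "$base" -l "$lang"   # writes "$base.txt"
    done
}

# Example:
#   ocr_all ./scans        # English
#   ocr_all ./scans spa    # Spanish
```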

still to come

Time permitting, I will install the OCRFeeder GUI frontend and see if that adds any pleasing advantages.
