Tesseract OCR Software Tutorial

A step-by-step guide for users to learn how to use Tesseract open-source software for performing optical character recognition (OCR) on a text corpus.

Improving Image Quality

Tesseract works best on high-quality images with clearly legible text. Converting the file to a TIFF already takes care of some of the image improvement, but if your image is skewed or difficult to read, it is best to edit the image beforehand.

ImageMagick is free software well suited to transforming images on the command line. Some transformations can be applied while you convert the image to TIFF, as in the previous section. One script built on ImageMagick is textcleaner, a robust tool for editing images from the command line. It can process scanned documents and clean up the background and other aspects of the image. As with most scripts, the options follow a set order of usage, and which ones you include depends on how you want to manipulate the image.

Note that this command assumes that your current working directory contains the textcleaner utility script.

./textcleaner [-r rotate] [-l layout] [-c cropoff] [-g] [-e enhance] [-f filtersize] [-o offset] [-u] [-t threshold] [-s sharpamt] [-S saturation] [-a adaptblur] [-T] [-p padamt] [-b bgcolor] [-F fuzzval] [-i invert] input_file.* output_file.*

Note: not all parameters are necessary every time. Each parameter has a default value, so you only need to specify the ones you want to change.
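Because the command assumes that textcleaner sits in the current working directory, a quick check before the first run can save some confusion. The following is only a sketch: the function name check_textcleaner is made up for this example, and it assumes that ImageMagick is installed as either the magick command (version 7) or the convert command (version 6).

```shell
#!/bin/sh
# Sketch of a preflight check. check_textcleaner is a hypothetical helper,
# not part of textcleaner or ImageMagick.
check_textcleaner() {
    dir=${1:-.}   # directory expected to hold the script (default: current one)

    if [ ! -f "$dir/textcleaner" ]; then
        echo "missing: $dir/textcleaner"
        return 1
    fi
    if [ ! -x "$dir/textcleaner" ]; then
        echo "not executable: try chmod +x $dir/textcleaner"
        return 1
    fi
    # Assumption: ImageMagick provides either `magick` or `convert` on the PATH.
    if ! command -v magick >/dev/null 2>&1 && ! command -v convert >/dev/null 2>&1; then
        echo "ImageMagick not found on PATH"
        return 1
    fi
    echo "ok"
}

check_textcleaner . || echo "Fix the issue above before running ./textcleaner."
```

If the check prints ok, the ./textcleaner commands in this section should at least be able to start.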

Here is a list of what each parameter means and the values that can be input:

  • -r cw or ccw or n
    • rotates the image clockwise (cw) or counterclockwise (ccw) by 90 degrees; the default value is n for no rotation
  • -l p or l
    • specifies whether the document layout is portrait or landscape; the default value is p for portrait
  • -c "n1,n2,n3,n4"
    • cropping offset applied after any rotation; one number crops all around, two numbers crop tops and sides, respectively, and all four values determine how much is cropped left, top, right, and bottom, respectively
  • -g
    • outputs grayscale
  • -e none or stretch or normalize
    • enhances brightness
  • -f integer > 0
    • size of the filter used to clean the background
  • -o integer >= 0
    • filter offset used to reduce noise
  • -u
    • unrotates (deskews) the image
  • -t 0–100
    • text smoothing threshold
  • -s float >= 0
    • amount of sharpening to apply, in pixels
  • -S integer >= 0
    • color saturation
  • -a number >= 0
    • alternate text smoothing
  • -T
    • trims the background around the outer part of the image
  • -p integer >= 0
    • adds a border of the given size
  • -b color
    • changes the background color; the default color is white
  • -F integer >= 0
    • fuzz value for determining the background color
  • -i 1 or 2
    • one-way (1) or two-way (2) inversion; the default is no inversion

Using the CWS Toolkit image, which is not a scan and is askew, we can run this command to clean it up:

./textcleaner -c "50,250,190,250" -g -e stretch -t 30 -s 2 -u -T Path/to/document/cws_toolkits.jpg cws_toolkits.tiff
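When a corpus contains many pages, the same clean-up can be applied in a loop. The sketch below is one possible approach, not part of textcleaner itself: the function names run and clean_all, the DRY_RUN variable, and the output folder name cleaned are all invented for this example. The -c crop values are left out because they are specific to a single page. By default the script only prints the commands it would run; set DRY_RUN=0 to execute them.

```shell
#!/bin/sh
# Sketch: apply one set of textcleaner options to every .jpg in a folder.
# run, clean_all, and DRY_RUN are hypothetical names used only in this example.

run() {
    # Print the command; execute it only when DRY_RUN=0.
    printf '%s\n' "$*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

clean_all() {
    src_dir=$1   # folder containing the original .jpg files
    out_dir=$2   # folder that will receive the cleaned .tiff files

    run mkdir -p "$out_dir"
    for img in "$src_dir"/*.jpg; do
        [ -e "$img" ] || continue          # nothing matched the pattern
        name=$(basename "$img" .jpg)       # file name without the extension
        run ./textcleaner -g -e stretch -t 30 -s 2 -u -T "$img" "$out_dir/$name.tiff"
    done
}

clean_all Path/to/document cleaned
```

Reviewing the printed commands first makes it easier to catch a wrong folder or option before any files are written.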

There are plenty of other ways you could have cleaned up the image. A drawback of this method compared with a regular image editor is that it requires a lot of trial and error.
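One way to make the trial and error less tedious is to generate several versions of the same image, each with a different value for one option, and then compare the results. The sketch below varies the -t threshold and keeps the other options from the example command. The names run, sweep_threshold, and DRY_RUN are invented for this illustration, and the threshold values 20, 30, and 40 are arbitrary starting points rather than recommendations. As written it only prints the commands; set DRY_RUN=0 to execute them.

```shell
#!/bin/sh
# Sketch: produce one output file per -t value so the results can be compared.
# run, sweep_threshold, and DRY_RUN are hypothetical names used only here.

run() {
    # Print the command; execute it only when DRY_RUN=0.
    printf '%s\n' "$*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

sweep_threshold() {
    input=$1
    shift                        # the remaining arguments are threshold values
    stem=$(basename "$input")
    stem=${stem%.*}              # drop the file extension
    for t in "$@"; do
        run ./textcleaner -c "50,250,190,250" -g -e stretch -t "$t" -s 2 -u -T \
            "$input" "${stem}_t${t}.tiff"
    done
}

sweep_threshold Path/to/document/cws_toolkits.jpg 20 30 40
```

Each output file carries the threshold in its name (for example, cws_toolkits_t30.tiff), so the version that looks best can be traced back to the value that produced it.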