Tesseract OCR Software Tutorial

A step-by-step guide for users to learn how to use Tesseract open-source software for performing optical character recognition (OCR) on a text corpus.

Basic Tesseract Usage

Once your files are in TIFF format and the images have been transformed to enhance the text, you can extract the text in a file into one of several formats, such as TXT or HTML.

The basic command takes the input file and an output base name, to which Tesseract adds .txt:

tesseract input_file.tiff output

To create a searchable PDF, add pdf to the end of the same command:

tesseract input_file.tiff output_file pdf

Try this code using the Pre-Health Requirements for CUNY Brooklyn document.

Because the file is already very clear, the basic output is accurate. For differently formatted documents or documents in other languages, you can add more parameters to increase the accuracy of Tesseract.

tesseract input_file.tiff output [-l language] [--oem OCR engine mode] [--psm page segmentation mode] [configfile...]

Commonly used parameters available:

  • -l language
    • for the list of languages installed on your system, enter tesseract --list-langs into the terminal
  • --oem integer 0-3
    •  0  legacy engine only
    •  1  neural nets long short-term memory (LSTM) engine only
    •  2  legacy and LSTM engines
    •  3  default, based on what is available
  • --psm integer 0-13
    •  0    orientation and script detection only
    •  1    automatic page segmentation with OSD
    •  2    automatic page segmentation, but no OSD, or OCR
    •  3    default, fully automatic page segmentation, but no OSD 
    •  4    assume a single column of text of variable sizes.
    •  5    assume a single uniform block of vertically aligned text.
    •  6    assume a single uniform block of text.
    •  7    treat the image as a single text line.
    •  8    treat the image as a single word.
    •  9    treat the image as a single word in a circle.
    • 10   treat the image as a single character.
    • 11   sparse text. Find as much text as possible in no particular order.
    • 12   sparse text with OSD.
    • 13   raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific.
  • configfile
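    • Config files and config variables allow for other settings, such as changing the output format. For a full list of variables, you can enter tesseract --print-parameters into the terminal.
    • Some important variables, shown with their default value of 0 (set one to 1 to turn it on, for example with -c tessedit_create_hocr=1):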
      • tessedit_write_unlv 0    .unlv output file
      • tessedit_create_txt 0    .txt output file (default)
      • tessedit_create_hocr 0  .html hOCR output file
      • tessedit_create_tsv 0    .tsv output file
      • tessedit_create_pdf 0    .pdf output file
      • textonly_pdf 0 creates PDF with only one invisible text layer
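
As a rough illustration of how the parameters above combine, the sketch below wraps one Tesseract call in a small shell function. The function name ocr_with_options and the TESSERACT override variable are assumptions made for this example (the override only makes it easy to preview the command with echo). The flags and config variables are the ones listed above, with -c used to set config variables on the command line.

```shell
# Sketch only: ocr_with_options and TESSERACT are hypothetical names for this example.
ocr_with_options() {
    local input="$1"
    local output_base="$2"
    # -l eng        use the English language data
    # --oem 1       LSTM engine only
    # --psm 6       assume a single uniform block of text
    # -c VAR=VALUE  set a config variable; here, request both .txt and hOCR output
    "${TESSERACT:-tesseract}" "$input" "$output_base" \
        -l eng --oem 1 --psm 6 \
        -c tessedit_create_txt=1 -c tessedit_create_hocr=1
}

# Example call (requires Tesseract to be installed):
# ocr_with_options input_file.tiff output
```

Setting TESSERACT=echo before the call prints the arguments instead of running OCR, which is a cheap way to check the command before processing real files.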
 
To convert a file written in a different language, like the Sanskrit example, type in:
tesseract /Path/to/file/sanskrit.png output -l san pdf
 
Note: You can use more than one language in Tesseract. However, the order matters and can change the output of the document. For example, -l eng+chi_tra will produce different output than -l chi_tra+eng. For more, see the Tesseract command-line tutorial.
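
Because each language in a combined -l value has to be installed, it may help to check before running OCR. The following sketch compares a language string such as eng+chi_tra against the text printed by tesseract --list-langs. The function name missing_langs is made up for this example, and it assumes the installed languages are listed one per line.

```shell
# Sketch only: missing_langs is a hypothetical helper, not part of Tesseract.
# $1: language string as passed to -l, e.g. "eng+chi_tra"
# $2: text printed by "tesseract --list-langs"
# Prints each requested language that does not appear in the list.
missing_langs() {
    local spec="$1"
    local installed="$2"
    # Split the string on "+" and test each language as a whole-line match.
    printf '%s\n' "$spec" | tr '+' '\n' | while IFS= read -r lang; do
        printf '%s\n' "$installed" | grep -qxF -- "$lang" || printf '%s\n' "$lang"
    done
}

# Example call (requires Tesseract to be installed):
# missing_langs "eng+chi_tra" "$(tesseract --list-langs 2>&1)"
```

If the function prints nothing, every requested language was found in the list.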
 

Using Tesseract to Automate Processing Many Files

To convert multiple files in one step, run the following bash command from within the folder containing the input files (or, alternatively, use an absolute path when defining the directory to crawl in the "for" part of this loop):

for file in *.jpg; do tesseract "$file" "${file%.jpg}"; done

This loops over the current directory, matching every filename that ends in .jpg. Adjust the pattern if you are dealing with another file type (e.g. png) or have filenames that lack extensions. Keep the asterisk, which functions as a wildcard that matches any characters in the filename.

Within the "do" portion of this loop, you are running Tesseract as you normally would, with the command followed by the input filename, then the desired output filename. The input filename comes from the loop variable, i.e. the filename itself. The output filename is created by removing the .jpg extension, leaving just the remainder of the filename. Tesseract then appends .txt to this stub by default. You can adjust this to suit the desired behavior, including using an absolute path to place the Tesseract output files somewhere else on the system.
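
One possible variation on the loop above is sketched below: it reads images from one folder and writes the output to another, and it skips the loop body when no .jpg files are found. The names ocr_folder, in_dir, out_dir, and the TESSERACT override variable are assumptions for this example, not part of Tesseract.

```shell
# Sketch only: ocr_folder and TESSERACT are hypothetical names for this example.
# $1: folder containing .jpg files   $2: folder to receive the output files
ocr_folder() {
    local in_dir="$1"
    local out_dir="$2"
    local file name
    mkdir -p "$out_dir"
    for file in "$in_dir"/*.jpg; do
        # When nothing matches, the pattern is left as-is; skip it.
        [ -e "$file" ] || continue
        # Strip the folder and the .jpg extension to get the output stub.
        name=$(basename "$file" .jpg)
        "${TESSERACT:-tesseract}" "$file" "$out_dir/$name"
    done
}

# Example call (requires Tesseract to be installed):
# ocr_folder /Path/to/scans /Path/to/text
```

Quoting the variables keeps filenames with spaces intact, and Tesseract still appends .txt to each stub in the output folder.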