Tesseract OCR Software Tutorial

A step-by-step guide for users to learn how to use Tesseract open-source software for performing optical character recognition (OCR) on a text corpus.

Converting Images and Files

Now that you've installed all the packages you will need, we can manipulate and convert the files. Tesseract recognizes text in images, so it is best to check first whether a file already has a text layer. We can check this with Xpdf's pdftotext tool, which writes any embedded text to a .txt document. This is also a helpful tool if you simply want to extract the text from a file.

In the terminal, input this code (using the path for your stored document on your system):

      pdftotext /Path/to/document/verweij_2015.pdf verweij_2015.txt

      open verweij_2015.txt

Note: To find the path of a document, you can drag the file into the terminal window and the path will be filled in for you.

This will output a text file named verweij_2015.txt; you can choose a different output name if you prefer. As you can see, this PDF already has text embedded.

To see what happens when a file does not have text embedded, type into the terminal:

      pdftotext /Path/to/document/prehealth_reqs.pdf prehealth_reqs.txt

      open prehealth_reqs.txt

The file will come up blank. Because this PDF has no embedded text, it needs to be converted to a TIFF file before Tesseract can extract the text. Converting the document is simple; just enter:

      convert /Path/to/document/prehealth_reqs.pdf prehealth_reqs.tiff
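The check and the conversion above can be combined into one small script. This is only a minimal sketch: the helper name `has_text` is hypothetical, the path is the same placeholder used in the tutorial, and it assumes that a text file containing nothing but whitespace means the PDF has no text layer.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the text file holds any non-whitespace character.
has_text() {
    grep -q '[^[:space:]]' "$1"
}

pdf="/Path/to/document/prehealth_reqs.pdf"   # placeholder path, as in the tutorial
name="prehealth_reqs"

# Run only when the PDF exists and pdftotext is installed.
if [ -f "$pdf" ] && command -v pdftotext >/dev/null 2>&1; then
    pdftotext "$pdf" "$name.txt"
    if has_text "$name.txt"; then
        echo "Embedded text found; see $name.txt"
    else
        echo "No embedded text; converting to TIFF"
        convert "$pdf" "$name.tiff"
    fi
fi
```

With the placeholder path left unchanged the script does nothing, so substitute the real location of your document before running it.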

There are also some image manipulations that can be applied during conversion to improve the quality of the TIFF file:

      convert -density 300 /Path/to/document/prehealth_reqs.pdf -depth 8 -strip -background white -alpha off prehealth_reqs.tiff

Here is what each part of the command means:

  • convert
    • converts a document from one file format to another
  • -density width or -density widthxheight
    • sets the resolution, in dots per inch, at which the PDF is rendered
  • -depth value
    • sets the bit depth of the image (bits per color channel)
  • -strip
    • strips the document of any comments, profiles, or other extraneous information
  • -background color
    • sets the background color
  • -alpha type
    • controls the transparency (alpha) channel; off disables it, so the image is treated as fully opaque

Again, other names can be used for the outputs. To convert a PNG or JPEG, the same command can be used as long as the extension of the input file is changed to match.
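As a minimal sketch of reusing the same options for other inputs, the snippet below wraps the conversion in a helper that names the output after the input. The function names `tiff_name` and `to_tiff` and the example file `scan.png` are hypothetical, not part of the tutorial; the options are the ones used in the command above.

```shell
#!/bin/sh
# Hypothetical helper: derive "name.tiff" from an input path of any extension.
tiff_name() {
    base=$(basename "$1")
    printf '%s.tiff\n' "${base%.*}"
}

# Hypothetical helper: convert one PDF, PNG, or JPEG with the tutorial's options.
to_tiff() {
    convert -density 300 "$1" -depth 8 -strip -background white -alpha off "$(tiff_name "$1")"
}

# Show the output names that would be produced (no conversion is run here).
tiff_name /Path/to/document/prehealth_reqs.pdf   # prints prehealth_reqs.tiff
tiff_name scan.png                               # prints scan.tiff
```

Calling `to_tiff scan.png` would then write scan.tiff to the current directory, assuming ImageMagick's convert is installed and on your PATH.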

Note: If the input PDF has multiple pages, the resulting TIFF file will store each page of the original PDF as a separate page (layer) within the TIFF.
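One way to confirm that every page made it into the TIFF is to count its pages with ImageMagick's identify tool, which prints one line per page. This is a sketch under that assumption; the helper name `count_pages` is hypothetical, and the file name is the output used earlier.

```shell
#!/bin/sh
# Hypothetical helper: count the pages in a TIFF.
# identify prints one line per page, so counting lines gives the page count.
count_pages() {
    identify "$1" | wc -l | tr -d ' '
}

tiff="prehealth_reqs.tiff"   # output name used earlier in the tutorial

# Run only when the TIFF exists and identify is installed.
if [ -f "$tiff" ] && command -v identify >/dev/null 2>&1; then
    echo "$tiff has $(count_pages "$tiff") page(s)"
fi
```

If the count is lower than the number of pages in the original PDF, the conversion likely stopped early and is worth rerunning.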