
Tesseract OCR Software Tutorial: Home

A step-by-step guide for users to learn how to use Tesseract open-source software for performing optical character recognition (OCR) on a text corpus.

General Information


COVID-19 Service Status

Data Services has shifted to virtual services for the Summer 2020 sessions. During our normal working hours, we will respond to requests via e-mail and hold consultations via Zoom when necessary.

Staffed Hours: Summer 2020
   Mondays:    12pm - 6pm
   Tuesdays:   12pm - 6pm
   Wednesdays: 12pm - 6pm
   Thursdays:  12pm - 6pm
   Fridays:    12pm - 4pm

To contact us, submit a request or email data.services@nyu.edu.


Welcome

Tesseract is an open-source optical character recognition (OCR) engine. OCR extracts text from images and from documents that lack a text layer, then writes the result to a new searchable file such as plain text, PDF, or another common format. Tesseract is highly customizable and supports most languages, including multilingual documents and vertical text. Although the software also runs on Windows and Linux, this guide covers macOS, where Tesseract is run from the Terminal application.

The goals of this guide are to learn how to:

  • Download and install Tesseract OCR and a package of other useful tools
  • Convert PDFs, JPGs, and PNGs into TIFFs
  • Improve image quality, which in turn improves the quality of Tesseract's output
  • Easily use Tesseract
  • Make a whitelist (list of characters permitted)
  • Batch convert (convert many files at once)
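
As a preview of where these goals lead, here is a minimal sketch in Python of how a Tesseract run with a whitelist, and a batch run over a folder, might be assembled. It is an illustration only, not the method used in the rest of this guide, which works directly in the terminal. The function names build_tesseract_command and batch_ocr are hypothetical, and the sketch assumes Tesseract is already installed, that "eng" is the language you want, and that your images are already TIFFs. The -l option, the -c tessedit_char_whitelist=... setting, and the trailing output-format name (such as txt or pdf) are standard Tesseract command-line arguments.

```python
import subprocess
from pathlib import Path


def build_tesseract_command(image, output_base, lang="eng",
                            whitelist=None, output_format="txt"):
    """Return the argument list for one Tesseract run (nothing is executed)."""
    cmd = ["tesseract", str(image), str(output_base), "-l", lang]
    if whitelist:
        # Restrict recognition to the permitted characters.
        cmd += ["-c", f"tessedit_char_whitelist={whitelist}"]
    # A trailing config name such as "txt" or "pdf" selects the output format.
    cmd.append(output_format)
    return cmd


def batch_ocr(folder, lang="eng", whitelist=None, output_format="txt"):
    """Run Tesseract on every .tif/.tiff file in `folder`.

    Assumes the `tesseract` executable is installed and on the PATH.
    """
    for image in sorted(Path(folder).iterdir()):
        if image.suffix.lower() in {".tif", ".tiff"}:
            # Write the output next to the image, using the same base name.
            output_base = image.with_suffix("")
            cmd = build_tesseract_command(
                image, output_base, lang, whitelist, output_format
            )
            subprocess.run(cmd, check=True)
```

For example, batch_ocr("scans", whitelist="0123456789") would process each TIFF in a hypothetical folder named scans and restrict the recognized characters to digits. Keeping the command builder separate from the function that runs it makes it easy to print and inspect a command before executing anything.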

Start Here

CC

Original work in this LibGuide is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.