Tesseract OCR Software Tutorial
A step-by-step guide to using the open-source Tesseract software to perform optical character recognition (OCR) on a text corpus.
Tesseract is an open-source optical character recognition (OCR) platform. OCR extracts text from images and documents that lack a text layer and writes it to a new searchable text file, PDF, or most other popular formats. Tesseract is highly customizable and works with most languages, including multilingual documents and vertical text. Although the software also runs on Windows and Linux, this guide is written for macOS, where Tesseract is run from the Terminal application.
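As a rough preview of what this looks like in the terminal, the sketch below runs Tesseract on a single image and produces both a plain-text file and a searchable PDF. The file name `page.tiff`, the output base name `page`, and the wrapper function names are hypothetical and chosen only for illustration. The sketch also assumes that Tesseract and its English language data are already installed.

```shell
#!/bin/sh
# Hypothetical file names, used only for illustration.
INPUT="page.tiff"     # a scanned page image with no text layer
OUTPUT_BASE="page"    # Tesseract appends the file extension itself

# Two thin wrappers around the tesseract command:
#   ocr_text writes plain text      (page.txt)
#   ocr_pdf  writes a searchable PDF (page.pdf)
ocr_text() { tesseract "$1" "$2" -l eng; }
ocr_pdf()  { tesseract "$1" "$2" -l eng pdf; }

# Run only if Tesseract is installed and the input image exists.
if command -v tesseract >/dev/null 2>&1 && [ -f "$INPUT" ]; then
    ocr_text "$INPUT" "$OUTPUT_BASE"
    ocr_pdf  "$INPUT" "$OUTPUT_BASE"
else
    echo "Skipping: tesseract or $INPUT not found"
fi
```

The `-l eng` option selects the English language data. For a document in another language, it would be replaced with the code for that language, provided the matching language data is installed.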
The goals of this guide are to learn how to:
Download and install Tesseract OCR and a package of other useful tools
Convert images from PDFs, JPGs, and PNGs into TIFFs
Improve image quality, which in turn improves the quality of Tesseract's output