Data Services Class Descriptions

Information, materials, and schedules for all currently offered Data Services classes.
This course introduces extracting and organizing textual and tabular data with the Optical Character Recognition (OCR) tools Tesseract and ABBYY FineReader. OCR can significantly reduce manual data entry and enables digital analysis of non-digital materials.
Software: ABBYY FineReader 14, Tesseract 3
Duration: 90 min

Room description:

Some tutorials are held remotely and require NYU sign-on to access, while others are held in person with no remote component. Please note the correct modality and location of the tutorial when registering.

Prerequisites: None
Skills Taught / Learning Outcomes:
  • Extracting text from images of documents using ABBYY and Tesseract
  • Training ABBYY to recognize new fonts from historical documents
  • Preparing and converting images for OCR readers
Class Materials:

Slides: https://nyu-dataservices.github.io/ExtractingOCR/

Related Classes:

Introduction to ATLAS.ti

Introduction to NVivo

Introduction to R

Additional Training Materials:

Official how-to videos

NYU Libraries ABBYY guide

Feedback: bit.ly/feedbackds
