Tesseract OCR Software Tutorial
A step-by-step guide for users to learn how to use Tesseract open-source software for performing optical character recognition (OCR) on a text corpus.
Once your files are in TIFF format and the images have been processed to enhance the text, you can extract the text from each file into several formats, such as TXT or HTML.
The code is very simple:
tesseract input_file.tiff output
To create a searchable PDF, run the same command with one addition:
tesseract input_file.tiff output_file pdf
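The two commands above can also be applied to a whole folder of scans. The following is a minimal shell sketch rather than part of the original tutorial: the folder name scans and the helper name ocr_folder are assumptions chosen for illustration. It runs the plain-text command and the searchable-PDF command for every .tiff or .tif file it finds.

```shell
# Sketch: OCR every TIFF in a folder to plain text and to a searchable PDF.
# "scans" and "ocr_folder" are hypothetical names; change them to suit your files.
ocr_folder() {
  dir="$1"
  for f in "$dir"/*.tiff "$dir"/*.tif; do
    [ -e "$f" ] || continue        # skip a pattern that matched no files
    base="${f%.*}"                 # file path without its extension
    tesseract "$f" "$base"         # plain-text output next to the image
    tesseract "$f" "$base" pdf     # searchable PDF next to the image
  done
}

# Only run when Tesseract is actually installed.
if command -v tesseract >/dev/null 2>&1; then
  ocr_folder "scans"
else
  echo "tesseract was not found on your PATH" >&2
fi
```

Each output file takes the name of its image, so page1.tiff would give page1.txt and page1.pdf in the same folder.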
Try these commands using the Pre-Health Requirements for CUNY Brooklyn document.
Because the file is already very clear, the basic output is accurate. For differently formatted documents or documents in other languages, you can add more parameters to increase the accuracy of Tesseract.
-l language code, such as eng or san. For the full list of languages available on your system, enter tesseract --list-langs into the terminal.
--oem integer 0-3 (OCR engine mode)
0 legacy engine only
1 neural nets long short-term memory (LSTM) engine only
2 legacy and LSTM engines
3 default, based on what is available
--psm integer 0-13 (page segmentation mode)
0 orientation and script detection (OSD) only
1 automatic page segmentation with OSD
2 automatic page segmentation, but no OSD or OCR (not implemented)
3 default, fully automatic page segmentation, but no OSD
4 assume a single column of text of variable sizes.
5 assume a single uniform block of vertically aligned text.
6 assume a single uniform block of text.
7 treat the image as a single text line.
8 treat the image as a single word.
9 treat the image as a single word in a circle.
10 treat the image as a single character.
11 sparse text. Find as much text as possible in no particular order.
12 sparse text with OSD.
13 raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific.
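The engine mode and page segmentation mode listed above are passed on the command line as --oem and --psm. The sketch below is an illustration only: run_ocr is a hypothetical helper name, and input_file.tiff and output are the placeholder names used earlier in this guide. It checks that the two values fall inside the listed ranges before calling Tesseract.

```shell
# Sketch: run Tesseract with an explicit engine mode and page segmentation mode.
# run_ocr is a hypothetical helper, not a Tesseract command.
run_ocr() {
  image="$1"; output="$2"; oem="$3"; psm="$4"
  # Reject values outside the ranges listed above (oem 0-3, psm 0-13).
  case "$oem" in
    [0-3]) ;;
    *) echo "oem must be between 0 and 3" >&2; return 1 ;;
  esac
  case "$psm" in
    [0-9]|1[0-3]) ;;
    *) echo "psm must be between 0 and 13" >&2; return 1 ;;
  esac
  tesseract "$image" "$output" --oem "$oem" --psm "$psm"
}

# Example: LSTM engine only (1), page treated as a single uniform block of text (6).
if command -v tesseract >/dev/null 2>&1 && [ -f input_file.tiff ]; then
  run_ocr input_file.tiff output 1 6
fi
```

Which mode works best depends on the page layout, so it may be worth trying a few values on one sample page before processing a whole corpus.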
Tesseract also has configuration parameters that control other behavior, such as the output format. For a full list, you can enter tesseract --print-parameters into the terminal.
Some important parameters (name, default value, description):
tessedit_write_unlv 0 .unlv output file
tessedit_create_txt 0 .txt output file (default)
tessedit_create_hocr 0 .html hOCR output file
tessedit_create_tsv 0 .tsv output file
tessedit_create_pdf 0 .pdf output file
textonly_pdf 0 creates PDF with only one invisible text layer
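Parameters like the ones listed above can be set for a single run with Tesseract's -c option, written as -c name=value. The following sketch assumes the placeholder file names used earlier in this guide, and ocr_hocr_tsv is a hypothetical helper name. It switches on hOCR and TSV output for one image.

```shell
# Sketch: turn on hOCR and TSV output for one run using -c name=value.
# ocr_hocr_tsv is a hypothetical helper, not a Tesseract command.
ocr_hocr_tsv() {
  image="$1"; output="$2"
  tesseract "$image" "$output" \
    -c tessedit_create_hocr=1 \
    -c tessedit_create_tsv=1
}

if command -v tesseract >/dev/null 2>&1; then
  # Show only the output-related parameters from the full list.
  tesseract --print-parameters | grep tessedit_create
  # Run the helper only when the sample image is present.
  if [ -f input_file.tiff ]; then
    ocr_hocr_tsv input_file.tiff output
  fi
fi
```

Filtering the parameter list with grep is optional; it simply makes the long output of --print-parameters easier to scan.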
To convert a file written in a different language, like the Sanskrit example, type in:
tesseract /Path/to/file/sanskrit.png output pdf -l san
Note: You can use more than one language in Tesseract. However, the order matters and can change the output of the document. For example, -l eng+chi_tra will give a different output than -l chi_tra+eng. For more, see the Tesseract command-line tutorial.
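Before using -l, it can help to confirm that the language data you need is installed. The sketch below is an illustration under stated assumptions: has_lang is a hypothetical helper that searches the output of tesseract --list-langs, and input_file.tiff, eng_first, and chi_tra_first are placeholder names. It then runs the same image with both language orders so the two results can be compared.

```shell
# Sketch: check for installed language data, then compare two language orders.
# has_lang is a hypothetical helper, not a Tesseract command.
has_lang() {
  # --list-langs prints one language code per line; -x matches a whole line.
  tesseract --list-langs 2>&1 | grep -qx "$1"
}

if command -v tesseract >/dev/null 2>&1; then
  for lang in eng chi_tra; do
    has_lang "$lang" || echo "Language data for $lang is not installed." >&2
  done
  if has_lang eng && has_lang chi_tra && [ -f input_file.tiff ]; then
    tesseract input_file.tiff eng_first -l eng+chi_tra
    tesseract input_file.tiff chi_tra_first -l chi_tra+eng
    # diff exits with a non-zero status when the two text files differ.
    diff eng_first.txt chi_tra_first.txt || echo "The two language orders gave different text."
  fi
fi
```

The same check works for the Sanskrit example by calling has_lang san before running the command with -l san.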