Once your files are in TIFF format and the images have been processed to enhance the text, you can extract the text from each file into several formats, such as TXT or HTML.
The basic command is simple, and writes the recognized text to output.txt:
tesseract input_file.tiff output
To create a searchable PDF, run the same command with one addition:
tesseract input_file.tiff output_file pdf
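When there is more than one page to process, the two commands above can be wrapped in a small shell function. This is a minimal sketch rather than part of the original workflow: the function name `ocr_txt_and_pdf` and the `TESSERACT` variable are hypothetical, and it assumes that listing both `txt` and `pdf` after the output base makes Tesseract write both files in a single run.

```shell
#!/bin/sh
# Sketch: OCR every file given as an argument into a .txt and a searchable .pdf.
# TESSERACT is a hypothetical override for the executable name (useful for dry runs).
# Assumes every file name has an extension such as .tiff or .tif.

ocr_txt_and_pdf() {
    for image in "$@"; do
        # Strip the extension to get the output base: scan01.tiff -> scan01
        base="${image%.*}"
        # "txt pdf" requests both output formats in one pass (assumption, see above).
        "${TESSERACT:-tesseract}" "$image" "$base" txt pdf || return 1
    done
}

# Example:  ocr_txt_and_pdf *.tiff
# Dry run that only prints the arguments:  TESSERACT=echo ocr_txt_and_pdf *.tiff
```

Deriving the output base from the input name keeps each page's text and PDF next to its source image, which makes it easier to spot files that failed.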
Try these commands on the Pre-Health Requirements for CUNY Brooklyn document.
Because the file is already very clear, the basic output is accurate. For documents with different layouts or in other languages, you can add parameters to improve Tesseract's accuracy.
tesseract input_file.tiff output [-l language] [--oem OCR engine mode] [--psm page segmentation mode] [configfile...]
From the Tesseract GitHub documentation:
- -l language
- for the full list of supported languages, enter tesseract --list-langs into the terminal
- --oem integer 0-3
- 0 legacy engine only
- 1 neural nets long short-term memory (LSTM) engine only
- 2 legacy and long short-term memory engine
- 3 default, based on what is available
- --psm integer 0-13
- 0 orientation and script detection (OSD) only
- 1 automatic page segmentation with OSD
- 2 automatic page segmentation, but no OSD, or OCR
- 3 default, fully automatic page segmentation, but no OSD
- 4 assume a single column of text of variable sizes.
- 5 assume a single uniform block of vertically aligned text.
- 6 assume a single uniform block of text.
- 7 treat the image as a single text line.
- 8 treat the image as a single word.
- 9 treat the image as a single word in a circle.
- 10 treat the image as a single character.
- 11 sparse text. Find as much text as possible in no particular order.
- 12 sparse text with OSD.
- 13 raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific.
- configfile: config files set parameters that allow for other configurations, such as changing the output format. For a full list of parameters, enter tesseract --print-parameters into the terminal.
- Some important parameters:
tessedit_write_unlv 0 .unlv output file
tessedit_create_txt 0 .txt output file (default)
tessedit_create_hocr 0 .html hOCR output file
tessedit_create_tsv 0 .tsv output file
tessedit_create_pdf 0 .pdf output file
textonly_pdf 0 creates PDF with only one invisible text layer
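To show how these options combine on one command line, here is a hedged sketch. The function name `ocr_block_hocr` and the `TESSERACT` variable are hypothetical, and the flag values are chosen for illustration only (English, LSTM engine, single uniform block of text). It assumes Tesseract 4 or later with the English data installed, and that the `-c` option can be used to set one of the parameters listed above for a single run.

```shell
#!/bin/sh
# Sketch: combine language, engine mode, page segmentation mode and a parameter.
# TESSERACT is a hypothetical override for the executable name.

ocr_block_hocr() {
    image="$1"   # input image, e.g. page.tiff
    base="$2"    # output base name, without extension
    # -l eng     : recognize English
    # --oem 1    : LSTM engine only
    # --psm 6    : assume a single uniform block of text
    # -c ...=1   : switch on hOCR output for this run
    "${TESSERACT:-tesseract}" "$image" "$base" \
        -l eng --oem 1 --psm 6 \
        -c tessedit_create_hocr=1
}

# Example:  ocr_block_hocr page.tiff page
```

If the result looks worse than the default, trying a different `--psm` value is usually the first thing to change, since the right mode depends on the page layout.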
To OCR a file written in another language, such as the Sanskrit example, enter:
tesseract /Path/to/file/sanskrit.png output -l san pdf
You can use more than one language in Tesseract; however, the order matters and can change the output. For example, -l eng+chi_tra will produce different output than -l chi_tra+eng. For more, follow this link
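Because a missing language pack is a common cause of failure, it can help to check the installed languages before running a multi-language job. The sketch below is illustrative only: `have_langs` and `TESSERACT` are hypothetical names, and it assumes that `tesseract --list-langs` prints one language code per line after a header line.

```shell
#!/bin/sh
# Sketch: succeed only if every language in a spec such as "eng+chi_tra"
# appears in the output of `tesseract --list-langs`.
# TESSERACT is a hypothetical override for the executable name.

have_langs() {
    spec="$1"
    # 2>&1 in case the list is printed on stderr rather than stdout.
    installed=$("${TESSERACT:-tesseract}" --list-langs 2>&1)
    # Split the spec on "+" into the function's positional parameters.
    old_ifs=$IFS
    IFS='+'
    set -- $spec
    IFS=$old_ifs
    for lang in "$@"; do
        # -x matches whole lines only, -F treats the code as a fixed string.
        printf '%s\n' "$installed" | grep -qxF -e "$lang" || {
            echo "missing language: $lang" >&2
            return 1
        }
    done
}

# Example:  have_langs eng+chi_tra && tesseract page.tiff page -l eng+chi_tra
```

The check does not change the order of the languages, so the spec you test is exactly the one you pass to `-l` afterwards.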