Once your files are in TIFF format and the images have been transformed to enhance the text, you can extract the text from each file into one of several formats, such as TXT or HTML.
The code is very simple:
tesseract input_file.tiff output
To create a searchable PDF, run the same command with one addition:
tesseract input_file.tiff output_file pdf
Try this code using the Pre-Health Requirements for CUNY Brooklyn document.
Because the file is already very clear, the basic output is accurate. For differently formatted documents or documents in other languages, you can add more parameters to increase the accuracy of Tesseract.
tesseract input_file.tiff output [-l language] [--oem OCR engine mode] [--psm page segmentation mode] [configfile...]
Commonly used parameters available:
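Three of these appear in the usage line above: -l sets the language, --oem the OCR engine mode, and --psm the page segmentation mode. The following is a minimal sketch of combining them, not a recommended configuration. It assumes Tesseract 4 or later (where --oem 1 selects the LSTM engine) and that the language data for the code passed to -l is installed. The function name ocr_with_options and its default values are hypothetical, chosen only for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper (not part of Tesseract): runs OCR with an explicit
# language, engine mode, and page segmentation mode.
#   $1 input image
#   $2 output base (no extension; Tesseract adds .txt)
#   $3 language code (default: eng)
#   $4 page segmentation mode (default: 3, fully automatic)
ocr_with_options() {
  local input="$1" output_base="$2" lang="${3:-eng}" psm="${4:-3}"
  # --oem 1 selects the LSTM engine; this assumes Tesseract 4 or later.
  tesseract "$input" "$output_base" -l "$lang" --oem 1 --psm "$psm"
}

# Example call (assumes French language data is installed;
# --psm 6 treats the page as a single uniform block of text):
#   ocr_with_options input_file.tiff output fra 6
#
# To see what your own installation supports:
#   tesseract --list-langs
#   tesseract --help-psm
#   tesseract --help-oem
```

Which segmentation mode works best depends on the layout of the page, so it is worth comparing the output of a few modes on a sample document before processing a whole collection.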
To convert multiple files in one step, run the following bash command from within the folder containing the input files (or, alternatively, use an absolute path when defining the directory to crawl in the "for" part of this loop):
for file in *.jpg; do tesseract "$file" "${file%.jpg}"; done
This will crawl the directory you are running the bash command from, matching every filename that ends in .jpg. Adjust the pattern accordingly if you are dealing with another file type (e.g. png) or have filenames that lack extensions. Keep the asterisk, which functions as a wildcard matching any sequence of characters in the filename.
Within the "do" portion of this loop, you are running Tesseract as you normally would: the command, followed by the input filename, then the desired output filename. The input filename is taken from the loop variable, i.e. the filename itself. The output filename is created by removing the .jpg extension, leaving just the remainder of the filename. Tesseract then appends .txt to this stub filename by default. This can also be adjusted to suit the desired behavior, including using an absolute path to place the Tesseract output files somewhere else on the system.
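The adjustments described above (another file type, output placed elsewhere on the system) can be sketched as a small bash function. This is one possible arrangement under stated assumptions, not the only way to do it: the names output_base and ocr_folder are hypothetical, and the script assumes bash and that tesseract is on your PATH. Variables are quoted so that filenames containing spaces are passed to Tesseract intact.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the output stub for one input file by
# dropping its directory and its final extension.
#   $1 input file path, $2 output directory
output_base() {
  local input="$1" out_dir="$2"
  local name
  name="$(basename "$input")"
  printf '%s/%s\n' "$out_dir" "${name%.*}"
}

# Hypothetical helper: OCR every file with a given extension in one
# folder and write the results to another folder.
#   $1 input directory, $2 output directory, $3 extension (default: jpg)
ocr_folder() {
  local in_dir="$1" out_dir="$2" ext="${3:-jpg}"
  local file
  mkdir -p "$out_dir"
  for file in "$in_dir"/*."$ext"; do
    # If nothing matches, the pattern is left unexpanded; skip it.
    [ -e "$file" ] || continue
    # Tesseract appends .txt to the output stub by default.
    tesseract "$file" "$(output_base "$file" "$out_dir")"
  done
}

# Example call (both paths are placeholders):
#   ocr_folder /path/to/scans /path/to/text png
```

Writing the results to a separate folder keeps the OCR output apart from the source images, which makes it easier to rerun the conversion with different parameters.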