Now that you've installed all the packages you need, we can manipulate and convert the files. Because Tesseract is used to add a text layer, it is best to first check whether a text layer is already present. We can check this with Xpdf's pdftotext command, which writes any text it finds to a .txt document. This is also a helpful tool if you just want to extract the text from a file.
In the terminal, enter this command, replacing the path with the location of your document:
pdftotext /Path/to/document/verweij_2015.pdf verweij_2015.txt
Note: To find the path of the document, you can also drag the file into the terminal window and the path will be filled in for you.
This will output a text file named verweij_2015.txt. You can change the output name to whatever you want. As you can see, this PDF already has text embedded.
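If you have many files to check, opening each .txt output by hand gets tedious. The following is a minimal sketch of how the check could be scripted, not part of the original workflow. The function name has_text_layer is hypothetical, and the sketch assumes your pdftotext accepts "-" as the output name to print the text to the terminal instead of writing a file.

```shell
# has_text_layer FILE
# Succeeds (exit status 0) if pdftotext extracts any non-whitespace text from FILE.
# The name has_text_layer is made up for this illustration.
has_text_layer() {
    # "-" sends the extracted text to standard output instead of a .txt file;
    # grep -q then checks for at least one character that is not whitespace
    pdftotext "$1" - 2>/dev/null | grep -q '[^[:space:]]'
}
```

It could then be used like this:
has_text_layer /Path/to/document/verweij_2015.pdf && echo "text layer found"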
To see what happens when a file does not have text embedded, type into the terminal:
pdftotext /Path/to/document/prehealth_reqs.pdf prehealth_reqs.txt
The file will come up blank. Because this PDF does not have embedded text, it needs to be converted to a TIFF file before Tesseract can extract the text. Converting the document is simple; just enter:
convert /Path/to/document/prehealth_reqs.pdf prehealth_reqs.tiff
Some image adjustments can also be made during conversion to improve the quality of the TIFF file:
convert -density 300 /Path/to/document/prehealth_reqs.pdf -depth 8 -strip -background white -alpha off prehealth_reqs.tiff
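Here is what each option does:
-density 300 sets the resolution at which the PDF is rendered to 300 DPI
-depth 8 stores the image with 8 bits per color channel
-strip removes embedded profiles and comments from the image
-background white sets the background color to white
-alpha off turns off the transparency (alpha) channel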
Again, other names can be used for the output. To convert a PNG or JPEG, the same command can be used as long as the extension of the input file is changed.
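To avoid retyping the output name for every file, the conversion can be wrapped in a small shell function. This is only a sketch under stated assumptions: the names tiff_name_for and to_tiff are hypothetical, the options are the ones from the convert command above, and the TIFF is written to the current directory.

```shell
# tiff_name_for FILE
# Prints FILE's base name with its extension replaced by .tiff
# (hypothetical helper for this illustration).
tiff_name_for() {
    name=$(basename "$1")
    printf '%s.tiff\n' "${name%.*}"
}

# to_tiff FILE
# Converts FILE (PDF, PNG, or JPEG) to a TIFF in the current directory,
# reusing the options from the convert command shown in the text.
to_tiff() {
    convert -density 300 "$1" -depth 8 -strip -background white -alpha off "$(tiff_name_for "$1")"
}
```

For example, the following would write prehealth_reqs.tiff to the current directory:
to_tiff /Path/to/document/prehealth_reqs.pdf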
Note: If the input PDF has multiple pages, the result will be a single multi-page TIFF file, with each page of the original PDF stored as a separate TIFF layer.
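If you would rather have one TIFF file per page, ImageMagick can split a multi-page TIFF. The sketch below is an optional extra, not a required step. The function name split_tiff_pages is hypothetical, and it assumes the +adjoin option and a %d placeholder in the output name behave as described in the ImageMagick documentation.

```shell
# split_tiff_pages FILE
# Writes each page of a multi-page TIFF to its own file in the current directory
# (hypothetical helper for this illustration).
split_tiff_pages() {
    base=$(basename "$1")
    base=${base%.*}
    # +adjoin asks ImageMagick to write one output file per page;
    # %d in the output name is replaced by the page index
    convert "$1" +adjoin "${base}-%d.tiff"
}
```

For example:
split_tiff_pages prehealth_reqs.tiff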