Hi,
I am working on a project to OCR text from TIFF images. The code below works fine on a single image, but I am looking for a way to process images in batches from their respective subfolders and write the OCR output in hOCR (.hocr) format.
Example:
There are several subfolders on the D drive containing TIFF images. Each image needs to go through OCR one by one, and the output should be written to the E drive using the same directory tree as on the D drive:
D:\subfolder\Subfolder1\<TIFF image> to E:\subfolder\Subfolder1\<hOCR file>
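To make the mapping concrete, here is a minimal sketch of the path translation I have in mind. The function name `hocr_target` and the file name `page.tiff` are made up for illustration, and `PureWindowsPath` is used only so the example behaves the same on any operating system.

```python
from pathlib import PureWindowsPath


def hocr_target(tiff_path, src_root, dst_root):
    """Map a TIFF under src_root to the matching .hocr path under dst_root.

    hocr_target is a hypothetical helper name, not part of any library.
    """
    # Path of the image relative to the source root, e.g. subfolder\Subfolder1\page.tiff
    relative = PureWindowsPath(tiff_path).relative_to(PureWindowsPath(src_root))
    # Same relative location under the destination root, with the extension swapped
    return PureWindowsPath(dst_root) / relative.with_suffix(".hocr")


# page.tiff is a made-up file name for illustration
print(hocr_target(r"D:\subfolder\Subfolder1\page.tiff", "D:\\", "E:\\"))
```

This only computes the output path; it does not touch the disk or run OCR.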
Please suggest how to tweak the code to achieve this.
My code:
from PIL import Image
import pytesseract

# Point pytesseract at the Tesseract executable
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe"

image = Image.open(r"C:\Users\multipage.tiff")
config = "--oem 3 --psm 6"
txt = ''
# OCR each frame (page) of the multi-page TIFF and collect the text
for frame in range(image.n_frames):
    image.seek(frame)
    txt += pytesseract.image_to_string(image, config=config, lang='eng') + '\n'
print(txt)

# Save the combined text of all pages
with open(r"C:\Users\multipage_output.txt", mode='w') as f:
    f.write(txt)

Thanks!
Joe
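
One possible direction for the batch case, as a minimal untested sketch rather than a definitive solution. The names `ocr_tiff_to_hocr` and `batch_ocr` are made up for illustration. It assumes `pytesseract.image_to_pdf_or_hocr` with `extension="hocr"` for the hOCR output, and it assumes that passing the file path (instead of a PIL image) lets Tesseract read all pages of a multi-page TIFF by itself, which is worth verifying on a real file.

```python
import os
from pathlib import Path


def ocr_tiff_to_hocr(tiff_path, config="--oem 3 --psm 6", lang="eng"):
    """Return the hOCR output for one TIFF file as bytes (hypothetical helper)."""
    # Imported lazily so the directory walk below can be tried without Tesseract installed
    import pytesseract
    # Assumption: given a file path, Tesseract handles multi-page TIFFs on its own
    return pytesseract.image_to_pdf_or_hocr(
        str(tiff_path), extension="hocr", config=config, lang=lang
    )


def batch_ocr(src_root, dst_root, ocr=ocr_tiff_to_hocr):
    """Walk src_root, OCR every TIFF, and mirror the folder tree under dst_root."""
    src_root, dst_root = Path(src_root), Path(dst_root)
    written = []
    for folder, _dirs, files in os.walk(src_root):
        for name in sorted(files):
            # Skip anything that is not a TIFF
            if not name.lower().endswith((".tif", ".tiff")):
                continue
            source = Path(folder) / name
            # Same relative location under dst_root, with a .hocr extension
            target = (dst_root / source.relative_to(src_root)).with_suffix(".hocr")
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_bytes(ocr(source))
            written.append(target)
    return written


# Example call for the layout described above:
# batch_ocr(r"D:\subfolder", r"E:\subfolder")
```

The `tesseract_cmd` line from the code above would still need to be set before calling `batch_ocr`. The `ocr` parameter exists only so the folder walk can be checked with a stand-in function before running real OCR.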
