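I want to save the output data to a text file so that each line of text appears on its own line. Currently the lines are separated by a literal \n in the file, and I want each of them written on a separate line instead.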
from PIL import Image
import pytesseract
import sys
from pdf2image import convert_from_path
import os
PDF_file = "F:/ABC/Doc_1.pdf"
pages = convert_from_path(PDF_file, 500)
image_counter = 1
for page in pages:
    filename = "page_"+str(image_counter)+".jpg"
    page.save(filename, 'JPEG')
    image_counter = image_counter + 1
filelimit = image_counter-1
outfile = "F:/ABC/intermediate_steps/out_text.txt"
f = open(outfile, "a")
for i in range(1, 2):
    filename = "page_"+str(i)+".jpg"
    import pytesseract
    pytesseract.pytesseract.tesseract_cmd = r"\ABC\opencv-text-detection\Tesseract-OCR\tesseract.exe"
    from pytesseract import pytesseract
    text = str(((pytesseract.image_to_string(Image.open(filename)))))
    text = text.replace('-\n', '')
    #text = text.splitlines()
    f.writelines("Data Extracted from next page starts now.")
    f.writelines(str(text.encode('utf-8')))
f.close()
For example, expected output:
ABC
DEF
GHI
Current output:
ABC\nDEF\nGHI\n
Answer 0 (score: 0)
When you do
f.writelines(str(text.encode('utf-8')))
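you convert the newline \n into its escaped form \\n. In Python 3, text.encode('utf-8') returns a bytes object, and calling str() on that object gives its printable representation, in which every newline shows up as the two literal characters \ and n. You should just use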
f.writelines(text)
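
To make the difference concrete, here is a minimal sketch that does not need Tesseract. It assumes Python 3; the string `text` is a stand-in for the OCR result, and `write_page` is a hypothetical helper, not part of the original code. It writes with `f.write` and an explicit `encoding="utf-8"` on `open`, which should make the manual `encode` call unnecessary.

```python
import os
import tempfile

# Stand-in for the string returned by pytesseract.image_to_string (assumption).
text = "ABC\nDEF\nGHI\n"

# str() of a bytes object returns its repr: a b'...' wrapper is added and
# each newline becomes the two characters backslash + n.
escaped = str(text.encode("utf-8"))


def write_page(path, page_text):
    """Hypothetical helper: append one page of text to a UTF-8 text file."""
    with open(path, "a", encoding="utf-8") as f:
        f.write("Data Extracted from next page starts now.\n")
        f.write(page_text)  # real newlines are written as line breaks


# Write to a temporary location so the sketch does not touch real files.
out_path = os.path.join(tempfile.mkdtemp(), "out_text.txt")
write_page(out_path, text)

with open(out_path, encoding="utf-8") as f:
    lines = f.read().splitlines()

print(escaped)
print(lines)
```

`f.writelines(text)` also works, because it writes the string one character at a time; `f.write(text)` just states the intent more directly. The `with` block closes the file even if an error occurs, so no separate `f.close()` is needed.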