I've worked with several well-known Python packages for PDF files, such as PDFMiner, PyMuPDF, and PyPDF2, but none of them extracts text correctly from PDF files written in right-to-left languages such as Persian and Arabic.
For example:
import fitz  # PyMuPDF
doc = fitz.open("*/path/to/file.pdf")
# Older PyMuPDF releases spell this doc.getPageText(0).
txt = doc.load_page(0).get_text()
print(txt)
It returns something like this:
...
اﯾﻨﺘﺮﻧﺖ و ﮐﺎﻣﭙﯿﻮﺗﺮ ﺑﻪ ﻣﺴﻠﻂ
ﻣﺴﻠﻂ ﻫﺎیزﺑﺎن
...
Sometimes the characters of a word come out reversed (the first character comes last) and the words in a sentence are swapped; sometimes the words come out correctly. In either case, the library does not handle the zero-width non-joiner (نیمفاصله), which is commonly used in Persian.
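Part of what makes the sample output look wrong is that the letters are stored as Arabic presentation forms (positional glyph variants) rather than as the base letters. A minimal sketch, assuming only the standard library: Unicode NFKC normalization maps those glyphs back to base letters. The helper name `normalize_presentation_forms` is hypothetical, and this step does not fix the ordering of characters or words.

```python
import unicodedata


def normalize_presentation_forms(text: str) -> str:
    """Map Arabic presentation-form glyphs back to their base letters."""
    # NFKC applies compatibility decompositions, which cover the Arabic
    # Presentation Forms blocks (U+FB50-U+FDFF and U+FE70-U+FEFF).
    return unicodedata.normalize("NFKC", text)


# Initial seen, lam-alef ligature, isolated meem (presentation forms).
shaped = "\uFEB3\uFEFC\uFEE1"
plain = normalize_presentation_forms(shaped)
print([hex(ord(ch)) for ch in plain])
```

NFKC leaves a zero-width non-joiner (U+200C) in place if the extractor emitted one, but it also rewrites other compatibility characters (for example Latin ligatures and full-width forms), so it may be worth applying only to lines that contain Arabic-script text.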
I have tried a lot but got nowhere. Thanks in advance for your help.
Answer 0 (score: 0)
I ran into this problem and wrote the following code:
import fitz  # PyMuPDF

input_file = "p.pdf"
line_list = []

doc = fitz.Document(input_file)
# Older PyMuPDF releases spell these pageCount, loadPage and getText.
for i in range(doc.page_count):
    page = doc.load_page(i)
    text = page.get_text()  # read the page as plain text
    line_list.append(text.splitlines())  # one list of lines per page

for page_lines in line_list:
    for k, line in enumerate(page_lines):
        page_lines[k] = line[::-1]  # reverse the line character by character
        print(page_lines[k])
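However, this package has two problems. 1) It reverses words (for example, "سلام" -> "مالس"), which I solved in this code. 2) It has problems with multilingual documents, such as those mixing Persian and English.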
I hope
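For the second problem, reversing a whole line also reverses any English words or numbers in it. A rough sketch of one way around that, assuming the extractor returns characters in visual (left-to-right on screen) order: reverse the line, then flip each left-to-right run back. The function names here are hypothetical, and this is a simplified heuristic, not the full Unicode Bidirectional Algorithm.

```python
import unicodedata


def is_rtl_char(ch: str) -> bool:
    # "R" (e.g. Hebrew) and "AL" (Arabic letters) are strong RTL classes.
    return unicodedata.bidirectional(ch) in {"R", "AL"}


def is_ltr_char(ch: str) -> bool:
    # Strong LTR letters plus European and Arabic-Indic digits,
    # which are displayed left to right even inside RTL text.
    return unicodedata.bidirectional(ch) in {"L", "EN", "AN"}


def visual_to_logical(line: str) -> str:
    """Reverse an RTL line but keep LTR runs readable."""
    if not any(is_rtl_char(ch) for ch in line):
        return line  # no RTL text: leave the line untouched
    chars = list(line[::-1])
    i, n = 0, len(chars)
    while i < n:
        if not is_ltr_char(chars[i]):
            i += 1
            continue
        # Scan up to the next RTL character, remembering the last LTR one,
        # so spaces between English words stay inside the run.
        j, last_ltr = i, i
        while j < n and not is_rtl_char(chars[j]):
            if is_ltr_char(chars[j]):
                last_ltr = j
            j += 1
        chars[i:last_ltr + 1] = chars[i:last_ltr + 1][::-1]
        i = j
    return "".join(chars)


# Visual order: "PDF", a space, then the reversed Persian word.
mixed = "PDF \u0645\u0627\u0644\u0633"
print(visual_to_logical(mixed))  # Persian word first, then "PDF"
```

Lines without any right-to-left characters are returned unchanged, so purely English lines in the same document are not reversed.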