Question

I've worked with several well-known Python packages for PDF files, such as PDFMiner, PyMuPDF, and PyPDF2, but none of them extracts text correctly from PDF files written in right-to-left languages (Persian, Arabic).

For example:

import fitz  # PyMuPDF
doc = fitz.open("*/path/to/file.pdf")
txt = doc.get_page_text(0)  # spelled getPageText in older PyMuPDF releases
print(txt)

It returns something like this:

...

اﯾﻨﺘﺮﻧﺖ و ﮐﺎﻣﭙﯿﻮﺗﺮ ﺑﻪ ﻣﺴﻠﻂ

ﻣﺴﻠﻂ ﻫﺎیزﺑﺎن

...

Sometimes the characters of a word come out reversed (first character last) and the words of a sentence are swapped; sometimes the words come out correctly. The extraction also does not handle the zero-width non-joiner (نیم‌فاصله), which is commonly used in Persian.
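To see what the extractor really returned, it can help to inspect the code points. The sketch below uses only the standard library; the helper names `describe`, `normalize_forms`, and `has_zwnj` are made up for illustration. It relies on the fact that NFKC normalization folds Arabic presentation forms (the positional glyph variants visible in the output above) back to the base letters. It does not repair character or word order.

```python
import unicodedata

ZWNJ = "\u200c"  # zero-width non-joiner (نیم‌فاصله)

def describe(text):
    """Return (code point, Unicode name) pairs, one per character."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")) for ch in text]

def normalize_forms(text):
    """Fold Arabic presentation forms to base letters via NFKC."""
    return unicodedata.normalize("NFKC", text)

def has_zwnj(text):
    """Report whether the extracted text still contains a ZWNJ."""
    return ZWNJ in text

# "ﺑﻪ" as it appears in the extracted output: BEH initial form + HEH final form
extracted = "\ufe91\ufeea"
for code_point, name in describe(extracted):
    print(code_point, name)
print(describe(normalize_forms(extracted)))
print(has_zwnj(extracted))
```

If `has_zwnj` reports no ZWNJ on a page that visibly uses one, the character was probably lost during extraction rather than during printing, although that is only an inference from this check.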

I tried a lot, but came to nothing. Thanks in advance for your help.

Answer 1

I ran into this problem and wrote the following code:

import fitz  # PyMuPDF

input_file = "p.pdf"
line_list = []

doc = fitz.open(input_file)
page_count = doc.page_count  # older PyMuPDF releases: doc.pageCount

for i in range(page_count):
    page = doc.load_page(i)  # older releases: doc.loadPage(i)
    text = page.get_text()  # read the page as plain text (older releases: getText)
    line_list.append(text.splitlines())  # split every page on \n

for lines in line_list:
    # Reverse the first three lines of each page (fewer if the page is shorter)
    for k in range(min(3, len(lines))):
        lines[k] = lines[k][::-1]  # undo the reversed character order
        print(lines[k])

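However, this package has two problems: 1) it reverses words (for example, "سلام" -> "مالس"), which I worked around in this code; 2) it has trouble with multilingual documents, such as ones that mix Persian and English.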
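For lines that mix Persian and English, reversing the whole line also reverses the Latin words. One possible workaround, sketched here as a heuristic and not as the Unicode Bidirectional Algorithm, is to reverse the line and then flip each left-to-right run back. The function name `visual_to_logical` and the two class sets are assumptions made for this sketch. It also assumes the line was extracted in visual order, which (as noted in the question) is not always the case, and sequences of several separate numbers may still come out in the wrong order.

```python
import unicodedata

# Bidi classes treated as left-to-right runs (letters and digits): an assumption
LTR_CLASSES = {"L", "EN", "AN"}
# Strong right-to-left classes (Hebrew, Arabic, Persian letters)
RTL_CLASSES = {"R", "AL"}

def visual_to_logical(line):
    """Reverse a visually ordered line, then restore its LTR runs."""
    chars = list(line[::-1])
    n = len(chars)
    i = 0
    while i < n:
        if unicodedata.bidirectional(chars[i]) in LTR_CLASSES:
            # Extend the run over neutrals (spaces, punctuation) until an
            # RTL character appears, remembering the last LTR character seen.
            end = i
            j = i
            while j < n and unicodedata.bidirectional(chars[j]) not in RTL_CLASSES:
                if unicodedata.bidirectional(chars[j]) in LTR_CLASSES:
                    end = j
                j += 1
            chars[i:end + 1] = chars[i:end + 1][::-1]  # flip the run back
            i = end + 1
        else:
            i += 1
    return "".join(chars)

salam = "\u0633\u0644\u0627\u0645"  # "سلام" in logical order
extracted_line = "hello world " + salam[::-1]  # what a visual-order extraction yields
print(visual_to_logical(extracted_line))
```

In the answer's loop this would replace the plain `[::-1]` slice; for a purely Persian line the two behave the same.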

I hope this helps.

Is there any Python package for extracting text cleanly from PDFs in RTL languages?