Question

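I have a PDF file over 100 pages long. It contains text boxes and columns. When I extract the text with PyPDF2 or the tika parser, I get a string of out-of-order data. In many cases it is ordered by column; in others it jumps around the document. Is it possible to read the PDF starting at the top and moving left to right until the bottom? I want to read the text in the columns and boxes, but I want the lines of text displayed as they would be read from left to right.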

I have tried:

PyPDF2: the only tool is extractText(). It is fast but leaves no gaps between elements. The result is a jumble.

Pdfminer: PDFPageInterpreter() with LAParams. This works well but is slow: at least 2 seconds per page, and I have 200 pages.

pdfrw: this only tells me the number of pages.

tabula_py: only gives me the first page. Maybe I am not looping correctly.

tika: what I am currently using. Fast and more readable, but the content is still out of order.

from tkinter import filedialog
import os
from tika import parser

# Select the PDF file to parse
file_path = filedialog.askopenfilename(initialdir=os.getcwd(), filetypes=[("PDF files", "*.pdf")])
print(file_path)  # print the selected path
file_data = parser.from_file(file_path)  # parse the file with tika
text = file_data['content']  # get the file's text content
# Split the document into pages on a string that always appears at the top of each page
by_page = text.split('... Information')

for i in range(1, len(by_page)):  # loop page by page; by_page[0] is the text before the first header
    info = by_page[i]  # one page's worth of text
    reformatted = info.replace("\n", "&")  # replace newlines with "&" to make the output more readable
    print("Page: ", i)  # print the page number
    print(reformatted, "\n\n")  # print the page's text

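This produces some output, but it is not ordered the way I want. I want the PDF read from left to right. Also, a pure Python solution would be a bonus. I don't want my end users to be forced to install Java (I believe the tika and tabula-py approaches depend on Java).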
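To make the ordering I am after concrete, here is a minimal sketch. It assumes the extractor can report each text fragment together with its position on the page. `Fragment`, `reading_order_lines`, and `y_tolerance` are hypothetical names made up for illustration, not part of any library mentioned above, and the sample coordinates are invented. As far as I know, pdfminer reports bounding boxes with the origin at the bottom-left of the page, so its y values would have to be flipped before being used as `top` here.

```python
from collections import namedtuple

# Hypothetical container for one text fragment: x0 is the left edge and
# top is the distance from the top of the page (both in points).
Fragment = namedtuple("Fragment", ["x0", "top", "text"])


def reading_order_lines(fragments, y_tolerance=3.0):
    """Group fragments into visual lines, top to bottom, left to right."""
    lines = []  # each entry: [top of the first fragment in the line, [fragments]]
    for frag in sorted(fragments, key=lambda f: f.top):
        # A fragment joins the current line if it sits at about the same height.
        if lines and abs(frag.top - lines[-1][0]) <= y_tolerance:
            lines[-1][1].append(frag)
        else:
            lines.append([frag.top, [frag]])
    # Within each line, order the fragments by their left edge.
    return [
        " ".join(f.text for f in sorted(row, key=lambda f: f.x0))
        for _, row in lines
    ]


# Two columns on one page, listed column by column the way a
# column-ordered extractor would return them (made-up coordinates).
sample = [
    Fragment(50, 100, "Left line 1"),
    Fragment(50, 115, "Left line 2"),
    Fragment(300, 101, "Right line 1"),
    Fragment(300, 116, "Right line 2"),
]
for line in reading_order_lines(sample):
    print(line)
```

The tolerance is there because fragments on the same visual line rarely share exactly the same vertical coordinate. The right value probably depends on the font size in the document.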

Answer 1

I did this for a .docx with the code below, where txt is the .docx content. Hope this helps: link

import re
pttrn = re.compile(r'(\.|\?|\!)(\'|\")?\s')
new = re.sub(pttrn, r'\1\2\n\n', txt)

print(new)
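For illustration, here is the same substitution run on a short sample string. The sample `txt` is made up and only stands in for the extracted document text. Note that this splits on sentence endings, so it is a sketch of sentence-level splitting rather than of recovering the page layout.

```python
import re

# Same pattern as above: sentence-ending punctuation, an optional closing
# quote, then one whitespace character.
pttrn = re.compile(r'(\.|\?|\!)(\'|\")?\s')

# Hypothetical sample text standing in for the extracted document text.
txt = 'He asked "Why?" Then he left. It was late.'

# Keep the punctuation and the quote, and replace the whitespace with a blank line.
new = pttrn.sub(r'\1\2\n\n', txt)
print(new)
```

The final sentence is left untouched because the pattern requires a whitespace character after the punctuation.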
