How can I use Python to parse text extracted from a PDF file using delimiters?

Time: 2017-09-24 10:51:05

Tags: python parsing pdf pdf-parsing pypdf2

I have tried to extract and parse text from a PDF with PyPDF2 using the following code snippet:

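import re

import PyPDF2

# Open the PDF in binary mode and extract the text of one page.
# getPage() requires a page index; 0 (the first page) is assumed here.
with open('test.pdf', 'rb') as pdfFileObj:
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    rawText = pdfReader.getPage(0).extractText()

# Split the extracted text on newlines and tabs
extractedText = re.split(r'\n|\t', rawText)
print("Extracted Text: " + str(extractedText) + "\n")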

Case 1: When I try to parse the PDF text, I cannot split it the way it is laid out in the PDF. For example,

[image]

In this case, no newline or line break can be found in either rawText or extractedText, and the result looks like this -

    input field, your old automation script will try to submit a form with missing data unless you update it.Another common case is asserting that a specific error message appeared and then updating the error message, which will also break the script.
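One possible workaround for output like the above, where the paragraph break has vanished and two sentences are fused ("update it.Another"), is to re-insert a break wherever a period is directly followed by a capital letter. This is only a heuristic sketch: the function name `restore_sentence_breaks` is hypothetical, and the rule assumes that a lowercase letter, a period, and an uppercase letter with no space between them always mark a lost break, which will not hold for every document.

```python
import re


def restore_sentence_breaks(text):
    """Insert a newline between a sentence-ending period and the next
    capitalised word when the extractor dropped the whitespace.

    Heuristic (an assumption, not a PyPDF2 feature): a lowercase letter
    followed by '.' and then immediately by an uppercase letter is
    treated as a lost line break.
    """
    return re.sub(r'(?<=[a-z]\.)(?=[A-Z])', '\n', text)


raw_text = ("unless you update it.Another common case is asserting "
            "that a specific error message appeared")
for line in restore_sentence_breaks(raw_text).split('\n'):
    print(line)
```

Text that already has a space after the period is left unchanged, so the function is safe to run on correctly extracted pages as well.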

Case 2: For the following case,

[image]

the result is -

2B. Community Living5710509-112C. Lifelong Learning69116310-122D. Employment5710509-11

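Here it is even harder to parse and separate the individual scores. Is it possible to parse these scenarios correctly with PyPDF2 or any other Python library?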
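For the Case 2 string, the rows at least can be separated with a regular expression, because each row starts with a label such as "2B. ". The sketch below assumes every label is exactly one digit, one uppercase letter, a period, and a space, and that row names contain no digits; the function name `split_rows` is hypothetical. It deliberately leaves the run of digits in each row unsplit, since the column boundaries inside something like "5710509-11" cannot be recovered from the text alone.

```python
import re

# One row: label (e.g. "2B"), ". ", a name without digits, then a run of
# digits and hyphens that ends where the next label starts or at the end.
ROW_PATTERN = re.compile(r'(\d[A-Z])\. (\D+?)([\d-]+?)(?=\d[A-Z]\. |$)')


def split_rows(raw):
    """Return a list of (label, name, fused_scores) tuples."""
    return [match.groups() for match in ROW_PATTERN.finditer(raw)]


raw = ("2B. Community Living5710509-11"
       "2C. Lifelong Learning69116310-12"
       "2D. Employment5710509-11")

for label, name, scores in split_rows(raw):
    print(label, name, scores, sep=' | ')
```

Separating the individual scores would need positional information (the x-coordinates of each text fragment), which this plain-string approach does not have.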

0 answers:

No answers