Question

When I try to read a PDF file containing data with the code below, there are no spaces between two columns or rows.

import PyPDF2
# Open the PDF in binary mode
pdfFileObj = open('filename.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
pdfReader.numPages  # number of pages in the document
pageObj = pdfReader.getPage(0)  # first page
pageObj.extractText()

The output looks like this:

'Page 1 of 1MINISTRY OF CORPORATE AFFAIRSRECEIPTG.A.R.7SRN :U16571275Payment 
made into :Service Request Date :03/08/2017Received From :'

Expected: a space after "1", after "G.A.R.7", and between "U16571275" and "Payment".

Answer 1

The extractText() method returns the page's text as a string, and the extraction is sometimes imperfect.
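
For readers on a recent PyPDF2 release: as far as I know, version 3.0 removed the camelCase names used in the question (PdfFileReader, getPage, extractText) in favor of PdfReader, reader.pages, and extract_text(). Below is a minimal sketch of the same read with the newer names. The helper name first_page_text is made up for illustration, and the newer call is not guaranteed to restore the missing spaces.

```python
def first_page_text(path):
    """Return the text of the first page of the PDF at `path`.

    Assumes a PyPDF2 release that provides PdfReader.
    """
    # Imported here so the sketch can be loaded without PyPDF2 installed.
    from PyPDF2 import PdfReader

    reader = PdfReader(path)      # accepts a file path or a binary stream
    page = reader.pages[0]        # replaces getPage(0)
    return page.extract_text()    # replaces extractText()
```

Calling first_page_text('filename.pdf') would return the same kind of string as the question's last line, so the output can be compared directly.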

If you are trying to read PDF files in Python, you can also try the textract module: http://textract.readthedocs.io/en/stable/index.html

pip install textract

After installing:

import textract
text = textract.process('path/to/pdf/file', method='pdfminer')
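
One detail worth knowing: under Python 3, textract.process returns bytes rather than str, so the result usually needs decoding before any string handling. Here is a small sketch, assuming the output is UTF-8 (textract's default encoding, as far as I know). The names decode_extracted and pdf_to_text are illustrative and not part of textract.

```python
def decode_extracted(raw, encoding="utf-8"):
    """Turn the bytes returned by an extractor into a str."""
    # errors="replace" substitutes a marker for bytes that are invalid in `encoding`.
    return raw.decode(encoding, errors="replace")


def pdf_to_text(path):
    """Extract text from the PDF at `path` with textract's pdfminer backend."""
    # Imported here so the sketch loads even when textract is not installed.
    import textract

    raw = textract.process(path, method="pdfminer")  # same call as in the answer
    return decode_extracted(raw)
```

With this, print(pdf_to_text('path/to/pdf/file')) shows readable text instead of a b'...' literal.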

PyPDF2 file reader returns data without spaces
