Question

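Hello community members,

I want to extract all the text from an e-book with a .pdf file extension. I know Python has a package, PyPDF2, that can do this. I tried it and managed to extract text, but the extracted words have incorrect spacing, and sometimes 2-3 words come out merged into one.

I also want to start extracting from page 3, because the first pages are the cover and the preface. I don't want to include the last 5 pages either, because they contain the glossary and the index.

Is there any other way to read an unencrypted .pdf binary file?

The code snippet is shown below.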

import PyPDF2

def Read():
    # Open in binary mode; PyPDF2 reads from a file object opened with 'rb'.
    pdfFileObj = open('F:\\Pen Drive 8 GB\\PDF\\Handbooks\\book1.pdf', 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    # Knowing the number of pages lets us loop over every page.
    num_pages = pdfReader.numPages
    count = 0
    global text
    text = []
    while count < num_pages:
        pageObj = pdfReader.getPage(count)
        count += 1
        text += pageObj.extractText().split()
    pdfFileObj.close()
    print(text)

Read()

Answer 1

Here is one possible solution:

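import PyPDF2

def Read(startPage, endPage):
    # Uses the pre-3.0 PyPDF2 API (PdfFileReader, getPage, extractText);
    # PyPDF2 3.x renamed these to PdfReader, reader.pages[i], extract_text().
    global text
    rawText = ""
    with open('myTest2.pdf', 'rb') as pdfFileObj:
        pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
        # Page indices are 0-based; endPage is inclusive.
        for pageNumber in range(startPage, endPage + 1):
            rawText += pdfReader.getPage(pageNumber).extractText()
    # Drop the newlines embedded in the extracted text, then split on whitespace.
    cleanText = rawText.replace('\n', '')
    text = cleanText.split()
    print(text)

Read(0, 0)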

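Read() parameters: Read(first page to read, last page to read)

Note: page numbering starts at 0, not 1 (as with array indices), so the first page of the PDF is page 0.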
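Since the question asks to start at page 3 and leave out the last 5 pages, here is a minimal sketch of how those numbers could be turned into the 0-based, inclusive arguments that Read() expects. The helper name pages_to_read and its parameters are hypothetical (they are not part of PyPDF2), and it assumes the total page count is already known, for example from pdfReader.numPages.

```python
def pages_to_read(num_pages, first_page=3, skip_last=5):
    """Convert a 1-based first page and a count of trailing pages to skip
    into 0-based, inclusive (startPage, endPage) indices.

    Hypothetical helper for illustration; not part of PyPDF2.
    """
    start = first_page - 1            # page 3 is index 2
    end = num_pages - skip_last - 1   # last index before the skipped tail
    if start < 0 or end < start:
        raise ValueError("no pages left to read")
    return start, end


# For a 100-page book: start at page 3, drop the last 5 pages.
start, end = pages_to_read(100)
print(start, end)  # 2 94
```

The two values can then be passed straight to Read(startPage, endPage). The check at the end is there because a short PDF could have fewer pages than the cover, preface, glossary and index combined.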
