Question

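I'm having trouble using pypdf to count how many times a specific word occurs in a PDF file.

My code registers at most one occurrence of the word per page, so the highest count it can report is the number of pages. The result for the word "the" should be 700, but it shows only 30 (the file has 30 pages in total).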

import PyPDF3
import re
def read_pdf(file,string):
    fils = file.split(".")
    print(fils[1])
    word = string
    if fils[1] == "pdf":
        pdfFileObj = open(file,"rb")
        # open the pdf file
        object = PyPDF3.PdfFileReader(file)
        # get number of pages
        NumPages = object.getNumPages()

        # define keyterms
        counter = 0
        # extract text and do the search
        for i in range(NumPages):
            PageObj = object.getPage(i)
            print("page " + str(i))
            Text = PageObj.extractText()
            #print(Text)
            if word in Text:
                print("The word is on this page")
                counter += 1
        print(word, "exists", counter, "times in the file")

Can you see what I'm doing wrong and help me?

Thanks :)

Answer 1

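Your check, if word in Text, only tests whether the word appears on a page at all, so the counter goes up by at most 1 per page. What you need to do instead is collect all the words from all the pages into one list.
Once you have the word list, you can use Counter, which gives you every word in the PDF together with its count.

Example: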

from collections import Counter

pdf_words = ['the','fox','the','jack']

counter = Counter(pdf_words)
print(counter)

Output:

Counter({'the': 2, 'fox': 1, 'jack': 1})
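
To tie this back to the page loop in the question, here is a minimal sketch of the same idea applied to several pages at once. It assumes you already have the text of each page as a string (for example, what PageObj.extractText() returns); the helper name word_frequencies and the sample page strings are made up for illustration. Splitting with a regular expression and lowercasing the text are choices made here, not something the original code does.

```python
import re
from collections import Counter


def word_frequencies(page_texts):
    """Count every word across all pages, ignoring case."""
    counts = Counter()
    for text in page_texts:
        # \w+ matches runs of letters, digits and underscores, so punctuation
        # is dropped; lowercasing merges "The" and "the" into one entry.
        counts.update(re.findall(r"\w+", text.lower()))
    return counts


# Stand-in for the per-page strings extracted from a PDF
# (assumption: invented sample text, not output from a real file).
pages = [
    "The fox saw the dog.",
    "Then the dog saw the fox, and the fox ran.",
]

frequencies = word_frequencies(pages)
print(frequencies["the"])  # 5
```

Because whole words are counted, "Then" is not counted as "the", which a plain substring test such as word in Text would match.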
