Question

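I am trying to write a Python script that searches a PDF file for specific words. Right now I have to scroll through the output to find the lines where they were found.

I would like only the lines containing the words to be printed or saved to a separate file.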

# import packages
import PyPDF2
import re

# open the pdf file
object = PyPDF2.PdfFileReader("Filename.pdf")

# get number of pages
NumPages = object.getNumPages()

# define keyterms
Strings = "House|Property|street"

# extract text and do the search
for i in range(0, NumPages):
    PageObj = object.getPage(i)
    print("this is page " + str(i)) 
    Text = PageObj.extractText() 
    # print(Text)
    ResSearch = re.search(Strings, Text)
    print(ResSearch)

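When I run the code above, I have to scroll through the output to find the lines where the words occur. I would like the lines containing the words to be printed or saved to a separate file, or the pages containing those lines to be saved to a separate PDF or txt file. Thanks in advance for your help.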

Answer 1

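After splitting each page's text into lines, you can test each line with re.search. (re.match only matches at the start of a line, so it would miss a word that appears mid-line.)

For example:

for i in range(0, NumPages):
    page = object.getPage(i)
    text = page.extractText()
    # check each line of the page separately
    for line in text.splitlines():
        if re.search('House|Property|street', line):
            print(line)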
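
To save the matching lines to a separate file instead of printing them, something like the following sketch might work. The helper names find_matching_lines and save_hits and the output file name matches.txt are made up for illustration; they are not part of PyPDF2. The functions only take the already extracted page texts, so the PDF reading stays as in the question.

```python
import re

# Same key terms as in the question; compiled once for reuse.
# Passing re.IGNORECASE as a second argument would also catch "house", "Street", etc.
PATTERN = re.compile(r"House|Property|street")


def find_matching_lines(page_texts, pattern=PATTERN):
    """Return (page_number, line) pairs for every line containing a key term.

    page_texts is a list of strings, one per page (hypothetical helper,
    page numbers start at 1).
    """
    hits = []
    for page_number, text in enumerate(page_texts, start=1):
        for line in text.splitlines():
            # search() finds the term anywhere in the line
            if pattern.search(line):
                hits.append((page_number, line.strip()))
    return hits


def save_hits(hits, out_path):
    """Write one 'page N: line' entry per hit to a UTF-8 text file."""
    with open(out_path, "w", encoding="utf-8") as fh:
        for page_number, line in hits:
            fh.write(f"page {page_number}: {line}\n")


# Possible usage with the reader from the question (assumes `object` and
# `NumPages` are defined as shown there):
#   pages = [object.getPage(i).extractText() for i in range(NumPages)]
#   save_hits(find_matching_lines(pages), "matches.txt")
```

Keeping the page number with each hit makes it easier to go back to the PDF later. Note that how well this works depends on whether extractText() preserves the line breaks of the PDF, which is not guaranteed for every file.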
