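Trying to loop through multiple PDF files and extract the text between two search terms

Time: 2018-07-31 20:36:27

Tags: python python-3.x

I'm trying to go through multiple PDF files, look at the text of each one, and extract the paragraph between "NOTE 1-ORGANIZATION" (start) and "NOTE 2-ORGANIZATION" (end). Each file has different text at this location, and I'd like to print the paragraph from each file or save the paragraphs to a text file.

Below is a small script I put together that opens one file, searches for a string, and prints the page on which that text is found. I think it's a decent start, but what I really want is to loop over many PDF files, find a specific body of text, and save everything found to a single text file.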

import PyPDF2
import re

# open the pdf file
object = PyPDF2.PdfFileReader("C:/my_path/file1.pdf")

# get number of pages
NumPages = object.getNumPages()

# define keyterms
String = "New York State Real Property Law"

# extract text and do the search
for i in range(0, NumPages):
    PageObj = object.getPage(i)
    print("this is page " + str(i)) 
    Text = PageObj.extractText() 
    # print(Text)
    ResSearch = re.search(String, Text)
    print(ResSearch)

Any insight into solving this would be appreciated!

1 Answer:

Answer 0 (score: 1)

If your files are named like file1.pdf, file2.pdf, and so on, you can use a for loop:

import PyPDF2
import re

for k in range(1,100):
    # open the pdf file
    object = PyPDF2.PdfFileReader("C:/my_path/file%s.pdf"%(k))

    # get number of pages
    NumPages = object.getNumPages()

    # define keyterms
    String = "New York State Real Property Law"

    # extract text and do the search
    for i in range(0, NumPages):
        PageObj = object.getPage(i)
        print("this is page " + str(i)) 
        Text = PageObj.extractText() 
        # print(Text)
        ResSearch = re.search(String, Text)
        print(ResSearch)

Otherwise, you can use the os module to walk through the folder:

import PyPDF2
import re
import os

for foldername,subfolders,files in os.walk(r"C:/my_path"):
    for file in files:
        # open the pdf file
        object = PyPDF2.PdfFileReader(os.path.join(foldername,file))

        # get number of pages
        NumPages = object.getNumPages()

        # define keyterms
        String = "New York State Real Property Law"

        # extract text and do the search
        for i in range(0, NumPages):
            PageObj = object.getPage(i)
            print("this is page " + str(i)) 
            Text = PageObj.extractText() 
            # print(Text)
            ResSearch = re.search(String, Text)
            print(ResSearch)
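
Note that os.walk yields every file under the folder, so the loop above would also hand non-PDF files to PyPDF2. As a minimal sketch (the helper name find_pdfs is hypothetical, not part of the original answer), the paths could be filtered by extension before they are opened:

```python
from pathlib import Path


def find_pdfs(root):
    """Return a sorted list of all .pdf paths under root (extension is case-insensitive)."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
```

Each returned path could then be passed to the reader in place of os.path.join(foldername, file), for example by looping over find_pdfs("C:/my_path").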

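Sorry if I have misunderstood your question.

Edit:

Unfortunately, I'm not familiar with the PyPDF2 module, but it seems that something odd happens when you convert PDF content with it (for example, extra line breaks or formatting changes).

This page may help: Extracting text from a PDF file using Python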
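
If extra line breaks are what stops the search string from matching, one possible workaround is to collapse all whitespace in the extracted text before searching. This is only a sketch under that assumption; normalize_whitespace is a hypothetical helper, and it will not help if the extraction drops or reorders characters:

```python
def normalize_whitespace(text):
    """Collapse every run of whitespace (spaces, tabs, newlines) into a single space."""
    return " ".join(text.split())


# Made-up sample that imitates text broken across lines by the extraction
broken = "New York\nState  Real\n Property Law"
print(normalize_whitespace(broken))
```

The normalized string can then be passed to re.search in place of the raw page text.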

However, if your files are .txt files, a regular expression will help:

import re
import os

# Non-greedy match from the start phrase to the end phrase; DOTALL lets "." span newlines
myRegex = re.compile(r"New York State Real Property Law.*?common elements of the property\.", re.DOTALL)
for foldername, subfolders, files in os.walk(r"C:/Users/Mirana/Me2"):
    for file in files:
        # the with-block closes each file as soon as it has been read
        with open(os.path.join(foldername, file)) as f:
            Text = f.read()
        for subText in myRegex.findall(Text):
            print(subText)
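
The same non-greedy pattern can be built from any pair of markers, such as the two headings in the question. The following is a rough sketch rather than a tested solution for your files: extract_between is a hypothetical helper, and the sample string is invented for illustration.

```python
import re


def extract_between(text, start, end):
    """Return every passage found between start and end, with the markers excluded."""
    # re.escape makes the markers match literally; .*? stops at the first end marker
    pattern = re.compile(re.escape(start) + r"(.*?)" + re.escape(end), re.DOTALL)
    return [passage.strip() for passage in pattern.findall(text)]


# Invented sample text; real input would be the text read from a file
sample = "intro NOTE 1-ORGANIZATION\nSome paragraph text.\nNOTE 2-ORGANIZATION rest"
print(extract_between(sample, "NOTE 1-ORGANIZATION", "NOTE 2-ORGANIZATION"))
# prints ['Some paragraph text.']
```

Escaping the markers matters when they contain characters such as "." or "(" that have a special meaning in regular expressions.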

I also adapted the PDF version, but because of the problem mentioned above it doesn't work, at least not for my PDFs (give it a try):

import PyPDF2
import re
import os

myRegex = re.compile(r"New York State Real Property Law.*?common elements of the property\.", re.DOTALL)
for foldername, subfolders, files in os.walk(r"C:/my_path"):
    for file in files:
        # open the pdf file
        reader = PyPDF2.PdfFileReader(os.path.join(foldername, file))

        # get number of pages
        NumPages = reader.getNumPages()

        # extract the text of every page and join it, so a match may span pages
        Text = ""
        for i in range(0, NumPages):
            PageObj = reader.getPage(i)
            print("this is page " + str(i))
            Text += PageObj.extractText()

        # search the whole document once, after all pages have been read
        for subText in myRegex.findall(Text):
            print(subText)
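
To save the matches to a single text file instead of printing them, as the question asks, something along these lines might work. It is a minimal sketch: save_passages and the "=== name ===" header format are assumptions made for illustration, not part of the original answer.

```python
def save_passages(passages_by_file, out_path):
    """Write each source file name, followed by its passages, to one text file."""
    with open(out_path, "w", encoding="utf-8") as out:
        for name, passages in passages_by_file.items():
            # a header line shows which file the following passages came from
            out.write("=== " + name + " ===\n")
            for passage in passages:
                out.write(passage + "\n\n")
```

In the loop above, the results of myRegex.findall(Text) could be collected in a dictionary keyed by file name and passed to save_passages once all files have been processed.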