How to ignore errors from PDFMiner so they don't break a Python script

Time: 2019-04-30 12:28:21

Tags: python python-3.x pdfminer

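I'm still quite new to file processing, PDFs in particular. I have installed PDFminer.six and tested some functions for extracting text from PDF files. I also have another function that takes a list of PDF files and calls the first extraction function to pull all the text out of each file.

The problem is that I have a lot of PDF files, and the script seems to break every time it runs into a new error. Whether it is an unrecognized character, a different encoding, encryption, or something else, it is hard to track down which PDF file caused the error.

How can I make the script keep running regardless of the type of error? Can I set the PDF extraction function to ignore errors of any kind? Or is there something missing from my code that would help with this?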

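import io
import re
import time
from pathlib import Path

import pandas as pd
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage

p = Path("C:/Users/Hugo Caldeira/Desktop")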
inp = r"((?<=|^)[0-9]{3}-[0-9]{2}-[0-9]{4}(?=|$))"

file_dict = {
    "name" : [],
    "created" : [],
    "modified" : [],
    'path' : [],
    'content' : [],
    'keyword' : []
}

files = list(p.rglob('*pdf'))

def pdfparser(file):
    rsrcmgr = PDFResourceManager()
    retstr = io.StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    # Create a PDF interpreter object.
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    # Open the file in a with block so it is closed even if parsing fails.
    with open(file, 'rb') as fp:
        # Process each page contained in the document.
        for page in PDFPage.get_pages(fp):
            interpreter.process_page(page)
    # Read the text once, after all pages have been processed; this also
    # keeps `data` defined when the document has no pages.
    data = retstr.getvalue()
    return data


def pdfs(files):
    for name in files:
        #print(name)
        IP_list = pdfparser(name)
        #print(IP_list)
        keyword = re.findall(inp, IP_list)
        #print(ip_test)
        file_dict['keyword'].append(keyword)
        file_dict['name'].append(name.name)
        file_dict['created'].append(time.ctime(name.stat().st_ctime))
        file_dict['modified'].append(time.ctime(name.stat().st_mtime))
        file_dict['path'].append(name)
        file_dict['content'].append(IP_list)
        #print(file_dict)
    return file_dict

pdfs(files)

def to_xlsx():
    df = pd.DataFrame.from_dict(file_dict)
    df.head()
    df.to_excel("pdftest.xlsx")

if __name__ == "__main__":
    to_xlsx()

The error I'm currently getting is:

Traceback (most recent call last):
  File "c:/Users/Hugo Caldeira/Desktop/Scripts/pdf.py", line 67, in <module>
    print(pdfparser(p))
  File "c:/Users/Hugo Caldeira/Desktop/Scripts/pdf.py", line 32, in pdfparser
    fp = open(file, 'rb')
FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\Hugo Caldeira\\Desktop\\test_folder\\Desktop'

(base) C:\Users\Hugo Caldeira\Desktop\Scripts>"C:/Users/Hugo Caldeira/Anaconda3/python.exe" "c:/Users/Hugo Caldeira/Desktop/Scripts/pdf.py"
Traceback (most recent call last):
  File "c:/Users/Hugo Caldeira/Desktop/Scripts/pdf.py", line 64, in <module>
    pdfs(files)
  File "c:/Users/Hugo Caldeira/Desktop/Scripts/pdf.py", line 52, in pdfs
    IP_list = (pdfparser(name))
  File "c:/Users/Hugo Caldeira/Desktop/Scripts/pdf.py", line 42, in pdfparser
    for page in PDFPage.get_pages(fp):
  File "C:\Users\Hugo Caldeira\Anaconda3\lib\site-packages\pdfminer\pdfpage.py", line 129, in get_pages
    doc = PDFDocument(parser, password=password, caching=caching)
  File "C:\Users\Hugo Caldeira\Anaconda3\lib\site-packages\pdfminer\pdfdocument.py", line 577, in __init__
    self._initialize_password(password)
  File "C:\Users\Hugo Caldeira\Anaconda3\lib\site-packages\pdfminer\pdfdocument.py", line 603, in _initialize_password
    handler = factory(docid, param, password)
  File "C:\Users\Hugo Caldeira\Anaconda3\lib\site-packages\pdfminer\pdfdocument.py", line 303, in __init__
    self.init()
  File "C:\Users\Hugo Caldeira\Anaconda3\lib\site-packages\pdfminer\pdfdocument.py", line 310, in init
    self.init_key()
  File "C:\Users\Hugo Caldeira\Anaconda3\lib\site-packages\pdfminer\pdfdocument.py", line 325, in init_key
    raise PDFPasswordIncorrect
pdfminer.pdfdocument.PDFPasswordIncorrect

Other errors I have run into before:

PDFSyntaxError: No /Root object! - Is this really a PDF?

1 answer:

Answer 0 (score: 0)

Use try and except.

https://docs.python.org/3.7/tutorial/errors.html#handling-exceptions

In your except clause, make sure to print the file name and the exception.
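As a rough illustration of that advice, here is a minimal sketch of a loop that wraps each extraction call in try/except, prints the file name and the exception, and carries on with the next file. The names `extract_all` and `demo_extract` are hypothetical and not part of PDFMiner; in the script from the question you would presumably pass `pdfparser` as the `extract` argument. Catching `Exception` is deliberately broad here, on the assumption that every per-file failure (wrong password, syntax error, missing file) should be skipped rather than stop the run.

```python
from pathlib import Path


def extract_all(files, extract):
    """Run `extract` on every file, skipping the ones that raise.

    Returns (results, failures): `results` maps each path to its extracted
    text, and `failures` is a list of (path, exception) pairs.
    """
    results = {}
    failures = []
    for path in files:
        try:
            results[path] = extract(path)
        except Exception as exc:  # broad on purpose: skip any per-file error
            # Report which file failed and why, then move on to the next one.
            print(f"Skipping {path}: {type(exc).__name__}: {exc}")
            failures.append((path, exc))
    return results, failures


def demo_extract(path):
    """Hypothetical stand-in for pdfparser so the sketch runs without PDFMiner."""
    if path.suffix != ".pdf":
        raise ValueError("not a PDF")
    return f"text of {path.name}"


if __name__ == "__main__":
    sample = [Path("a.pdf"), Path("b.txt"), Path("c.pdf")]
    texts, errors = extract_all(sample, demo_extract)
    print(f"{len(texts)} extracted, {len(errors)} skipped")
```

Keeping the failures in a list means the problematic PDFs can be reviewed after the run instead of being hunted down one crash at a time. Because `except Exception` does not catch `KeyboardInterrupt`, the script can still be stopped with Ctrl+C.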