Question

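I want to collect all the PDF files on my computer and extract the text from each one. The two functions I currently have do this, but some PDF files give me this error: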

raise PDFPasswordIncorrect 
pdfminer.pdfdocument.PDFPasswordIncorrect

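I raised the error in the function that opens and reads the PDF files, which seems to suppress it, but now every PDF file is skipped, including the ones that previously worked fine.

How can I make it skip only the PDF files that raise this error, rather than skipping every PDF?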

def pdfparser(x):
    try:
        raise PDFPasswordIncorrect(pdfminer.pdfdocument.PDFPasswordIncorrect)
        fp = open(x, 'rb')
        rsrcmgr = PDFResourceManager()
        retstr = io.StringIO()
        codec = 'utf-8'
        laparams = LAParams()
        device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
        # Create a PDF interpreter object.
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        # Process each page contained in the document.
    except (RuntimeError, TypeError, NameError,ValueError,IOError,IndexError,PermissionError):
         print("Error processing {}".format(name))

    for page in PDFPage.get_pages(fp):
        interpreter.process_page(page)
        data =  retstr.getvalue()

    return(data)

def pdfs(files):
    for name in files:
        try:
            IP_list = (pdfparser(name))
            keyword = re.findall(inp,IP_list)
            file_dict['keyword'].append(keyword)
            file_dict['name'].append(name.name[0:])
            file_dict['created'].append(time.ctime(name.stat().st_ctime))
            file_dict['modified'].append(time.ctime(name.stat().st_mtime))
            file_dict['path'].append(name)
            file_dict["content"].append(IP_list)
        except (RuntimeError, TypeError, NameError,ValueError,IOError,IndexError,PermissionError):
            print("Error processing {}".format(name))
        #print(file_dict)
    return(file_dict)

pdfs(files)

Answer 1

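Why are you manually raising the error that is thrown when a password-protected PDF is opened without the correct password?

Your code raises that error every single time!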
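To see why every file ends up skipped, here is a minimal sketch. `FakePasswordError` and `parse` are made-up names for illustration only; the exception class merely stands in for pdfminer's `PDFPasswordIncorrect`.

```python
class FakePasswordError(Exception):
    """Hypothetical stand-in for pdfminer's PDFPasswordIncorrect."""


def parse(name):
    try:
        # This raise runs unconditionally, for every file, password or not.
        raise FakePasswordError(name)
        return "text of " + name  # never reached
    except FakePasswordError:
        return None


# Both files are "ignored", even though neither is actually protected.
results = [parse(n) for n in ["a.pdf", "b.pdf"]]
```

Because the `raise` is the first statement in the `try` block, nothing after it can ever execute.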

Instead, you need to catch the error if it occurs and skip that file. See the corrected code:


def pdfparser(x):
    try:
        # try to open your pdf here - do not raise the error yourself!
        # if it happens, catch and handle it as well
        pass  # placeholder so that this skeleton is valid Python
    except PDFPasswordIncorrect as e:  # catch PDFPasswordIncorrect
        print("Error processing {}: {}".format(x, e))  # with all other errors
        # no sense in doing anything if you got an error until here
        return None

    # do something with your pdf and collect data
    data = []

    return(data)

def pdfs(files):
    for name in files:
        try:
            IP_list = pdfparser(name)
            if IP_list is None:  # unable to read for whatever reasons
                continue         # process next file

            # do stuff with your data if you got some

        # most of these errors are already handled inside pdfparser
        except (RuntimeError, TypeError, NameError,ValueError,
                IOError,IndexError,PermissionError):
            print("Error processing {}".format(name))

    return(file_dict)

pdfs(files)

The second try/except, the one in def pdfs(files):, can be narrowed: all file-related errors occur inside def pdfparser(x): and are handled there. The rest of your code is incomplete and references things I do not know about.
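For a fuller picture, here is a minimal sketch of the same catch-and-skip idea, assuming the pdfminer.six package and its `pdfminer.high_level.extract_text` helper. The names `collect_texts`, `extract_with_pdfminer`, and `scan_folder` are made up for this illustration and are not part of the original code. The extractor and the exception type are passed in as arguments so the loop can be tried out without any real PDFs.

```python
from pathlib import Path


def collect_texts(paths, extract, skip_on):
    """Extract text from each path, skipping only files where `extract` raises `skip_on`.

    Returns (texts, skipped): a dict mapping path -> text, and a list of skipped paths.
    """
    texts, skipped = {}, []
    for path in paths:
        try:
            texts[path] = extract(path)
        except skip_on as exc:
            # Only this file is skipped; the loop moves on to the next one.
            print("Skipping {}: {!r}".format(path, exc))
            skipped.append(path)
    return texts, skipped


def extract_with_pdfminer(path):
    """Assumption: pdfminer.six is installed; extract_text opens and parses the file."""
    from pdfminer.high_level import extract_text  # imported lazily on purpose
    return extract_text(str(path))


def scan_folder(root):
    """Hypothetical entry point: walk `root` and skip password-protected PDFs."""
    from pdfminer.pdfdocument import PDFPasswordIncorrect
    pdf_paths = sorted(Path(root).rglob("*.pdf"))
    return collect_texts(pdf_paths, extract_with_pdfminer, PDFPasswordIncorrect)
```

Calling `scan_folder(".")` would return the extracted texts together with the list of files that were skipped. Any exception other than the one named in `skip_on` still propagates, so unrelated bugs are not silently swallowed.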

How to avoid PDF files with a password error using PDFminer
