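I'm still new to file handling, especially PDFs. I have installed pdfminer.six and tested a function that extracts text from a PDF file. I also have a second function that takes a list of PDF files and calls the first function to extract all the text from each file.
The problem is that I have a lot of PDF files, and the script stops every time it hits a new error. Whether the cause is an unrecognized character, a different encoding, encryption, or something else, it is hard to find out which PDF file caused the error.
How can I make the script keep running regardless of the error type? Can I make the PDF extraction function ignore any kind of error? Or is there something missing from my code that would solve this?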
import io
import re
import time
from pathlib import Path

import pandas as pd
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage

p = Path("C:/Users/Hugo Caldeira/Desktop")
inp = r"((?<=|^)[0-9]{3}-[0-9]{2}-[0-9]{4}(?=|$))"
file_dict = {
    "name": [],
    "created": [],
    "modified": [],
    "path": [],
    "content": [],
    "keyword": [],
}
# Collect every file under p whose name ends in ".pdf".
files = list(p.rglob("*.pdf"))


def pdfparser(file):
    """Return all text extracted from one PDF file."""
    rsrcmgr = PDFResourceManager()
    retstr = io.StringIO()
    codec = "utf-8"
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    # Create a PDF interpreter object.
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    # Open the file in a with block so the handle is always closed.
    with open(file, "rb") as fp:
        # Process each page contained in the document.
        for page in PDFPage.get_pages(fp):
            interpreter.process_page(page)
    data = retstr.getvalue()
    device.close()
    return data


def pdfs(files):
    """Extract text and file metadata for every PDF and store them in file_dict."""
    for name in files:
        IP_list = pdfparser(name)
        keyword = re.findall(inp, IP_list)
        file_dict["keyword"].append(keyword)
        file_dict["name"].append(name.name)
        file_dict["created"].append(time.ctime(name.stat().st_ctime))
        file_dict["modified"].append(time.ctime(name.stat().st_mtime))
        file_dict["path"].append(name)
        file_dict["content"].append(IP_list)
    return file_dict


def to_xlsx():
    """Write the collected results to an Excel file."""
    df = pd.DataFrame.from_dict(file_dict)
    df.head()  # preview only; has no visible effect outside an interactive session
    df.to_excel("pdftest.xlsx")


if __name__ == "__main__":
    pdfs(files)
    to_xlsx()
The error I am currently getting is:
Traceback (most recent call last):
  File "c:/Users/Hugo Caldeira/Desktop/Scripts/pdf.py", line 67, in <module>
    print(pdfparser(p))
  File "c:/Users/Hugo Caldeira/Desktop/Scripts/pdf.py", line 32, in pdfparser
    fp = open(file, 'rb')
FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\Hugo Caldeira\\Desktop\\test_folder\\Desktop'

(base) C:\Users\Hugo Caldeira\Desktop\Scripts>"C:/Users/Hugo Caldeira/Anaconda3/python.exe" "c:/Users/Hugo Caldeira/Desktop/Scripts/pdf.py"

Traceback (most recent call last):
  File "c:/Users/Hugo Caldeira/Desktop/Scripts/pdf.py", line 64, in <module>
    pdfs(files)
  File "c:/Users/Hugo Caldeira/Desktop/Scripts/pdf.py", line 52, in pdfs
    IP_list = (pdfparser(name))
  File "c:/Users/Hugo Caldeira/Desktop/Scripts/pdf.py", line 42, in pdfparser
    for page in PDFPage.get_pages(fp):
  File "C:\Users\Hugo Caldeira\Anaconda3\lib\site-packages\pdfminer\pdfpage.py", line 129, in get_pages
    doc = PDFDocument(parser, password=password, caching=caching)
  File "C:\Users\Hugo Caldeira\Anaconda3\lib\site-packages\pdfminer\pdfdocument.py", line 577, in __init__
    self._initialize_password(password)
  File "C:\Users\Hugo Caldeira\Anaconda3\lib\site-packages\pdfminer\pdfdocument.py", line 603, in _initialize_password
    handler = factory(docid, param, password)
  File "C:\Users\Hugo Caldeira\Anaconda3\lib\site-packages\pdfminer\pdfdocument.py", line 303, in __init__
    self.init()
  File "C:\Users\Hugo Caldeira\Anaconda3\lib\site-packages\pdfminer\pdfdocument.py", line 310, in init
    self.init_key()
  File "C:\Users\Hugo Caldeira\Anaconda3\lib\site-packages\pdfminer\pdfdocument.py", line 325, in init_key
    raise PDFPasswordIncorrect
pdfminer.pdfdocument.PDFPasswordIncorrect
Another error I ran into earlier:
PDFSyntaxError: No /Root object! - Is this really a PDF?
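The wording of that message suggests the file may not be a valid PDF at all. As a rough sketch, one cheap way to flag such files before parsing is to look for the `%PDF-` marker near the start of the file. The names `looks_like_pdf` and `probe_bytes` are hypothetical and not part of pdfminer.six, and passing this check does not guarantee that the file will parse.

```python
def looks_like_pdf(path, probe_bytes=1024):
    """Return True if the '%PDF-' marker appears in the first bytes of the file.

    Hypothetical helper: a quick plausibility check, not a validator.
    """
    try:
        with open(path, "rb") as fh:
            return b"%PDF-" in fh.read(probe_bytes)
    except OSError:
        # Missing or unreadable files are reported as "not a PDF".
        return False
```

Files that fail the check could be listed separately instead of being handed to the extraction function.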
Answer 0 (score: 0)
Use try and except.
https://docs.python.org/3.7/tutorial/errors.html#handling-exceptions
In your except clause, make sure to print the file name and the exception.
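A minimal sketch of that idea, assuming the extraction function is passed in as a callable (for the question's code that would be `pdfparser`). The names `extract_all`, `results`, and `failures` are made up for illustration. Catching `Exception` is deliberately broad here because the goal is to keep going whatever the error type; narrower `except` clauses may be preferable once the common failures are known.

```python
def extract_all(files, extract):
    """Call extract(path) for every path and keep going when one fails.

    Returns (results, failures): results maps each path to its extracted
    text, failures is a list of (path, exception) pairs.
    """
    results = {}
    failures = []
    for path in files:
        try:
            results[path] = extract(path)
        except Exception as exc:  # broad on purpose: any error skips the file
            # Report the file name and the exception, then move on.
            print(f"Skipped {path}: {type(exc).__name__}: {exc}")
            failures.append((path, exc))
    return results, failures
```

With the question's code this could be called as `results, failures = extract_all(files, pdfparser)`. The `failures` list then shows which PDF files caused which errors.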