I'm trying to get the modification date of multiple PDFs using pdfminer.
import os
import re
from datetime import datetime
from pdfminer3.pdfparser import PDFParser
from pdfminer3.pdfdocument import PDFDocument

# This function converts the date string to a datetime object
def get_pdf_date(pd):
    dtformat = "%Y%m%d%H%M%S"
    clean = pd.decode("utf-8").replace("D:", "").split('+')[0]
    return datetime.strptime(re.sub('[^0-9]', '', clean), dtformat)

path = "C:\\Users\\asus\\Desktop\\storage"
for file in os.listdir(path):
    try:
        fp = open(os.path.join(path, file), "rb")
        parser = PDFParser(fp)
        doc = PDFDocument(parser)
        pdf_creation_date = doc.info[0]["CreationDate"]
        print(str(pdf_creation_date) + ", " + str(get_pdf_date(pdf_creation_date)))
    except Exception as e:
        print(str(e) + " => " + str(pdf_creation_date))
This is the output I get:
b"D:20151004081456+01'00'", 2015-10-04 08:14:56
b'D:20161029124239', 2016-10-29 12:42:39
b"D:20160727173724+05'30'", 2016-07-27 17:37:24
b"D:20170526150059+05'30'", 2017-05-26 15:00:59
b'D:20190218122459', 2019-02-18 12:24:59
unconverted data remains: 0600 => b"D:20151017020552-06'00'"
b"D:20180302120823+00'00'", 2018-03-02 12:08:23
b"D:20150317171945+05'30'", 2015-03-17 17:19:45
b"D:20140405150714+01'00'", 2014-04-05 15:07:14
b'D:20190313161243Z', 2019-03-13 16:12:43
b'D:20160523204913', 2016-05-23 20:49:13
b"D:20150716000009+05'30'", 2015-07-16 00:00:09
b"D:20150923145114+05'30'", 2015-09-23 14:51:14
b"D:20150703193510+05'30'", 2015-07-03 19:35:10
b"D:20170907220317+16'33'", 2017-09-07 22:03:17
unconverted data remains: 1200 => b"D:20160407192544-12'00'"
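As you can see, the parsing function I'm using doesn't always work, because each PDF seems to have its own date syntax. However, I noticed that Foxit Reader always reads the metadata correctly, as shown in the image below.
So I'd like to know how to implement something like this.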
Answer 0 (score: 0)
The dates that fail have a timezone offset with a minus sign:
D:20160407192544-12'00'
This line in your code only expects a plus sign (or, implicitly, no timezone offset at all):
clean = pd.decode("utf-8").replace("D:", "").split('+')[0]
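Your code needs to handle both positive and negative timezone offsets. With a negative offset, split('+') leaves the string intact, so the offset digits (for example 1200) survive the digit-only cleanup and strptime reports them as unconverted data.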
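One way to handle both signs is to parse the string with a regular expression instead of splitting on '+'. The following is a minimal sketch, not a definitive implementation: the name parse_pdf_date and the pattern PDF_DATE_RE are hypothetical, and it assumes every date carries the full 14-digit timestamp seen in the output above (shorter dates would need extra handling).

```python
import re
from datetime import datetime, timedelta, timezone

# Assumed shape: optional "D:" prefix, 14 digits, then either nothing,
# a "Z", or a signed HH'mm' offset such as -12'00' or +05'30'.
PDF_DATE_RE = re.compile(
    r"^\s*(?:D:)?(?P<stamp>\d{14})"
    r"\s*(?:(?P<z>Z)|(?P<sign>[+-])\s*(?P<hh>\d{2})'?(?P<mm>\d{2})?'?)?"
)

def parse_pdf_date(raw):
    """Hypothetical helper: turn a PDF date (bytes or str) into a datetime.

    Returns a timezone-aware datetime when the string carries "Z" or an
    offset, and a naive datetime when it carries neither.
    """
    text = raw.decode("utf-8", errors="replace") if isinstance(raw, bytes) else raw
    match = PDF_DATE_RE.match(text)
    if match is None:
        raise ValueError("unrecognised PDF date: %r" % (raw,))
    parsed = datetime.strptime(match.group("stamp"), "%Y%m%d%H%M%S")
    if match.group("z"):
        return parsed.replace(tzinfo=timezone.utc)
    if match.group("sign"):
        offset = timedelta(hours=int(match.group("hh")),
                           minutes=int(match.group("mm") or 0))
        if match.group("sign") == "-":
            offset = -offset
        return parsed.replace(tzinfo=timezone(offset))
    return parsed
```

In the loop from the question, calling parse_pdf_date(pdf_creation_date) in place of get_pdf_date(pdf_creation_date) should cover the two failing files. If you only want the local wall-clock time, as the original function returned, call .replace(tzinfo=None) on the result.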