Parsing PDF metadata dates does not work for all PDFs

Time: 2019-03-13 19:10:01

Tags: python datetime parsing pdf metadata

I am trying to use pdfminer to get the modification date of several PDFs:

import os
import re
from datetime import datetime
from pdfminer3.pdfparser import PDFParser
from pdfminer3.pdfdocument import PDFDocument


# This function converts the date string to a datetime object
def get_pdf_date(pd):
    dtformat = "%Y%m%d%H%M%S"
    clean = pd.decode("utf-8").replace("D:", "").split('+')[0]
    return datetime.strptime(re.sub('[^0-9]', '', clean), dtformat)


path = "C:\\Users\\asus\\Desktop\\storage"
for file in os.listdir(path):
    try:
        fp = open(os.path.join(path, file), "rb")
        parser = PDFParser(fp)
        doc = PDFDocument(parser)
        pdf_creation_date = doc.info[0]["CreationDate"]
        print(str(pdf_creation_date) + ", " + str(get_pdf_date(pdf_creation_date)))
    except Exception as e:
        print(str(e) + " => " + str(pdf_creation_date)) 

This is the output I get:

b"D:20151004081456+01'00'", 2015-10-04 08:14:56
b'D:20161029124239', 2016-10-29 12:42:39
b"D:20160727173724+05'30'", 2016-07-27 17:37:24
b"D:20170526150059+05'30'", 2017-05-26 15:00:59
b'D:20190218122459', 2019-02-18 12:24:59
unconverted data remains: 0600 => b"D:20151017020552-06'00'"
b"D:20180302120823+00'00'", 2018-03-02 12:08:23
b"D:20150317171945+05'30'", 2015-03-17 17:19:45
b"D:20140405150714+01'00'", 2014-04-05 15:07:14
b'D:20190313161243Z', 2019-03-13 16:12:43
b'D:20160523204913', 2016-05-23 20:49:13
b"D:20150716000009+05'30'", 2015-07-16 00:00:09
b"D:20150923145114+05'30'", 2015-09-23 14:51:14
b"D:20150703193510+05'30'", 2015-07-03 19:35:10
b"D:20170907220317+16'33'", 2017-09-07 22:03:17
unconverted data remains: 1200 => b"D:20160407192544-12'00'"

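As you can see, the parsing function I use does not always work, apparently because each PDF has its own date syntax. However, I noticed that Foxit Reader always reads the metadata correctly.

So I would like to know how to implement something like this.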

1 answer:

Answer 0: (score: 0)

The dates that fail have a time zone offset with a minus sign:

D:20160407192544-12'00'

This line in your code only handles a plus sign (or, implicitly, no time zone offset at all):

clean = pd.decode("utf-8").replace("D:", "").split('+')[0]
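To make the failure concrete, here is a small trace of that line on one of the failing values from the question. It only reuses the question's own steps; the variable names `raw`, `digits`, and `error` are introduced here for illustration.

```python
import re
from datetime import datetime

raw = b"D:20160407192544-12'00'"

# Same cleaning steps as in the question: there is no '+' to split on,
# so the "-12'00'" offset stays in the string.
clean = raw.decode("utf-8").replace("D:", "").split("+")[0]

# Stripping non-digits then glues the offset digits onto the timestamp.
digits = re.sub("[^0-9]", "", clean)
print(digits)  # 201604071925441200

try:
    datetime.strptime(digits, "%Y%m%d%H%M%S")
    error = None
except ValueError as exc:
    error = str(exc)

print(error)  # unconverted data remains: 1200
```

The leftover `1200` is the offset `-12'00'` with its sign and apostrophes removed, which matches the error shown in the question's output.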

Your code needs to handle both positive and negative time zone offsets.
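One possible way to do that is sketched below. It is not necessarily how Foxit Reader does it. It assumes the date strings follow the PDF date layout `D:YYYYMMDDHHmmSSOHH'mm'`, where `O` is `+`, `-`, or `Z` and everything after the year is optional. The names `PDF_DATE` and `parse_pdf_date` are hypothetical helpers introduced here, not part of pdfminer.

```python
import re
from datetime import datetime, timedelta, timezone

# Hypothetical pattern for the PDF date layout D:YYYYMMDDHHmmSSOHH'mm'.
# Every field after the year is treated as optional.
PDF_DATE = re.compile(
    r"(?:D:)?"
    r"(?P<year>\d{4})"
    r"(?P<month>\d{2})?(?P<day>\d{2})?"
    r"(?P<hour>\d{2})?(?P<minute>\d{2})?(?P<second>\d{2})?"
    r"(?:(?P<sign>[+-])(?P<tz_hour>\d{2})'?(?:(?P<tz_minute>\d{2})'?)?"
    r"|(?P<utc>Z))?"
)


def parse_pdf_date(raw):
    """Parse a PDF date (bytes or str) into a datetime.

    Returns a timezone-aware datetime when the string carries an offset
    or 'Z', and a naive datetime otherwise.
    """
    text = raw.decode("utf-8", errors="replace") if isinstance(raw, bytes) else raw
    match = PDF_DATE.match(text.strip())
    if match is None:
        raise ValueError(f"not a PDF date string: {text!r}")
    parts = match.groupdict()

    # Missing month/day default to 1, missing time fields default to 0.
    naive = datetime(
        int(parts["year"]),
        int(parts["month"] or 1),
        int(parts["day"] or 1),
        int(parts["hour"] or 0),
        int(parts["minute"] or 0),
        int(parts["second"] or 0),
    )

    if parts["utc"]:
        return naive.replace(tzinfo=timezone.utc)
    if parts["sign"]:
        offset = timedelta(
            hours=int(parts["tz_hour"]),
            minutes=int(parts["tz_minute"] or 0),
        )
        if parts["sign"] == "-":
            offset = -offset
        return naive.replace(tzinfo=timezone(offset))
    return naive


print(parse_pdf_date(b"D:20160407192544-12'00'"))
print(parse_pdf_date(b"D:20150317171945+05'30'"))
print(parse_pdf_date(b"D:20190313161243Z"))
print(parse_pdf_date(b"D:20161029124239"))
```

In the question's loop, `get_pdf_date(pdf_creation_date)` could then be swapped for `parse_pdf_date(pdf_creation_date)`. Unlike the original function, this sketch keeps the offset instead of discarding it, so values with and without a time zone come back as aware and naive datetimes respectively and should not be compared directly. If the modification date is what you are after, the standard info-dictionary key for it is `ModDate` rather than `CreationDate`, assuming the file provides one.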