I'm working with PDFs in Python and using PDFMiner to access a file's metadata. I extract the information like this:
from pdfminer.pdfparser import PDFParser, PDFDocument
fp = open('diveintopython.pdf', 'rb')
parser = PDFParser(fp)
doc = PDFDocument()
parser.set_document(doc)
doc.set_parser(parser)
doc.initialize()
print(doc.info[0]['CreationDate'])
# This prints: D:20130501200439+01'00'
How can I convert D:20130501200439+01'00' into a readable format in Python?
Answer 0 (score: 6)
Is the "+01'00'" part time zone information? Ignoring that, you can create a datetime object like this:
>>> from time import mktime, strptime
>>> from datetime import datetime
...
>>> datestring = doc.info[0]['CreationDate'][2:-7]
>>> ts = strptime(datestring, "%Y%m%d%H%M%S")
>>> dt = datetime.fromtimestamp(mktime(ts))
>>> dt
datetime.datetime(2013, 5, 1, 20, 4, 39)
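The same idea can probably be written without the mktime round trip, since datetime.strptime returns a datetime directly. This is a minimal sketch under one assumption: `raw` is a stand-in for `doc.info[0]['CreationDate']` and holds the sample value from the question. The offset is still ignored, so the result is a naive datetime.

```python
from datetime import datetime

# Stand-in for doc.info[0]['CreationDate'] (sample value from the question).
raw = "D:20130501200439+01'00'"

# Skip the leading "D:" and keep the 14 digits YYYYMMDDHHMMSS.
digits = raw[2:16]

# Parse straight into a naive datetime; the "+01'00'" offset is not used.
naive = datetime.strptime(digits, "%Y%m%d%H%M%S")
print(naive.isoformat(sep=" "))  # 2013-05-01 20:04:39
```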
Answer 1 (score: 4)
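I found the format documented at http://www.verypdf.com/pdfinfoeditor/pdf-date-format.htm. I also needed to handle time zones, because I have 160k documents from all over the place to process. Here is my full solution: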
import datetime
import re
from dateutil.tz import tzutc, tzoffset

pdf_date_pattern = re.compile(''.join([
    r"(D:)?",
    r"(?P<year>\d\d\d\d)",
    r"(?P<month>\d\d)",
    r"(?P<day>\d\d)",
    r"(?P<hour>\d\d)",
    r"(?P<minute>\d\d)",
    r"(?P<second>\d\d)",
    r"(?P<tz_offset>[+\-zZ])?",  # hyphen escaped so it is a literal, not a range
    r"(?P<tz_hour>\d\d)?",
    r"'?(?P<tz_minute>\d\d)?'?"]))


def transform_date(date_str):
    """
    Convert a pdf date such as "D:20120321183444+07'00'" into a usable datetime
    http://www.verypdf.com/pdfinfoeditor/pdf-date-format.htm
    (D:YYYYMMDDHHmmSSOHH'mm')
    :param date_str: pdf date string
    :return: datetime object, or None if date_str does not match
    """
    match = re.match(pdf_date_pattern, date_str)
    if match:
        date_info = match.groupdict()
        for k, v in list(date_info.items()):  # transform values
            if v is None:
                pass
            elif k == 'tz_offset':
                date_info[k] = v.lower()  # so we can treat Z as z
            else:
                date_info[k] = int(v)
        if date_info['tz_offset'] in ('z', None):  # UTC
            date_info['tzinfo'] = tzutc()
        else:
            multiplier = 1 if date_info['tz_offset'] == '+' else -1
            # a missing hour or minute field counts as zero
            tz_seconds = 3600 * (date_info['tz_hour'] or 0) + 60 * (date_info['tz_minute'] or 0)
            date_info['tzinfo'] = tzoffset(None, multiplier * tz_seconds)
        for k in ('tz_offset', 'tz_hour', 'tz_minute'):  # no longer needed
            del date_info[k]
        return datetime.datetime(**date_info)
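For anyone who would rather not depend on dateutil, the same approach can be sketched with only the standard library. This is a rough sketch, not the code above: `PDF_DATE` and `parse_pdf_date` are names made up for this example, and `datetime.timezone` stands in for `tzoffset`/`tzutc`. Like the function above, it treats a missing offset as UTC.

```python
import re
from datetime import datetime, timedelta, timezone

# Hypothetical pattern name; same fields as the pattern above.
PDF_DATE = re.compile(
    r"(?:D:)?(?P<year>\d{4})(?P<month>\d{2})(?P<day>\d{2})"
    r"(?P<hour>\d{2})(?P<minute>\d{2})(?P<second>\d{2})"
    r"(?P<sign>[+\-zZ])?(?P<tz_hour>\d{2})?'?(?P<tz_minute>\d{2})?'?"
)


def parse_pdf_date(date_str):
    """Return a timezone-aware datetime, or None if the string does not match."""
    match = PDF_DATE.match(date_str)
    if match is None:
        return None
    parts = match.groupdict()
    sign = parts.pop("sign")
    # Missing hour or minute fields count as zero.
    tz_hour = int(parts.pop("tz_hour") or 0)
    tz_minute = int(parts.pop("tz_minute") or 0)
    if sign in ("+", "-"):
        offset = timedelta(hours=tz_hour, minutes=tz_minute)
        tz = timezone(offset if sign == "+" else -offset)
    else:
        tz = timezone.utc  # "Z", "z", or no offset at all
    return datetime(tzinfo=tz, **{k: int(v) for k, v in parts.items()})


dt = parse_pdf_date("D:20120321183444+07'00'")
print(dt.isoformat())                           # 2012-03-21T18:34:44+07:00
print(dt.astimezone(timezone.utc).isoformat())  # 2012-03-21T11:34:44+00:00
```

Converting to UTC with astimezone is handy when comparing documents that were created in different zones.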
Answer 2 (score: 0)
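I guess I don't have the rep to comment on Paul Whipp's illuminating answer, but I have amended it to handle the Y2K bug present in some older files. In those files the year 2000 was written as 19100, so the relevant line of pdf_date_pattern becomes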
r"(?P<year>191\d\d|\d\d\d\d)",
and I added an elif to the transform-values loop:
elif k == 'year' and len(v) == 5:
    date_info[k] = int('20' + v[3:])
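To see the year fix in isolation, here is a small sketch that applies the amended year group and the same five-character rule to a few sample strings. `YEAR` and `fix_year` are hypothetical names used only for this example, and the sample values are made up.

```python
import re

# Year group only: the five-character "191xx" form is tried before four digits.
YEAR = re.compile(r"(?P<year>191\d\d|\d\d\d\d)")


def fix_year(text):
    """Map 'YYYY' or the buggy '191YY' form to an int year (hypothetical helper)."""
    if len(text) == 5:
        # "19100" -> "20" + "00" -> 2000
        return int("20" + text[3:])
    return int(text)


for sample in ("19100", "19103", "2013"):
    year = YEAR.match(sample).group("year")
    print(sample, "->", fix_year(year))
# 19100 -> 2000
# 19103 -> 2003
# 2013 -> 2013
```

One caveat with this pattern: a genuine date from the 1910s, such as 19100501..., also matches the five-character alternative and would be read as the year 2000, so it only suits collections where such dates cannot occur.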