Here is my input:
import subprocess
info = subprocess.run(['pdfinfo', 'test.pdf'], stdout=subprocess.PIPE)
This is the output captured in info (the bytes in info.stdout):
b'Title: Aboriginal Custom Adoption Recognition\r\nAuthor:
Department of Justice\r\nCreator: PScript5.dll Version
5.2.2\r\nProducer: Acrobat Distiller 10.0.0 (Windows)\r\nCreationDate:
Wed Feb 20 11:12:48 2013 Eastern Standard Time\r\nModDate: Wed Feb 20
11:12:55 2013 Eastern Standard Time\r\nTagged: no\r\nUserProperties:
no\r\nSuspects: no\r\nForm: none\r\nJavaScript:
no\r\nPages: 6\r\nEncrypted: no\r\nPage size: 612 x 792
pts (letter)\r\nPage rot: 0\r\nFile size: 20059
bytes\r\nOptimized: no\r\nPDF version: 1.5\r\n'
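I am looking for the integer value from Pages: 6 (that is, the number of pages in the PDF). Is there a way to get this through subprocess? If not, any suggestions on how to reliably get this value when I have a large number of PDFs?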
Answer 0: (score: 1)
Just use a regular expression to grab the integer that follows 'Pages: '.
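import re
# info.stdout is already a bytes object (not a file), so decode it directly; it has no .read()
pages = int(re.findall(r'^Pages:\s+(\d+)', info.stdout.decode('utf-8'), flags=re.MULTILINE)[0])
print(pages)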
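For the "large number of PDFs" part of the question, one possible approach is to wrap the same regex in a small helper and call it per file. This is only a sketch: the names parse_page_count, get_page_count and page_counts are made up for illustration, and it assumes pdfinfo is on your PATH and prints a "Pages:" line like the one in the question.

```python
import re
import subprocess
from pathlib import Path

# Bytes pattern so the raw pdfinfo output can be searched without decoding.
# MULTILINE lets ^ match at the start of the "Pages:" line.
PAGES_RE = re.compile(rb"^Pages:\s+(\d+)", re.MULTILINE)


def parse_page_count(pdfinfo_output: bytes) -> int:
    """Extract the page count from raw pdfinfo output (bytes)."""
    match = PAGES_RE.search(pdfinfo_output)
    if match is None:
        raise ValueError("no 'Pages:' line found in pdfinfo output")
    return int(match.group(1))


def get_page_count(pdf_path) -> int:
    """Run pdfinfo on one file and return its page count (hypothetical helper)."""
    result = subprocess.run(
        ["pdfinfo", str(pdf_path)],
        stdout=subprocess.PIPE,
        check=True,  # raise CalledProcessError if pdfinfo exits with an error
    )
    return parse_page_count(result.stdout)


def page_counts(directory) -> dict:
    """Map each *.pdf file name in `directory` to its page count."""
    pdf_files = sorted(Path(directory).glob("*.pdf"))
    return {pdf.name: get_page_count(pdf) for pdf in pdf_files}
```

Usage would be something like get_page_count('test.pdf') for a single file, or page_counts('some_folder') for a whole directory. Keeping the parsing separate from the subprocess call makes it easy to check the regex against captured output without running pdfinfo.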