Question

This is my input:

info = subprocess.run(['pdfinfo', 'test.pdf'], stdout=subprocess.PIPE)

This is the output of info:

b'Title:          Aboriginal Custom Adoption Recognition\r\nAuthor:         
Department of Justice\r\nCreator:        PScript5.dll Version 
5.2.2\r\nProducer:       Acrobat Distiller 10.0.0 (Windows)\r\nCreationDate:     
Wed Feb 20 11:12:48 2013 Eastern Standard Time\r\nModDate:        Wed Feb 20 
11:12:55 2013 Eastern Standard Time\r\nTagged:         no\r\nUserProperties: 
no\r\nSuspects:       no\r\nForm:           none\r\nJavaScript:     
no\r\nPages:          6\r\nEncrypted:      no\r\nPage size:      612 x 792 
pts (letter)\r\nPage rot:       0\r\nFile size:      20059 
bytes\r\nOptimized:      no\r\nPDF version:    1.5\r\n'

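I'm looking for the integer value in Pages: 6 (that is, the number of pages in the PDF). Is there a way to get this through subprocess? If not, any suggestions on how to reliably get this value when I have a large number of PDFs?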

Answer 1

Just use a regular expression to grab the integer after 'Pages: '.

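import re
# info.stdout is already a bytes object (subprocess.run returns a CompletedProcess), so decode it directly; it has no .read()
text = info.stdout.decode('utf-8')
print(int(re.findall(r'^Pages:\s+(\d+)', text, flags=re.MULTILINE)[0]))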
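
For the "large number of PDFs" part of the question, the same regex idea can be wrapped in a small helper and applied to every file in a folder. The following is only a sketch: the function names (parse_page_count, page_count, page_counts) are made up for illustration, and it assumes the pdfinfo executable is on the PATH and prints a "Pages:" line like the one shown in the question.

```python
import re
import subprocess
from pathlib import Path

# Match the "Pages:" line directly on bytes, so no decoding step is needed.
PAGES_RE = re.compile(rb'^Pages:\s+(\d+)', re.MULTILINE)


def parse_page_count(pdfinfo_output: bytes) -> int:
    """Extract the page count from raw pdfinfo output (hypothetical helper)."""
    match = PAGES_RE.search(pdfinfo_output)
    if match is None:
        raise ValueError('no "Pages:" line found in pdfinfo output')
    return int(match.group(1))


def page_count(pdf_path) -> int:
    """Run pdfinfo on one file; check=True raises if pdfinfo exits with an error."""
    result = subprocess.run(['pdfinfo', str(pdf_path)],
                            stdout=subprocess.PIPE, check=True)
    return parse_page_count(result.stdout)


def page_counts(folder) -> dict:
    """Map each *.pdf file name in a folder to its page count."""
    return {p.name: page_count(p) for p in sorted(Path(folder).glob('*.pdf'))}
```

Calling page_count('test.pdf') on the file from the question should then give the integer directly, and page_counts('some_folder') gives a dict for a whole directory (the folder name here is just a placeholder).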

Getting a value from subprocess output
