Question

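I am using Adobe Acrobat Pro to extract information from PDFs in XML format, and Acrobat does this particularly well. I want to extract information from about a thousand documents and then process it, so driving Acrobat by hand would be tedious. Is there a plugin that lets me call Acrobat functions (i.e. Save As XML) from a general-purpose language, ideally Python?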

Answer 1

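Perhaps you could look at pyPdf (maintained today as pypdf)? It lets Python read and manipulate Adobe PDF files. PDFMiner also supports extracting PDF content as XML. I know Perl can do this as well because I have used it before; the relevant module is CAM::PDF.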

Example:

# Requires Python 3 and the pypdf package (successor to pyPdf).
from pypdf import PdfReader, PdfWriter

output = PdfWriter()
input1 = PdfReader("document1.pdf")

# print the title of document1.pdf (metadata may be missing entirely)
title = input1.metadata.title if input1.metadata else None
print("title = %s" % title)

# add page 1 from input1 to output document, unchanged
output.add_page(input1.pages[0])

# add page 2 from input1, but rotated clockwise 90 degrees
output.add_page(input1.pages[1].rotate(90))

# add page 3 from input1, rotated the other way:
output.add_page(input1.pages[2].rotate(-90))
# alt: output.add_page(input1.pages[2].rotate(270))

# add page 4 from input1, but first add a watermark from another pdf:
page4 = input1.pages[3]
watermark = PdfReader("watermark.pdf")
page4.merge_page(watermark.pages[0])
output.add_page(page4)

# add page 5 from input1, but crop it to half size:
page5 = input1.pages[4]
page5.mediabox.upper_right = (
    page5.mediabox.right / 2,
    page5.mediabox.top / 2,
)
output.add_page(page5)

# print how many pages input1 has:
print("document1.pdf has %s pages." % len(input1.pages))

# finally, write "output" to document-output.pdf
with open("document-output.pdf", "wb") as output_stream:
    output.write(output_stream)

Also take a look at this question: python and pyPdf - how to extract text from the pages so that there are spaces between lines. It covers XML parsing of PDFs, among other things.
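
For the batch side of the question (about a thousand documents), a minimal sketch of a conversion loop might look like the following. It assumes the pdf2txt.py command-line tool that ships with pdfminer.six is on your PATH and that its XML output is good enough for your purposes; it will not match what Acrobat's Save As XML produces. The names pdfminer_command and convert_all are made up for this illustration.

```python
import subprocess
from pathlib import Path


def pdfminer_command(pdf_path, xml_path):
    # Assumption: pdfminer.six's pdf2txt.py is installed and on PATH.
    # -t xml selects XML output, -o names the output file.
    return ["pdf2txt.py", "-t", "xml", "-o", str(xml_path), str(pdf_path)]


def convert_all(src_dir, out_dir, run=subprocess.run):
    """Convert every *.pdf in src_dir to an .xml file of the same name in out_dir.

    Returns a list of (file name, error message) pairs for documents that failed,
    so one bad PDF does not stop the whole batch.
    """
    src_dir, out_dir = Path(src_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    for pdf_path in sorted(src_dir.glob("*.pdf")):
        xml_path = out_dir / (pdf_path.stem + ".xml")
        try:
            run(pdfminer_command(pdf_path, xml_path), check=True)
        except (OSError, subprocess.CalledProcessError) as exc:
            failed.append((pdf_path.name, str(exc)))
    return failed
```

Calling convert_all("pdfs", "xml") would then process a folder in one go. The run parameter exists only so that a different launcher (or a stub for testing) can be swapped in.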

Answer 2

If you are on Windows, you can talk to Acrobat using DDE commands. The pyWin32 module supports DDE calls, or you could try this standalone binding.

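You will have to work out which requests to send to Acrobat, though. (Here is some random documentation, but it does not mention XML.) The commands appear to change from version to version (or at least some things break), so pay close attention to the version you are using. Good luck.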
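
As a rough sketch of what a DDE conversation through pyWin32 could look like: the service name "acroview", the topic "control", and the DocOpen message are assumptions taken from older Acrobat interapplication documentation and may differ for your Acrobat version. Acrobat has to be running already, and this only opens a document; it does not show an XML export, since no such DDE message is documented here. The function names and the client name are invented for the example.

```python
def doc_open_command(pdf_path):
    """Build the DDE message that asks Acrobat to open a document."""
    # Assumption: Acrobat accepts messages of the form [DocOpen("full path")].
    return '[DocOpen("{}")]'.format(pdf_path)


def send_to_acrobat(commands, service="acroview", topic="control"):
    """Send each DDE command string to a running Acrobat instance (Windows only)."""
    # Imported lazily so this file still loads on machines without pyWin32.
    import win32ui  # noqa: F401  (has to be imported before dde)
    import dde

    server = dde.CreateServer()
    server.Create("PdfBatchClient")  # hypothetical client name
    conversation = dde.CreateConversation(server)
    conversation.ConnectTo(service, topic)  # service name varies by version
    for command in commands:
        conversation.Exec(command)
```

A call such as send_to_acrobat([doc_open_command(r"C:\docs\a.pdf")]) would then be the starting point for experimenting with whichever messages your Acrobat version accepts.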

Running Acrobat from Python
