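I'd like a Python function that takes a PDF and returns a list of the text of the note annotations in the document. I have looked at python-poppler (https://code.launchpad.net/~poppler-python/poppler-python/trunk), but I can't figure out how to get it to give me anything useful.
I found the get_annot_mapping method and modified the supplied demo program to call it via self.current_page.get_annot_mapping(), but I have no idea what to do with an AnnotMapping object. It doesn't seem to be fully implemented: it provides only the copy method.
If any other library provides this functionality, that's fine too.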
Answer 0 (score: 17)
Just in case somebody is looking for some working code, here is the script I use.
# Python 2 script (print statements, urllib.pathname2url)
import poppler
import sys
import urllib
import os


def main():
    input_filename = sys.argv[1]
    # http://blog.hartwork.org/?p=612
    document = poppler.document_new_from_file('file://%s' % \
        urllib.pathname2url(os.path.abspath(input_filename)), None)
    n_pages = document.get_n_pages()
    all_annots = 0

    for i in range(n_pages):
        page = document.get_page(i)
        annot_mappings = page.get_annot_mapping()
        num_annots = len(annot_mappings)
        if num_annots > 0:
            for annot_mapping in annot_mappings:
                # skip hyperlinks, which are also stored as annotations
                if annot_mapping.annot.get_annot_type().value_name != 'POPPLER_ANNOT_LINK':
                    all_annots += 1
                    print 'page: {0:3}, {1:10}, type: {2:10}, content: {3}'.format(i+1, annot_mapping.annot.get_modified(), annot_mapping.annot.get_annot_type().value_nick, annot_mapping.annot.get_contents())

    if all_annots > 0:
        print str(all_annots) + " annotation(s) found"
    else:
        print "no annotations found"


if __name__ == "__main__":
    main()
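The script above builds a file:// URI by hand because document_new_from_file is given a URI rather than a plain path. As a rough sketch of the same step on Python 3, where urllib.pathname2url has moved, the standard pathlib module can do the conversion. The helper name pdf_file_uri is made up for illustration and is not part of any poppler binding.

```python
from pathlib import Path


def pdf_file_uri(filename: str) -> str:
    """Return an absolute, percent-encoded file:// URI for a local path."""
    # resolve() makes the path absolute; as_uri() adds the file:// scheme
    # and escapes characters such as spaces.
    return Path(filename).resolve().as_uri()
```

For a file named "my notes.pdf" in the current directory, the result ends in /my%20notes.pdf.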
Answer 1 (score: 3)
It turns out the bindings were incomplete. This has now been fixed. https://bugs.launchpad.net/poppler-python/+bug/397850
Answer 2 (score: 3)
Here is a working example (ported from a previous answer) that uses the Python module popplerqt5. Run it as: python3 extract.py sample.pdf
import popplerqt5
import argparse


def extract(fn):
    doc = popplerqt5.Poppler.Document.load(fn)
    annotations = []
    for i in range(doc.numPages()):
        page = doc.page(i)
        for annot in page.annotations():
            contents = annot.contents()
            if contents:
                annotations.append(contents)
                print(f'page={i + 1} {contents}')

    print(f'{len(annotations)} annotation(s) found')
    return annotations


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('fn')
    args = parser.parse_args()
    extract(args.fn)
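Answer 3 (score: 3)
The pdf-annots script can extract annotations from PDFs. It is built on pdfminer.six and produces Markdown output for highlighted text and for any comments made on it (for example, comments attached to a highlighted region, or pop-up notes). The output looks something like this: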
* Page 2 Highlight:
> Underlying text that was highlighted
Comment made on highlighted text.
* Page 3 Highlight: "Short highlighted text" -- Short comment.
* Page 4 Text: A note on the page.
The full set of command-line options is shown below.
usage: pdfannots.py [-h] [-p] [-o OUTFILE] [-n COLS] [-s [SEC [SEC ...]]] [--no-group]
                    [--print-filename] [-w COLS]
                    INFILE [INFILE ...]

Extracts annotations from a PDF file in markdown format for use in reviewing.

positional arguments:
  INFILE                PDF files to process

optional arguments:
  -h, --help            show this help message and exit

Basic options:
  -p, --progress        emit progress information
  -o OUTFILE            output file (default is stdout)
  -n COLS, --cols COLS  number of columns per page in the document (default: 2)

Options controlling output format:
  -s [SEC [SEC ...]], --sections [SEC [SEC ...]]
                        sections to emit (default: highlights, comments, nits)
  --no-group            emit annotations in order, don't group into sections
  --print-filename      print the filename when it has annotations
  -w COLS, --wrap COLS  wrap text at this many output columns
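I haven't tried it extensively yet, but so far it has worked well!
Answer 4 (score: 1)
I have never used this, nor needed this kind of functionality, but I found PDFMiner - this link has information about basic usage; maybe it is what you are looking for?
Answer 5 (score: 1)
Someone asked a similar question. I tried the code sample there, and it only worked for me after I made some functional and cosmetic changes.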
#!/usr/bin/ruby

require 'pdf-reader'

ARGV.each do |filename|
  PDF::Reader.open(filename) do |reader|
    puts "file: #{filename}"
    puts "page\tcomment"
    reader.pages.each do |page|
      annots_ref = page.attributes[:Annots]
      if annots_ref
        actual_annots = annots_ref.map { |a| reader.objects[a] }
        actual_annots.each do |actual_annot|
          unless actual_annot[:Contents].nil?
            puts "#{page.number}\t#{actual_annot[:Contents]}"
          end
        end
      end
    end
  end
end
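If you save it as pdfannot.rb, make it executable with chmod +x, and put it in your favorite PATH directory, usage is:
./pdfannot.rb <path>
This is my first time writing/editing/remixing Ruby code, so I'm very open to suggestions. HTH.
On another note, finding this question earlier would have saved me from duplicating work. I hope this question gets more attention in the future so that it is easier to find.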
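Answer 6 (score: 1)
You should take a thorough look at PyPDF2. This amazing library has a lot of potential: you can extract practically anything from a PDF with it, including images or annotations. First, check what Acrobat Reader DC (Reader) can give you for PDF annotations. Make a simple PDF, annotate it with Reader (add some comments), then in the comments tab in the upper right corner click the three horizontal dots, click Export All To Data File..., and choose the format with the xfdf extension. This creates a neat XML file that you can parse. The format is very transparent and self-explanatory.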
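As a minimal sketch of parsing such an export, the standard library's xml.etree.ElementTree is enough. This assumes the comment text sits inside a contents-richtext element under the http://ns.adobe.com/xfdf/ namespace, as in Acrobat's export; the function name xfdf_comments and the trimmed SAMPLE document are made up for illustration.

```python
import xml.etree.ElementTree as ET

XFDF_NS = "{http://ns.adobe.com/xfdf/}"


def xfdf_comments(xfdf_text):
    """Return (page, annotation type, comment text) for annotations with text."""
    root = ET.fromstring(xfdf_text)
    annots = root.find(XFDF_NS + "annots")
    results = []
    if annots is None:
        return results
    for annot in annots:
        rich = annot.find(XFDF_NS + "contents-richtext")
        if rich is None:
            continue  # e.g. a bare strike-out with no comment attached
        # itertext() flattens the nested xhtml body into plain text
        text = "".join(rich.itertext()).strip()
        if text:
            kind = annot.tag[len(XFDF_NS):]  # drop the namespace prefix
            results.append((int(annot.get("page")), kind, text))
    return results


# A trimmed-down sample in the shape Acrobat exports (attributes shortened).
SAMPLE = """<?xml version="1.0" ?>
<xfdf xmlns="http://ns.adobe.com/xfdf/" xml:space="preserve">
<annots>
<highlight page="0" subject="Highlight" title="Admin">
<contents-richtext>
<body xmlns="http://www.w3.org/1999/xhtml"><p><span>comment2</span></p></body>
</contents-richtext>
</highlight>
<strikeout page="0" subject="Cross-Out" title="Admin"/>
</annots>
</xfdf>"""
```

Calling xfdf_comments(SAMPLE) yields one entry for the highlight and skips the strike-out, which has no comment. Page numbers in the export are zero-based.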
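However, if you cannot rely on the user clicking through this and instead need to extract the same data from the PDF programmatically with Python, do not despair: there is a solution. (Inspired by Extract images from PDF without resampling, in python?)
Prerequisites:
PyPDF2 (pip install PyPDF2)
In the xfdf file mentioned above, Reader gives you content like the following: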
<?xml version="1.0" ?>
<xfdf xml:space="preserve" xmlns="http://ns.adobe.com/xfdf/">
<annots>
<caret IT="Replace" color="#0000FF" creationdate="D:20190221151519+01'00'" date="D:20190221151526+01'00'" flags="print" fringe="1.069520,1.069520,1.069520,1.069520" name="72f8d1b7-d878-4281-bd33-3a6fb4578673" page="0" rect="636.942000,476.891000,652.693000,489.725000" subject="Inserted Text" title="Admin">
<contents-richtext>
<body xfa:APIVersion="Acrobat:19.10.0" xfa:spec="2.0.2" xmlns="http://www.w3.org/1999/xhtml" xmlns:xfa="http://www.xfa.org/schema/xfa-data/1.0/">
<p dir="ltr">
<span dir="ltr" style="font-size:10.5pt;text-align:left;color:#000000;font-weight:normal;font-style:normal"> comment1</span>
</p>
</body>
</contents-richtext>
<popup flags="print,nozoom,norotate" open="no" page="0" rect="737.008000,374.656000,941.008000,488.656000"/>
</caret>
<highlight color="#FFD100" coords="183.867000,402.332000,220.968000,402.332000,183.867000,387.587000,220.968000,387.587000" creationdate="D:20190221151441+01'00'" date="D:20190221151448+01'00'" flags="print" name="a18c7fb0-0af3-435e-8c32-1af2af3c46ea" opacity="0.399994" page="0" rect="179.930000,387.126000,224.904000,402.793000" subject="Highlight" title="Admin">
<contents-richtext>
<body xfa:APIVersion="Acrobat:19.10.0" xfa:spec="2.0.2" xmlns="http://www.w3.org/1999/xhtml" xmlns:xfa="http://www.xfa.org/schema/xfa-data/1.0/">
<p dir="ltr">
<span dir="ltr" style="font-size:10.5pt;text-align:left;color:#000000;font-weight:normal;font-style:normal">comment2</span>
</p>
</body>
</contents-richtext>
<popup flags="print,nozoom,norotate" open="no" page="0" rect="737.008000,288.332000,941.008000,402.332000"/>
</highlight>
<caret color="#0000FF" creationdate="D:20190221151452+01'00'" date="D:20190221151452+01'00'" flags="print" fringe="0.828156,0.828156,0.828156,0.828156" name="6bf0226e-a3fb-49bf-bc89-05bb671e1627" page="0" rect="285.877000,372.978000,298.073000,382.916000" subject="Inserted Text" title="Admin">
<popup flags="print,nozoom,norotate" open="no" page="0" rect="737.008000,268.088000,941.008000,382.088000"/>
</caret>
<strikeout IT="StrikeOutTextEdit" color="#0000FF" coords="588.088000,497.406000,644.818000,497.406000,588.088000,477.960000,644.818000,477.960000" creationdate="D:20190221151519+01'00'" date="D:20190221151519+01'00'" flags="print" inreplyto="72f8d1b7-d878-4281-bd33-3a6fb4578673" name="6686b852-3924-4252-af21-c1b10390841f" page="0" rect="582.290000,476.745000,650.616000,498.621000" replyType="group" subject="Cross-Out" title="Admin">
<popup flags="print,nozoom,norotate" open="no" page="0" rect="737.008000,383.406000,941.008000,497.406000"/>
</strikeout>
</annots>
<f href="p1.pdf"/>
<ids modified="ABB10FA107DAAA47822FB5D311112349" original="474F087D87E7E544F6DEB9E0A93ADFB2"/>
</xfdf>
The various annotation types appear here as tags inside <annots>. Python can give you almost the same data. To get it, look at what the following script outputs:
import sys
import PyPDF2

# Written against the legacy PyPDF2 1.x API (PdfFileReader, getNumPages, getPage).
try:
    src = sys.argv[1]
except IndexError:
    src = r'/path/to/my/file.pdf'

input1 = PyPDF2.PdfFileReader(open(src, "rb"))
nPages = input1.getNumPages()

for i in range(nPages):
    page0 = input1.getPage(i)
    try:
        for annot in page0['/Annots']:
            print(annot.getObject())       # (1)
            print('')
    except KeyError:
        # there are no annotations on this page
        pass
For the same file as in the xfdf listing above, the output looks like this:
{'/Popup': IndirectObject(192, 0), '/M': u"D:20190221151448+01'00'", '/CreationDate': u"D:20190221151441+01'00'", '/NM': u'a18c7fb0-0af3-435e-8c32-1af2af3c46ea', '/F': 4, '/C': [1, 0.81961, 0], '/Rect': [179.93, 387.126, 224.904, 402.793], '/Type': '/Annot', '/T': u'Admin', '/RC': u'<?xml version="1.0"?><body xmlns="http://www.w3.org/1999/xhtml" xmlns:xfa="http://www.xfa.org/schema/xfa-data/1.0/" xfa:APIVersion="Acrobat:19.10.0" xfa:spec="2.0.2" ><p dir="ltr"><span dir="ltr" style="font-size:10.5pt;text-align:left;color:#000000;font-weight:normal;font-style:normal">comment2</span></p></body>', '/P': IndirectObject(5, 0), '/Contents': u'otrasneho', '/QuadPoints': [183.867, 402.332, 220.968, 402.332, 183.867, 387.587, 220.968, 387.587], '/Subj': u'Highlight', '/CA': 0.39999, '/AP': {'/N': IndirectObject(202, 0)}, '/Subtype': '/Highlight'}
{'/Parent': IndirectObject(191, 0), '/Rect': [737.008, 288.332, 941.008, 402.332], '/Type': '/Annot', '/F': 28, '/Open': <PyPDF2.generic.BooleanObject object at 0x02A425D0>, '/Subtype': '/Popup'}
{'/Popup': IndirectObject(194, 0), '/M': u"D:20190221151452+01'00'", '/CreationDate': u"D:20190221151452+01'00'", '/NM': u'6bf0226e-a3fb-49bf-bc89-05bb671e1627', '/F': 4, '/C': [0, 0, 1], '/Subj': u'Inserted Text', '/Rect': [285.877, 372.978, 298.073, 382.916], '/Type': '/Annot', '/P': IndirectObject(5, 0), '/AP': {'/N': IndirectObject(201, 0)}, '/RD': [0.82816, 0.82816, 0.82816, 0.82816], '/T': u'Admin', '/Subtype': '/Caret'}
{'/Parent': IndirectObject(193, 0), '/Rect': [737.008, 268.088, 941.008, 382.088], '/Type': '/Annot', '/F': 28, '/Open': <PyPDF2.generic.BooleanObject object at 0x02A42830>, '/Subtype': '/Popup'}
{'/Popup': IndirectObject(196, 0), '/M': u"D:20190221151519+01'00'", '/CreationDate': u"D:20190221151519+01'00'", '/NM': u'6686b852-3924-4252-af21-c1b10390841f', '/F': 4, '/IRT': IndirectObject(197, 0), '/C': [0, 0, 1], '/Rect': [582.29, 476.745, 650.616, 498.621], '/Type': '/Annot', '/T': u'Admin', '/P': IndirectObject(5, 0), '/QuadPoints': [588.088, 497.406, 644.818, 497.406, 588.088, 477.96, 644.818, 477.96], '/Subj': u'Cross-Out', '/IT': '/StrikeOutTextEdit', '/AP': {'/N': IndirectObject(200, 0)}, '/RT': '/Group', '/Subtype': '/StrikeOut'}
{'/Parent': IndirectObject(195, 0), '/Rect': [737.008, 383.406, 941.008, 497.406], '/Type': '/Annot', '/F': 28, '/Open': <PyPDF2.generic.BooleanObject object at 0x02A42AF0>, '/Subtype': '/Popup'}
{'/Popup': IndirectObject(198, 0), '/M': u"D:20190221151526+01'00'", '/CreationDate': u"D:20190221151519+01'00'", '/NM': u'72f8d1b7-d878-4281-bd33-3a6fb4578673', '/F': 4, '/C': [0, 0, 1], '/Rect': [636.942, 476.891, 652.693, 489.725], '/Type': '/Annot', '/RD': [1.06952, 1.06952, 1.06952, 1.06952], '/T': u'Admin', '/RC': u'<?xml version="1.0"?><body xmlns="http://www.w3.org/1999/xhtml" xmlns:xfa="http://www.xfa.org/schema/xfa-data/1.0/" xfa:APIVersion="Acrobat:19.10.0" xfa:spec="2.0.2" ><p dir="ltr"><span dir="ltr" style="font-size:10.5pt;text-align:left;color:#000000;font-weight:normal;font-style:normal">comment1</span></p></body>', '/P': IndirectObject(5, 0), '/Contents': u' pica', '/Subj': u'Inserted Text', '/IT': '/Replace', '/AP': {'/N': IndirectObject(212, 0)}, '/Subtype': '/Caret'}
{'/Parent': IndirectObject(197, 0), '/Rect': [737.008, 374.656, 941.008, 488.656], '/Type': '/Annot', '/F': 28, '/Open': <PyPDF2.generic.BooleanObject object at 0x02A42AB0>, '/Subtype': '/Popup'}
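If you examine the two, you will see that they are nearly identical. Every annotation in the xfdf file has two counterparts in the PyPDF2 output: the annotation itself and its /Popup. The /C attribute is the highlight color in RGB, scaled to floats in the range <0, 1>. /Rect defines the bounding box of the annotation on the page/spread, in points (1/72 inch) relative to the lower-left corner of the page, with values increasing to the right and upward. /M and /CreationDate are the modification and creation times. /QuadPoints is an array of [x1, y1, x2, y2, ..., xn, yn] coordinates of a line around the annotation. /Subj, /Type, /Subtype and /IT identify the type of the annotation. /T is probably the creator, and /RC is the xhtml representation of the annotation text, if any. An ink annotation would appear here with an /InkList attribute, whose data for line 1, line 2, ..., line m has the form [[L1x1, L1y1, L1x2, L1y2, ..., L1xn, L1yn], [L2x1, L2y1, ..., L2xn, L2yn], ..., [Lmx1, Lmy1, ..., Lmxn, Lmyn]].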
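To make two of these fields easier to read, here is a small sketch that turns a /C array into the #RRGGBB form used in the xfdf file and a date string such as /M into a datetime. The helper names color_to_hex and parse_pdf_date are made up for illustration, and the date parser only handles the fully specified form shown in the output above (PDF dates may also be shorter or end in Z).

```python
import re
from datetime import datetime, timedelta, timezone


def color_to_hex(rgb):
    """Convert a /C array of floats in [0, 1] to an '#RRGGBB' string."""
    return "#" + "".join("{:02X}".format(round(float(c) * 255)) for c in rgb)


# D:YYYYMMDDHHmmSS followed by a +HH'mm' or -HH'mm' offset
_PDF_DATE = re.compile(
    r"D:(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})([+-])(\d{2})'(\d{2})'?"
)


def parse_pdf_date(value):
    """Parse a fully specified PDF date such as D:20190221151448+01'00'."""
    match = _PDF_DATE.match(value)
    if match is None:
        raise ValueError("unsupported PDF date: %r" % value)
    year, month, day, hour, minute, second = (int(g) for g in match.groups()[:6])
    sign = 1 if match.group(7) == "+" else -1
    offset = sign * timedelta(hours=int(match.group(8)), minutes=int(match.group(9)))
    return datetime(year, month, day, hour, minute, second, tzinfo=timezone(offset))
```

With the highlight from the output above, color_to_hex([1, 0.81961, 0]) gives the same #FFD100 that appears as the color attribute in the xfdf file.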
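For a more detailed explanation of the individual fields returned by getObject() on the line marked (1) in the Python code above, consult https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf, in particular section 12.5, Annotations, on pages 381-413.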
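Coming back to the original question, the dictionaries printed by the script can be reduced to a plain list of comment texts. The sketch below assumes each annotation behaves like a mapping with /Subtype and /Contents keys, as in the output shown earlier; comment_texts and SAMPLE_ANNOTS are illustrative names, and plain dicts stand in for the PyPDF2 objects.

```python
def comment_texts(annotations):
    """Return the /Contents text of every non-popup annotation that has one."""
    texts = []
    for annot in annotations:
        if annot.get("/Subtype") == "/Popup":
            continue  # the popup window carries no text of its own
        contents = annot.get("/Contents")
        if contents and contents.strip():
            texts.append(contents.strip())
    return texts


# Plain dicts standing in for the objects printed by the script.
SAMPLE_ANNOTS = [
    {"/Subtype": "/Highlight", "/Contents": "otrasneho"},
    {"/Subtype": "/Popup"},
    {"/Subtype": "/Caret"},
    {"/Subtype": "/Caret", "/Contents": " pica"},
]
```

Annotations without /Contents, such as the bare caret and the strike-out, are skipped. If the rich-text version is wanted instead, /RC would have to be parsed as xhtml.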
Answer 7 (score: 1)
@JorjMcKie, the author of PyMuPDF, wrote a piece of code for me, and I made some modifications:
import fitz  # to import the PyMuPDF library
# from pprint import pprint

# getText / firstAnnot are the camelCase names from older PyMuPDF releases
# (get_text / first_annot in newer ones).


def _parse_highlight(annot: fitz.Annot, wordlist: list) -> str:
    points = annot.vertices
    quad_count = int(len(points) / 4)
    sentences = ['' for i in range(quad_count)]
    for i in range(quad_count):
        r = fitz.Quad(points[i * 4: i * 4 + 4]).rect
        words = [w for w in wordlist if fitz.Rect(w[:4]).intersects(r)]
        sentences[i] = ' '.join(w[4] for w in words)
    sentence = ' '.join(sentences)
    return sentence


def main() -> dict:
    doc = fitz.open('path/to/your/file')
    page = doc[0]

    wordlist = page.getText("words")  # list of words on page
    wordlist.sort(key=lambda w: (w[3], w[0]))  # ascending y, then x

    highlights = {}
    annot = page.firstAnnot
    i = 0
    while annot:
        if annot.type[0] == 8:  # 8 is the highlight annotation type
            highlights[i] = _parse_highlight(annot, wordlist)
            # print before incrementing i; highlights[i] does not exist afterwards
            print('> ' + highlights[i] + '\n')
            i += 1
        annot = annot.next

    # pprint(highlights)
    return highlights


if __name__ == "__main__":
    main()
The results still contain some small glitches, though:
> system upsets,
> expansion of smart grid monitoring devices that generally provide nodal voltages and power injections at fine spatial resolution,
> hurricanes to indi- vidual lightning strikes),
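One of these glitches, the "indi- vidual" split, comes from a word hyphenated across a line break in the PDF. A rough post-processing sketch is to drop a hyphen that is followed by a space between two word characters. The helper name join_hyphenated is made up, and the rule is a heuristic: it would also join text where a hyphen followed by a space was intended.

```python
import re

# a word character, a hyphen, a single space, then another word character
_HYPHEN_BREAK = re.compile(r"(\w)- (\w)")


def join_hyphenated(text):
    """Rejoin words that were split by a line-end hyphen, e.g. 'indi- vidual'."""
    return _HYPHEN_BREAK.sub(r"\1\2", text)
```

Ordinary hyphenated words such as "well-known" are left alone because no space follows the hyphen.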