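I'd like to be able to detect a pattern in a PDF and flag it somehow. For example, in this PDF there is the string *2. I'd like to parse the PDF, detect all instances of *[integer], and do something to call attention to the matches (such as highlighting them in yellow or adding a symbol in the margin).
I would prefer to do this in Python, but I'm open to other languages. So far I've been able to use pyPdf to read the PDF's text, and I can use a regular expression to detect the pattern. What I haven't figured out is how to flag the matches and re-save the PDF.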
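For reference, the detection step the question describes can be sketched as follows. This is only a minimal illustration that assumes the page text has already been extracted into a plain string; the names `MARKER_RE` and `find_markers` are made up for this example and do not come from pyPdf.

```python
import re

# A literal asterisk followed by one or more digits, e.g. "*2".
# \d+ (not \d*) is used so that a bare "*" is not reported.
MARKER_RE = re.compile(r"\*\d+")

def find_markers(text):
    """Return (start, end, matched_text) for every *[integer] in text."""
    return [(m.start(), m.end(), m.group()) for m in MARKER_RE.finditer(text)]

if __name__ == "__main__":
    sample = "The court held otherwise. *2 See also the discussion at *13."
    for start, end, token in find_markers(sample):
        print(start, end, token)
```

The character offsets are returned because flagging a match later requires knowing which characters (and therefore which positions on the page) it covers.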
Answer 0: (score: 5)
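Either people are not interested or Python is not up to it, so here is a Perl solution :-). Seriously, as noted above, you don't need to "alter strings". PDF annotations are the solution for you. I had a small project involving annotations not long ago, and some of the code comes from there. However, my content parser was not universal, and you don't need full-blown parsing anyway (that is, the ability to alter content and write it back), so I resorted to an external tool. The PDF library I use is somewhat low-level, but I don't mind. It also means you are expected to have some knowledge of PDF internals to understand what is going on. Otherwise, just use the tool.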
Here is a shot of marking, for example, all gerunds in the OP's file with the command
perl pdf_hl.pl -f westlaw.pdf -p '\S*ing'
The code (the comments inside are worth reading, too):
use strict;
use warnings;
use XML::Simple;
use CAM::PDF;
use Getopt::Long;
use Regexp::Assemble;
#####################################################################
#
# This is a PDF highlight mark-up tool.
# Though fully functional, it's still a prototype proof-of-concept.
# Please don't feed it with non-pdf files or patterns like '\d*'
# (because you probably want '\d+', don't you?).
#
# Requires muPDF-tools installed and in the PATH, plus some CPAN modules.
#
# ToDo:
# - error handling is primitive if any.
# - cropped files (CropBox) are processed incorrectly. Fix it.
# - of course there can be other useful parameters.
# - allow loading them from file.
# - allow searching across lines (e.g. for multi-word patterns)
#   and certainly across "spans" within a line (see mudraw output).
# - multi-color mark-up, not just yellow.
# - control over output file name.
# - compress output (use cleanoutput method instead of output,
#   plus more robust (think compressed object streams) compressors
#   may be useful).
# - file list processing.
# - annotations are not just colorful marks on the page, their
#   dictionaries can contain all sorts of useful information, which may
#   be extracted automatically further up the food chain i.e. by
#   whoever consumes these files (date, time, author, comments, actual
#   text below, etc., etc.), plus think of customized appearance streams,
#   placing them on layers, etc.
# - ???
#
# Most complexity in the code comes from adding an appearance
# dictionary (AP). You can safely delete it, because most viewers don't
# need AP for standard annotations. Ironically, muPDF-viewer wants it
# (otherwise highlight placement is not 100% correct), and since I relied
# on muPDF-tools, I thought it proper to create PDFs consumable by
# their viewer... Firefox wants AP too, btw.
#
#####################################################################
my ($file, $csv);
my ($c_flag, $w_flag) = (0, 1);

GetOptions('-f=s' => \$file, '-p=s' => \$csv,
           '-c!' => \$c_flag, '-w!' => \$w_flag)
    and defined($file)
    and defined($csv)
        or die "\nUsage: perl $0 -f FILE -p LIST -c -w\n\n",
               "\t-f\t\tFILE\t PDF file to annotate\n",
               "\t-p\t\tLIST\t comma-separated patterns\n",
               "\t-c or -noc\t\t be case sensitive (default = no)\n",
               "\t-w or -now\t\t whole words only (default = yes)\n";

my $re = Regexp::Assemble->new
    ->add(split(',', $csv))
    ->anchor_word($w_flag)
    ->flags($c_flag ? '' : 'i')
    ->re;
my $xml = qx/mudraw -ttt $file/;
my $tree = XMLin($xml, ForceArray => [qw/page block line span char/]);
my $pdf = CAM::PDF->new($file);
sub __num_nodes_list {
    my $precision = shift;
    [ map {CAM::PDF::Node->new('number', sprintf("%.${precision}f", $_))} @_ ]
}
sub add_highlight {
    my ($idx, $x1, $y1, $x2, $y2) = @_;
    my $p = $pdf->getPage($idx);

    # mirror vertically to get to normal cartesian plane
    my ($X1, $Y1, $X2, $Y2) = $pdf->getPageDimensions($idx);
    ($x1, $y1, $x2, $y2) = ($X1 + $x1, $Y2 - $y2, $X1 + $x2, $Y2 - $y1);
    # corner radius
    my $r = 2;

    # AP appearance stream
    my $s = "/GS0 gs 1 1 0 rg 1 1 0 RG\n";
    $s .= "1 j @{[sprintf '%.0f', $r * 2]} w\n";
    $s .= "0 0 @{[sprintf '%.1f', $x2 - $x1]} ";
    $s .= "@{[sprintf '%.1f',$y2 - $y1]} re B\n";

    my $highlight = CAM::PDF::Node->new('dictionary', {
        Subtype => CAM::PDF::Node->new('label', 'Highlight'),
        Rect => CAM::PDF::Node->new('array',
            __num_nodes_list(1, $x1 - $r, $y1 - $r, $x2 + $r * 2, $y2 + $r * 2)),
        QuadPoints => CAM::PDF::Node->new('array',
            __num_nodes_list(1, $x1, $y2, $x2, $y2, $x1, $y1, $x2, $y1)),
        BS => CAM::PDF::Node->new('dictionary', {
            S => CAM::PDF::Node->new('label', 'S'),
            W => CAM::PDF::Node->new('number', 0),
        }),
        Border => CAM::PDF::Node->new('array',
            __num_nodes_list(0, 0, 0, 0)),
        C => CAM::PDF::Node->new('array',
            __num_nodes_list(0, 1, 1, 0)),
        AP => CAM::PDF::Node->new('dictionary', {
            N => CAM::PDF::Node->new('reference',
                $pdf->appendObject(undef,
                    CAM::PDF::Node->new('object',
                        CAM::PDF::Node->new('dictionary', {
                            Subtype => CAM::PDF::Node->new('label', 'Form'),
                            BBox => CAM::PDF::Node->new('array',
                                __num_nodes_list(1, -$r, -$r, $x2 - $x1 + $r * 2,
                                    $y2 - $y1 + $r * 2)),
                            Resources => CAM::PDF::Node->new('dictionary', {
                                ExtGState => CAM::PDF::Node->new('dictionary', {
                                    GS0 => CAM::PDF::Node->new('dictionary', {
                                        BM => CAM::PDF::Node->new('label',
                                            'Multiply'),
                                    }),
                                }),
                            }),
                            StreamData => CAM::PDF::Node->new('stream', $s),
                            Length => CAM::PDF::Node->new('number', length $s),
                        }),
                    ),
                ,0),
            ),
        }),
    });

    $p->{Annots} ||= CAM::PDF::Node->new('array', []);
    push @{$pdf->getValue($p->{Annots})}, $highlight;

    $pdf->{changes}->{$p->{Type}->{objnum}} = 1
}
my $page_index = 1;
for my $page (@{$tree->{page}}) {
    for my $block (@{$page->{block}}) {
        for my $line (@{$block->{line}}) {
            for my $span (@{$line->{span}}) {
                my $string = join '', map {$_->{c}} @{$span->{char}};
                while ($string =~ /$re/g) {
                    # $-[0] and $+[0] are the start and end offsets of the
                    # match, so they index its first and last characters
                    my ($x1, $y1) =
                        split ' ', $span->{char}->[$-[0]]->{bbox};
                    my (undef, undef, $x2, $y2) =
                        split ' ', $span->{char}->[$+[0] - 1]->{bbox};
                    add_highlight($page_index, $x1, $y1, $x2, $y2)
                }
            }
        }
    }
    $page_index ++
}

# insert '++' before the 4-character extension, e.g. westlaw++.pdf
# (the /r substitution flag requires Perl 5.14 or later)
$pdf->output($file =~ s/(.{4}$)/++$1/r);
__END__
P.S. I've tagged the question with 'Perl', to perhaps get some feedback (code corrections, etc.) from the community.
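For readers who would rather follow the geometry in Python (the asker's preferred language), the two steps performed by the main loop and by add_highlight can be sketched as below: take the bounding boxes of the first and last characters of each match, then mirror the result vertically into PDF user space. This is a minimal sketch under stated assumptions, not a port of the script. The function names `match_boxes` and `to_pdf_space` and the `(char, bbox)` input shape are hypothetical; they are not part of mudraw or CAM::PDF, and each entry is assumed to hold exactly one character so that match offsets line up with list indices.

```python
import re

def match_boxes(chars, pattern):
    """chars: list of (character, (x0, y0, x1, y1)) for one text span,
    with the origin at the top-left of the page (hypothetical input shape).
    Returns one bounding box per non-empty regex match."""
    text = "".join(c for c, _ in chars)
    boxes = []
    for m in re.finditer(pattern, text):
        if m.end() == m.start():
            continue  # ignore empty matches, e.g. from a pattern like \d*
        first = chars[m.start()][1]    # bbox of the first matched character
        last = chars[m.end() - 1][1]   # bbox of the last matched character
        boxes.append((first[0], first[1], last[2], last[3]))
    return boxes

def to_pdf_space(box, page_box):
    """Mirror a top-left-origin box vertically into bottom-left-origin
    PDF coordinates. page_box is (left, bottom, right, top)."""
    x0, y0, x1, y1 = box
    left, _bottom, _right, top = page_box
    return (left + x0, top - y1, left + x1, top - y0)

if __name__ == "__main__":
    # Fake span "see *2": every character is 10 units wide on one line.
    span = [(c, (10 * i, 100, 10 * i + 10, 112)) for i, c in enumerate("see *2")]
    for box in match_boxes(span, r"\*\d+"):
        print(to_pdf_space(box, (0, 0, 612, 792)))
```

Like the Perl code, this only handles matches that lie within a single span, so a pattern that straddles two spans or two lines would be missed.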
Answer 1: (score: 1)
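This is non-trivial. The problem is that PDF files aren't really meant to be "updated" at any granularity smaller than a page. You basically have to parse the page, adjust the PostScript rendering, and then write it back out. I don't think PyPDF supports what you want to do.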
If "all" you want to do is add highlighting, you can probably just use the annotation dictionary. See the PDF specification for details.
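To give an idea of what such an annotation dictionary looks like, here is a minimal sketch that only builds its PDF source text. The helper name `highlight_annotation` is made up for illustration. The keys used (Subtype, Rect, QuadPoints, C) are the standard ones for a Highlight annotation, and the QuadPoints order follows the one used in the Perl answer. The sketch does not write anything into a file: the resulting object would still have to be added to the PDF and referenced from the page's Annots array by whatever library you use.

```python
def highlight_annotation(x0, y0, x1, y1, color=(1, 1, 0)):
    """Return the PDF source text of a Highlight annotation dictionary
    covering the rectangle (x0, y0)-(x1, y1) in PDF user space
    (origin at the bottom-left). The default color is RGB yellow."""
    def nums(values):
        # Compact number formatting: 40 -> "40", 40.5 -> "40.5"
        return " ".join(f"{v:g}" for v in values)

    # Corner order: top-left, top-right, bottom-left, bottom-right
    quad = (x0, y1, x1, y1, x0, y0, x1, y0)
    return (
        "<< /Type /Annot /Subtype /Highlight "
        f"/Rect [{nums((x0, y0, x1, y1))}] "
        f"/QuadPoints [{nums(quad)}] "
        f"/C [{nums(color)}] >>"
    )

if __name__ == "__main__":
    print(highlight_annotation(40, 680, 60, 692))
```

Because the annotation sits on top of the page rather than inside its content stream, the existing page content is left untouched.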
You might be able to do this with pyPDF2, but I haven't looked at it closely.