检测并更改PDF中的字符串

时间:2013-10-16 22:04:43

标签: python regex perl pdf pypdf

我希望能够检测PDF中的模式并以某种方式标记它。

例如,在this PDF中,有字符串*2。我希望能够解析PDF,检测*[integer]的所有实例,并做一些事情来引起对匹配的注意(比如突出显示黄色或在边距中添加符号)。

我更喜欢在Python中这样做,但我对其他语言持开放态度。到目前为止,我已经能够使用pyPdf来阅读PDF的文本。我可以使用正则表达式来检测模式。但是我无法弄清楚如何标记匹配并重新保存PDF。

2 个答案:

答案 0 :(得分:5)

要么人们不感兴趣,要么Python没有能力,所以这里是Perl的解决方案:-)。说真的,如上所述,你不需要“改变字符串”。 PDF注释是您的解决方案。我不久前有一个带注释的小项目,有些代码来自那里。但是,我的内容解析器不是通用的,你不需要全面的解析 - 这意味着能够改变内容并将其写回。因此我使用了外部工具。我使用的PDF库有些低级,但我不介意。这也意味着,人们应该对PDF内部知识有所了解,以了解正在发生的事情。否则,只需使用该工具。

这是一个标记的镜头,例如OP命令中的所有动名词都带有命令

perl pdf_hl.pl -f westlaw.pdf -p '\S*ing'

enter image description here

代码(内部评论值得一读):

use strict;
use warnings;
use XML::Simple;
use CAM::PDF;
use Getopt::Long;
use Regexp::Assemble;

#####################################################################
#
#  This is PDF highlight mark-up tool.
#  Though fully functional, it's still a prototype proof-of-concept.
#  Please don't feed it with non-pdf files or patterns like '\d*' 
#  (because you probably want '\d+', don't you?).
#  
#  Requires muPDF-tools installed and in the PATH, plus some CPAN modules.
#
#  ToDo:
#  - error handling is primitive if any.
#  - cropped files (CropBox) are processed incorrectly. Fix it.
#  - of course there can be other useful parameters.
#  - allow loading them from file.
#  - allow searching across lines (e.g. for multi-word patterns)
#    and certainly across "spans" within a line (see mudraw output).
#  - multi-color mark-up, not just yellow.
#  - control over output file name.
#  - compress output (use cleanoutput method instead of output,
#    plus more robust (think compressed object streams) compressors 
#    may be useful).
#  - file list processing.
#  - annotations are not just colorful marks on the page, their 
#    dictionaries can contain all sorts of useful information, which may 
#    be extracted automatically further up the food chain i.e. by 
#    whoever consumes these files (date, time, author, comments, actual 
#    text below, etc., etc., plus think of customized appearence streams,
#    placing them on layers, etc..
#  - ???
#
#   Most complexity in the code comes from adding appearance 
#   dictionary (AP). You can safely delete it, because most viewers don't 
#   need AP for standard annotations. Ironically, muPDF-viewer wants it 
#   (otherwise highlight placement is not 100% correct), and since I relied 
#   on muPDF-tools, I thought it be proper to create PDFs consumable by 
#   their viewer... Firefox wants AP too, btw.
#
#####################################################################

my ($file, $csv);
my ($c_flag, $w_flag) = (0, 1);
GetOptions('-f=s' => \$file,   '-p=s' => \$csv, 
           '-c!'  => \$c_flag, '-w!'  => \$w_flag) 
    and defined($file)
    and defined($csv)
or die "\nUsage: perl $0 -f FILE -p LIST -c -w\n\n",
       "\t-f\t\tFILE\t PDF file to annotate\n",
       "\t-p\t\tLIST\t comma-separated patterns\n",
       "\t-c or -noc\t\t be case sensitive (default = no)\n",
       "\t-w or -now\t\t whole words only (default = yes)\n";
my $re = Regexp::Assemble->new
    ->add(split(',', $csv))
    ->anchor_word($w_flag)
    ->flags($c_flag ? '' : 'i')
    ->re;
my $xml = qx/mudraw -ttt $file/;
my $tree = XMLin($xml, ForceArray => [qw/page block line span char/]);
my $pdf = CAM::PDF->new($file);

sub __num_nodes_list {
    my $precision = shift;
    [ map {CAM::PDF::Node->new('number', sprintf("%.${precision}f", $_))} @_ ]
}

sub add_highlight {
    my ($idx, $x1, $y1, $x2, $y2) = @_;
    my $p = $pdf->getPage($idx);

    # mirror vertically to get to normal cartesian plane 
    my ($X1, $Y1, $X2, $Y2) = $pdf->getPageDimensions($idx);
    ($x1, $y1, $x2, $y2) = ($X1 + $x1, $Y2 - $y2, $X1 + $x2, $Y2 - $y1);
    # corner radius
    my $r = 2;

    # AP appearance stream
    my $s = "/GS0 gs 1 1 0 rg 1 1 0 RG\n";
    $s .= "1 j @{[sprintf '%.0f', $r * 2]} w\n";
    $s .= "0 0 @{[sprintf '%.1f', $x2 - $x1]} ";
    $s .= "@{[sprintf '%.1f',$y2 - $y1]} re B\n";

    my $highlight = CAM::PDF::Node->new('dictionary', {
        Subtype => CAM::PDF::Node->new('label', 'Highlight'),
        Rect => CAM::PDF::Node->new('array', 
          __num_nodes_list(1, $x1 - $r, $y1 - $r, $x2 + $r * 2, $y2 + $r * 2)),
        QuadPoints => CAM::PDF::Node->new('array', 
            __num_nodes_list(1, $x1, $y2, $x2, $y2, $x1, $y1, $x2, $y1)),
        BS => CAM::PDF::Node->new('dictionary', {
            S => CAM::PDF::Node->new('label', 'S'),
            W => CAM::PDF::Node->new('number', 0),
        }),
        Border => CAM::PDF::Node->new('array', 
            __num_nodes_list(0, 0, 0, 0)),
        C => CAM::PDF::Node->new('array', 
            __num_nodes_list(0, 1, 1, 0)),

        AP => CAM::PDF::Node->new('dictionary', {
            N => CAM::PDF::Node->new('reference', 
                $pdf->appendObject(undef, 
                    CAM::PDF::Node->new('object',
                        CAM::PDF::Node->new('dictionary', {
                            Subtype => CAM::PDF::Node->new('label', 'Form'),
                            BBox => CAM::PDF::Node->new('array',
                              __num_nodes_list(1, -$r, -$r, $x2 - $x1 + $r * 2, 
                                                 $y2 - $y1 + $r * 2)),
                            Resources => CAM::PDF::Node->new('dictionary', {
                                ExtGState => CAM::PDF::Node->new('dictionary', {
                                    GS0 => CAM::PDF::Node->new('dictionary', {
                                        BM => CAM::PDF::Node->new('label', 
                                            'Multiply'),
                                    }),
                                }),
                            }),
                            StreamData => CAM::PDF::Node->new('stream', $s),
                            Length => CAM::PDF::Node->new('number', length $s),
                        }),
                    ),
                ,0),
            ),
        }),
    });

    $p->{Annots} ||= CAM::PDF::Node->new('array', []);
    push @{$pdf->getValue($p->{Annots})}, $highlight;

    $pdf->{changes}->{$p->{Type}->{objnum}} = 1
}

my $page_index = 1;
for my $page (@{$tree->{page}}) {
    for my $block (@{$page->{block}}) {
        for my $line (@{$block->{line}}) {
            for my $span (@{$line->{span}}) {
                my $string = join '', map {$_->{c}} @{$span->{char}};
                while ($string =~ /$re/g) {
                    my ($x1, $y1) = 
                        split ' ', $span->{char}->[$-[0]]->{bbox};
                    my (undef, undef, $x2, $y2) = 
                        split ' ', $span->{char}->[$+[0] - 1]->{bbox};
                    add_highlight($page_index, $x1, $y1, $x2, $y2)
                }
            }
        }
    }
    $page_index ++
}
$pdf->output($file =~ s/(.{4}$)/++$1/r);

__END__

P.S。我用'Perl'标记了问题,可能会从社区获得一些反馈(代码更正等)。

答案 1 :(得分:1)

这不重要。问题是PDF文件并不意味着在任何小于页面的内容上“更新”。您基本上必须解析页面,调整PostScript渲染,然后将其写回。我不认为PyPDF支持做你想做的事。

如果你要做的“全部”是添加突出显示,你可以只使用注释字典。有关详细信息,请参阅PDF specification

您可以使用pyPDF2执行此操作,但我没有仔细研究过。