Question

The following code is a shortened version of an example from HTML::Parser.

#!/usr/bin/perl -w
use strict;
my $code = shift || usage();
sub edit_print { local $_ = shift; tr/a-z/n-za-m/; print } 
use HTML::Parser 3.05;
my $p = HTML::Parser->new(unbroken_text => 1,
     default_h => [ sub { print @_; }, "text" ],
     text_h    => [ \&edit_print,      "text" ],
);
my $file = shift;
$p->parse_file($file)

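This code works well, but its drawback is that it also rewrites the text inside the <script> and <head> sections. I have adapted the example above to do what I want, but unfortunately one bug remains: it rewrites the text inside the <title> tag, which I do not want rewritten.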

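Does anyone know how to write something like the above that does not mangle JavaScript, <title>, or other such sections? I am happy to use a module other than HTML::Parser if necessary.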

Answer 1

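Add start and end handlers to the parser and have them record the ancestors of the current element. Whenever the ancestors include <head> or <script>, disable rewriting.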

Keep the top of the script as it is:

#! /usr/bin/perl

use warnings;
use strict;

use HTML::Parser 3.05;

sub edit_print { local $_ = shift; tr/a-z/n-za-m/; print }

and use the following sub to create a new parser:

sub create_parser {
  my @tags;
  my $start = sub {
    my($text,$tagname) = @_;
    push @tags => $tagname;
    print $text;
  };
  my $end = sub {
    my($text,$tagname) = @_;
    die "$0: expected </$tags[-1]>, got </$tagname>"
      unless $tagname eq $tags[-1];
    pop @tags;
    print $text;
  };
  my $edit_print = sub {
    if (grep /^(head|script)$/, @tags) { print @_ }
    else                               { edit_print @_ }
  };

  HTML::Parser->new(
    unbroken_text => 1,
    default_h     => [ sub { print @_ }, "text" ],
    text_h        => [ $edit_print,      "text" ],
    start_h       => [ $start,           "text,tagname" ],
    end_h         => [ $end,             "text,tagname" ],
  );
}

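The reason for creating the parser inside a sub is that the handler callbacks are closures that share private state in @tags. This implementation lets you instantiate multiple parsers without worrying about them clobbering each other's data.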
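The point about closures sharing private state can be seen in isolation with a small sketch. The helper name make_stack is hypothetical and not part of the answer; it only mirrors how $start and $end both capture the same @tags, while each call to the enclosing sub gets a fresh array.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the answer): returns two closures that
# share one private array, the way $start and $end share @tags.
sub make_stack {
    my @items;                                   # a fresh array on every call
    my $push = sub { push @items, @_; return scalar @items };
    my $pop  = sub { return pop @items };
    return ($push, $pop);
}

my ($push_a, $pop_a) = make_stack();
my ($push_b, $pop_b) = make_stack();

$push_a->("html");
$push_a->("head");
$push_b->("body");

# Each pair of closures sees only its own array.
print $pop_a->(), "\n";   # prints "head"
print $pop_b->(), "\n";   # prints "body"
```

Back in the answer's script, the parser is created and run like this: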

my $p = create_parser;
$p->parse_file(\*DATA);

__DATA__
foo
<html>
<head>
<title>My Title</title>
<style type="text/css">
  /* don't change me */
</style>
</head>
<body>
<script type="text/javascript">
  // or me
</script>
<h1>My Document</h1>
<p>Yo.</p>
</body>
</html>

Output:

sbb
<html>
<head>
<title>My Title</title>
<style type="text/css">
  /* don't change me */
</style>
</head>
<body>
<script type="text/javascript">
  // or me
</script>
<h1>Ml Dbphzrag</h1>
<p>Yb.</p>
</body>
</html>
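One caveat with the end handler above: it dies as soon as an end tag does not match the innermost open tag. HTML::Parser reports a start event but no end event for elements written without a closing tag, such as <br> or <img>, so input like <p>one<br>two</p> would stop the script. A more forgiving sketch, assuming it is enough to count how many protected elements are currently open, might look like the following. The name create_tolerant_parser and the list of protected elements are assumptions, and the sketch further assumes that protected elements are always closed explicitly.

```perl
use strict;
use warnings;
use HTML::Parser 3.05;

sub edit_print { local $_ = shift; tr/a-z/n-za-m/; print }

# Hypothetical variant of create_parser: counts open protected elements
# instead of requiring every start tag to have a matching end tag.
sub create_tolerant_parser {
    my %protected = map { $_ => 1 } qw(head script style);  # assumed list
    my $depth = 0;   # number of protected elements currently open

    my $start = sub {
        my ($text, $tagname) = @_;
        $depth++ if $protected{$tagname};
        print $text;
    };
    my $end = sub {
        my ($text, $tagname) = @_;
        $depth-- if $protected{$tagname} && $depth > 0;
        print $text;
    };
    my $text = sub {
        if ($depth) { print @_ }        # inside a protected element
        else        { edit_print(@_) }  # safe to rewrite
    };

    return HTML::Parser->new(
        unbroken_text => 1,
        default_h     => [ sub { print @_ }, "text" ],
        text_h        => [ $text,            "text" ],
        start_h       => [ $start,           "text,tagname" ],
        end_h         => [ $end,             "text,tagname" ],
    );
}

my $p = create_tolerant_parser();
$p->parse("<p>one<br>two</p><script>var x;</script>");
$p->eof;
```

The trade-off is that this version no longer detects malformed nesting; it only tracks whether the current text sits inside one of the protected elements.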

Answer 2

Looking at your existing code, I am not sure where you are stuck:

Add a stack of booleans,
```
my @do_edit = (0);
```
and do not edit whenever $do_edit[0] is 0.
Add start_h and end_h handlers that unshift and shift values on that stack for certain element names.
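The two steps above could be sketched roughly as follows. The answer does not say which elements should switch editing on or off, so the %edit_flag mapping and the name create_flag_parser are assumptions made for illustration: <body> turns editing on, while <script> and <style> turn it off for their contents.

```perl
use strict;
use warnings;
use HTML::Parser 3.05;

sub edit_print { local $_ = shift; tr/a-z/n-za-m/; print }

# Hypothetical constructor illustrating the boolean-stack idea.
sub create_flag_parser {
    # Assumed mapping (not given in the answer): 1 = edit, 0 = leave alone.
    my %edit_flag = (body => 1, script => 0, style => 0);
    my @do_edit   = (0);   # $do_edit[0] is the top; editing starts switched off

    my $start = sub {
        my ($text, $tagname) = @_;
        unshift @do_edit, $edit_flag{$tagname} if exists($edit_flag{$tagname});
        print $text;
    };
    my $end = sub {
        my ($text, $tagname) = @_;
        # Never drop the initial entry, even on stray end tags.
        shift @do_edit if exists($edit_flag{$tagname}) && @do_edit > 1;
        print $text;
    };
    my $text = sub {
        if ($do_edit[0]) { edit_print(@_) }
        else             { print @_ }
    };

    return HTML::Parser->new(
        unbroken_text => 1,
        default_h     => [ sub { print @_ }, "text" ],
        text_h        => [ $text,            "text" ],
        start_h       => [ $start,           "text,tagname" ],
        end_h         => [ $end,             "text,tagname" ],
    );
}

my $p = create_flag_parser();
$p->parse("<title>keep</title><body><p>yo</p><script>x</script></body>");
$p->eof;
```

Because the stack starts at 0, text outside <body> (including <title>) is left untouched in this sketch, and only elements listed in the mapping ever push or pop a value.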

How can I use HTML::Parser to rewrite the text parts of HTML without changing the <script> and <head> sections?
