Question

I'm using TokeParser to extract tag content.

...
$text = $p->get_text("/td") ;
...

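Normally it returns the cleaned-up text. What I want is everything between td and /td, including all the other HTML elements inside. How can I do that?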

I'm using the example from this tutorial. Thanks.

In the example,

my( $tag, $attr, $attrseq, $rawtxt) = @{ $token };

I believe there is some trick involving $rawtxt.

Answer 1

HTML::TokeParser has no built-in feature for this. However, it can be done by looking at each token between the <td>s individually.

#!/usr/bin/perl
use strictures;
use HTML::TokeParser;
use 5.012;

# dispatch table with subs to handle the different types of tokens
my %dispatch = (
  S  => sub { $_[0]->[4] }, # Start tag
  E  => sub { $_[0]->[2] }, # End tag
  T  => sub { $_[0]->[1] }, # Text
  C  => sub { $_[0]->[1] }, # Comment
  D  => sub { $_[0]->[1] }, # Declaration
  PI => sub { $_[0]->[2] }, # Processing instruction
);

# create the parser
my $p = HTML::TokeParser->new( \*DATA ) or die "Can't open: $!";

# fetch all the <td>s
TD: while ( $p->get_tag('td') ) {
  # go through all tokens ...
  while ( my $token = $p->get_token ) {
    # ... but stop at the end of the current <td>
    next TD if ( $token->[0] eq 'E' && $token->[1] eq 'td' );
    # call the sub corresponding to the current type of token
    print $dispatch{$token->[0]}->($token);
  }
} continue {
  # each time next TD is called, print a newline
  print "\n";  
}

__DATA__
<html><body><table>
<tr>
<td><strong>foo</strong></td>
<td><em>bar</em></td>
<td><font size="10"><font color="#FF0000">frobnication</font></font>
<p>Lorem ipsum dolor set amet fooofooo foo.</p></td>
</tr></table></body></html>

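This program parses the HTML document in the __DATA__ section and prints everything between <td> and </td>, including the HTML. It prints one line per <td>. Let's go through it step by step.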

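After reading the documentation, I learned that each token from HTML::TokeParser has a type associated with it. There are six of these types: S, E, T, C, D and PI. The doc says:
This method will return the next token found in the HTML document, or undef at the end of the document. The token is returned as an array reference. The first element of the array will be a string denoting the type of this token: "S" for start tag, "E" for end tag, "T" for text, "C" for comment, "D" for declaration, and "PI" for process instructions. The rest of the token array depend on the type like this: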
```
["S",  $tag, $attr, $attrseq, $text]
["E",  $tag, $text]
["T",  $text, $is_data]
["C",  $text]
["D",  $text]
["PI", $token0, $text]
```
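We want to access the $text stored in these tokens, because there is no other way to get hold of the parts that look like HTML tags. I therefore created a dispatch table in %dispatch to handle them. It stores a bunch of code refs that get called later.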
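To make the token layouts a bit more tangible, here is a minimal sketch that dumps every token of a small fragment together with its raw source text. The sample HTML and the %text_index lookup are my own illustration and not part of the program above; the index per type simply follows the layouts quoted from the documentation.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Illustrative input (an assumption, not taken from the question).
my $html = '<td class="x"><b>hi</b><!-- note --></td>';

# Position of the raw source text in the token array, per token type.
my %text_index = ( S => 4, E => 2, T => 1, C => 1, D => 1, PI => 2 );

# Passing a scalar reference makes the parser read from the string.
my $p = HTML::TokeParser->new( \$html ) or die "Can't parse: $!";

my @seen;    # pairs of [ type, raw text ]
while ( my $token = $p->get_token ) {
    my $type = $token->[0];
    push @seen, [ $type, $token->[ $text_index{$type} ] ];
}

# One line per token: the type, then the text exactly as it appeared.
printf "%-2s %s\n", @$_ for @seen;
```

Joining the second column back together should reproduce the input fragment, which is exactly the property the dispatch table relies on.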
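We read the document from __DATA__, which is convenient for this example.
First, we need to fetch the <td>s using the get_tag method. @nrathaus's comment pointed me in that direction. It moves the parser to the next token after the opening <td>. We don't care what get_tag returns, since we only want the tokens that come after the <td>.
We use the get_token method to fetch the next token and do something with it: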
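- But we only want to do that until we find the corresponding closing </td>. If we see that, we call next on the outer while loop, which is labelled TD.
- At that point, the continue block gets called and prints a newline.
- If we are not at the end, the magic happens: the dispatch table. As we saw earlier, the first element of the token array ref holds the type. There is a code ref for each of these types in %dispatch. We call it with the $coderef->(@args) syntax and pass along the complete array ref $token. We print the result on the current line.
  
  Each iteration produces one piece such as <strong>, foo, </strong> and so on.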
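The dispatch step on its own can be sketched without a parser at all. The tokens below are hand-built for illustration, in the shape that get_token returns; only three of the six types are shown to keep it short.

```perl
use strict;
use warnings;

# One code ref per token type; each returns the raw text of its token.
my %dispatch = (
    S => sub { $_[0]->[4] },    # ["S", $tag, $attr, $attrseq, $text]
    E => sub { $_[0]->[2] },    # ["E", $tag, $text]
    T => sub { $_[0]->[1] },    # ["T", $text, $is_data]
);

# Hand-built tokens (illustrative only).
my @tokens = (
    [ 'S', 'strong', {}, [], '<strong>' ],
    [ 'T', 'foo', '' ],
    [ 'E', 'strong', '</strong>' ],
);

my $out = '';
for my $token (@tokens) {
    my $handler = $dispatch{ $token->[0] };    # look up the sub by type
    $out .= $handler->($token);                # call it with the array ref
}
print "$out\n";
```

Writing `$dispatch{ $token->[0] }->($token)` in one go, as the program above does, is the same thing without the intermediate $handler variable.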

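Note that this only works with a single table. If there is a table within a table (something like <td> ... <td></td> ... </td>), this will break. You would have to adjust it to remember how deep it is.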
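One way to remember the depth is a simple counter, roughly like the sketch below. The helper name outer_td_html and the %text_index lookup are made up for this illustration, and it assumes every <td> has an explicit </td>; HTML that omits the end tags would need more work.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Position of the raw source text in the token array, per token type.
my %text_index = ( S => 4, E => 2, T => 1, C => 1, D => 1, PI => 2 );

# Hypothetical helper: returns the raw inner HTML of each outermost <td>.
sub outer_td_html {
    my ($html) = @_;
    my $p = HTML::TokeParser->new( \$html ) or die "Can't parse: $!";
    my @cells;
    while ( $p->get_tag('td') ) {
        my $depth = 1;     # get_tag just consumed one opening <td>
        my $cell  = '';
        while ( my $token = $p->get_token ) {
            my ( $type, $tag ) = @{$token}[ 0, 1 ];
            $depth++ if $type eq 'S' && $tag eq 'td';    # nested <td>
            $depth-- if $type eq 'E' && $tag eq 'td';    # some </td>
            last if $depth == 0;    # this </td> closes the outer cell
            $cell .= $token->[ $text_index{$type} ];
        }
        push @cells, $cell;
    }
    return @cells;
}

my $html = '<table><tr><td>a<table><tr><td>b</td></tr></table>c</td>'
         . '<td>d</td></tr></table>';
print "$_\n" for outer_td_html($html);
```

The inner table ends up inside the first cell's string instead of cutting it short, because only the </td> that brings the counter back to zero ends the cell.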

Another approach is to use miyagawa's excellent Web::Scraper. That way, we need a lot less code:

#!/usr/bin/perl
use strictures;
use Web::Scraper;
use 5.012;

my $s = scraper {
  process "td", "foo[]" => 'HTML'; # grab the raw HTML for all <td>s
  result 'foo'; # return the array foo where the raw HTML is stored
};

my $html = do { local $/ = undef; <DATA> }; # read HTML from __DATA__
my $res = $s->scrape( $html ); # scrape

say for @$res; # print each line of HTML

This approach also handles nested (multi-dimensional) tables like a charm.

Perl HTML::TokeParser: getting the raw HTML between tags
