HTML::TokeParser: find tags until a certain tag

Date: 2011-06-28 15:21:33

Tags: perl html-parsing

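I need to read an HTML file and find a paragraph tag that contains specific text. Once I have found that tag, I want the text from all the tags that follow it, up to the point where I reach a table tag.

Example: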

<asdf>
</asdf>
<p>THE SIGNAL TO GET INFO</p>
    <something>some good stuff in here</something>
<p>something else</p>
<ul>
    <li>something good in here for sure</li>
    <li>this too</li>
</ul>
<table> I DON'T WANT THIS </table>

I can find the first paragraph tag with HTML::TokeParser like this:

my $description = "";
my $tp = HTML::TokeParser->new(\$content) || die "Can't open: $!";

while (my $token = $tp->get_tag("p")) {
    my $paragraph = $tp->get_trimmed_text("/p");
    if ($paragraph =~ /On this page/) {
        until ((my $stop = $tp->get_token)->[1] eq "table") {
            if ( $stop->[0] eq "S" ) {
                print $stop->[0],"\n";
            }
        }
        return $description;
    } 
}

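I have tried the code above... but something is definitely wrong, because it does not even compile.

Thanks for your help.

1 answer:

Answer 0 (score: 1)

You probably want to call $tp->get_token and store the data until you see ["S", "table"…]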
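To make the token shapes concrete, here is a minimal sketch of walking tokens until the `<table>` start tag. The helper name `tokens_before_table` and the sample HTML are made up for illustration; the only library calls used are `HTML::TokeParser->new` and `get_token`. Start and end tag tokens carry the tag name in element 1, while text tokens carry the text itself there, which is why the sketch checks the token type as well.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper (not part of HTML::TokeParser): returns "TYPE:value"
# for every token seen before the first <table> start tag.
sub tokens_before_table {
    my ($html) = @_;
    my $tp = HTML::TokeParser->new(\$html) or die "Can't parse HTML";
    my @seen;
    while (my $token = $tp->get_token) {
        # For "S" and "E" tokens element 1 is the tag name;
        # for "T" tokens it is the text.
        my ($type, $value) = @$token;
        last if $type eq 'S' && $value eq 'table';
        push @seen, "$type:$value";
    }
    return @seen;
}

# Assumed sample input, only to show the token types.
print "$_\n" for tokens_before_table('<p>hi</p><table><tr><td>x</td></tr></table>');
```

Checking `$type eq 'S'` as well as the name avoids stopping early on a text token whose content happens to be the word "table".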

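You say you could not get it to work. Can you explain why, or what you are seeing? Perhaps give people a complete example.

OK, you did not provide sample output, so I made some assumptions.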

#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

my $content = "<asdf>
</asdf>
<p>THE SIGNAL TO GET INFO</p>
    <something>some good stuff in here</something>
<p>something else</p>
<ul>
    <li>something good in here for sure</li>
    <li>this too</li>
</ul>
<table> I DON'T WANT THIS </table>
";

my $description = "";
my $tp = HTML::TokeParser->new(\$content) || die "Can't open: $!";

while (my $token = $tp->get_tag("p")) {
    my $paragraph = $tp->get_trimmed_text("/p");
    if ($paragraph =~ /THE SIGNAL TO GET INFO/) {
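      while (my $toke = $tp->get_token) {
        # Stop only at the <table> start tag; comparing $toke->[1] alone
        # would also match a text token whose content is "table".
        last if $toke->[0] eq "S" && $toke->[1] eq "table";
#       print "<$toke->[0]> <$toke->[1]> <$toke->[2]> <$toke->[3]> <$toke->[4]>\n";
#       print " <".join("><",@{$toke->[3]}).">\n";
        if ($toke->[0] eq "T") {
          # Text tokens have the form ["T", $text, $is_data]
          $description .= $toke->[1];
        }
      }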
      print $description;
      last;
    }
}

Produces:

    some good stuff in here
something else

    something good in here for sure
    this too
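
If the leftover whitespace in that output is unwanted, the same loop can be wrapped in a subroutine that returns trimmed text chunks instead of one concatenated string. This is only a sketch under assumptions: `text_between` and its parameters are hypothetical names, and it uses nothing beyond `get_tag`, `get_trimmed_text` and `get_token`.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: returns the non-empty, trimmed text chunks found
# between the first <p> whose text matches $signal and the next start tag
# named $stop_tag.
sub text_between {
    my ($html, $signal, $stop_tag) = @_;
    my $tp = HTML::TokeParser->new(\$html) or die "Can't parse HTML";
    my @chunks;

    while ($tp->get_tag('p')) {
        # Skip paragraphs that do not contain the signal text.
        next unless $tp->get_trimmed_text('/p') =~ $signal;

        while (my $token = $tp->get_token) {
            my ($type, $value) = @$token;
            last if $type eq 'S' && $value eq $stop_tag;
            next unless $type eq 'T';      # keep text tokens only
            $value =~ s/^\s+|\s+$//g;      # trim surrounding whitespace
            push @chunks, $value if length $value;
        }
        last;                              # only the first matching <p>
    }
    return @chunks;
}

# Assumed sample input, shortened from the example in the question.
my $sample = "<p>THE SIGNAL TO GET INFO</p>\n"
           . "<p>something else</p>\n"
           . "<ul><li>this too</li></ul>\n"
           . "<table> I DON'T WANT THIS </table>\n";
print "$_\n" for text_between($sample, qr/THE SIGNAL TO GET INFO/, 'table');
```

Returning a list leaves it to the caller to decide how to join the pieces, for example with `join("\n", ...)`.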