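I need to read an HTML file and find a paragraph tag that contains specific text. Once I have found that tag, I want the text inside every following tag, up until I reach a table tag.
Example: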
<asdf>
</asdf>
<p>THE SIGNAL TO GET INFO</p>
<something>some good stuff in here</something>
<p>something else</p>
<ul>
<li>something good in here for sure</li>
<li>this too</li>
</ul>
<table> I DON'T WANT THIS </table>
I can find the first paragraph tag with HTML::TokeParser like this:
my $description = "";
my $tp = HTML::TokeParser->new(\$content) || die "Can't open: $!";
while (my $token = $tp->get_tag("p")) {
    my $paragraph = $tp->get_trimmed_text("/p");
    if ($paragraph =~ /On this page/) {
        until ((my $stop = $tp->get_token)->[1] eq "table") {
            if ( $stop->[0] eq "S" ) {
                print $stop->[0],"\n";
            }
        }
        return $description;
    }
}
I have tried the code above, but something is definitely wrong, because it won't even compile.
Thanks for your help.
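Answer 0 (score: 1)
You probably want to call $tp->get_token and store the data until you see ["S", "table"…]
You say you can't get it to work. Can you explain why, or what you are seeing? Perhaps give people a complete example.
OK, you didn't provide sample output, so I made some assumptions.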
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

my $content = "<asdf>
</asdf>
<p>THE SIGNAL TO GET INFO</p>
<something>some good stuff in here</something>
<p>something else</p>
<ul>
<li>something good in here for sure</li>
<li>this too</li>
</ul>
<table> I DON'T WANT THIS </table>
";
my $description = "";
my $tp = HTML::TokeParser->new(\$content) || die "Can't open: $!";
# Look for the paragraph that carries the signal text.
while (my $token = $tp->get_tag("p")) {
    my $paragraph = $tp->get_trimmed_text("/p");
    if ($paragraph =~ /THE SIGNAL TO GET INFO/) {
        # Collect text tokens until the <table> start tag shows up.
        while (my $toke = $tp->get_token) {
            # Check the token type too, so text that is literally "table" does not stop the loop.
            last if $toke->[0] eq "S" && $toke->[1] eq "table";
            # print "<$toke->[0]> <$toke->[1]> <$toke->[2]> <$toke->[3]> <$toke->[4]>\n";
            # print " <".join("><",@{$toke->[3]}).">\n";
            if ($toke->[0] eq "T") {
                $description .= $toke->[1];
            }
        }
        print $description;
        last;
    }
}
Output:
some good stuff in here
something else
something good in here for sure
this too
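
Since the code in the question ends with return $description, it may be handier to wrap the loop in a subroutine. The following is only a minimal sketch under a few assumptions: text_after_signal is a hypothetical helper name (it is not part of HTML::TokeParser), the HTML is the sample from the question, and each text chunk is trimmed, with whitespace-only chunks skipped.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper (not part of HTML::TokeParser): returns the text chunks
# found between the first <p> whose text matches $signal and the next <table>.
sub text_after_signal {
    my ($html, $signal) = @_;
    my $tp = HTML::TokeParser->new(\$html) or die "Can't parse HTML";

    while ($tp->get_tag("p")) {
        next unless $tp->get_trimmed_text("/p") =~ $signal;

        my @chunks;
        while (my $token = $tp->get_token) {
            my ($type, $value) = @$token;
            last if $type eq "S" && $value eq "table";   # stop at <table>
            next unless $type eq "T";                    # keep text tokens only
            $value =~ s/^\s+|\s+$//g;                    # trim surrounding whitespace
            push @chunks, $value if length $value;       # skip whitespace-only text
        }
        return @chunks;
    }
    return;   # no paragraph matched the signal
}

# Sample input copied from the question.
my $html = <<'HTML';
<asdf>
</asdf>
<p>THE SIGNAL TO GET INFO</p>
<something>some good stuff in here</something>
<p>something else</p>
<ul>
<li>something good in here for sure</li>
<li>this too</li>
</ul>
<table> I DON'T WANT THIS </table>
HTML

print "$_\n" for text_after_signal($html, qr/THE SIGNAL TO GET INFO/);
```

Returning a list rather than one concatenated string leaves it to the caller to decide how to join the pieces, for example join(" ", ...) for a single-line description.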