How to extract exact information from multiple tags in Perl

Asked: 2019-06-09 05:33:40

Tags: html perl

I want to extract URL information from ask.com.

This is the tag:

<p class="PartialSearchResults-item-url">maps.google.com </p>

This is the code I tried, but it extracts junk information along with the URLs.

$p = HTML::TokeParser->new(\$rrs);

while ($p->get_tag("p")) {
    my @link = $p->get_trimmed_text("/p");
    foreach (@link) { print "$_\n"; }
    open(OUT, ">>askurls.txt"); print OUT "@link\n"; close(OUT);
}

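I only want the domain URLs, for example maps.google.com

But it is extracting Source, Images, and the text of every other p element, filling askurls.txt with irrelevant information.

Addendum:

askurls.txt was filled with this information: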
Videos
Change Settings
OK
Sites Google
Sites Google.com Br
Google
Cookie Policy
assistant.google.com
Meet your Google Assistant. Ask it questions. Tell it to do things. It's your own personal Google, always ready to help whenever you need it.
www.google.com/drive
Safely store and share your photos, videos, files and more in the cloud. Your first 15 GB of storage are free with a Google account.
translate.google.com
Google's free service instantly translates words, phrases, and web pages between English and over 100 other languages.
duo.google.com

1 Answer:

Answer 0 (score: 4)

You can use a simple regular expression to extract what you want:

use strict;
use warnings;

my $text = <<'HTML'; # we are creating example data using a heredoc
<p class="PartialSearchResults-item-url"> maps.google.com </p>
<p class="PartialSearchResults-item-url">example.com</p>
HTML

while ($text =~ m/class="PartialSearchResults-item-url">(.*?)<\/p>/g) { # while loop to check all the existing match for the regex
  print $1."\n";
}
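
To connect this back to the question, which appends the results to askurls.txt, here is a minimal sketch of the same regex collecting every match into a list and writing the file once. The helper name extract_urls and the sample HTML are assumptions made for illustration; only the class name and the output file name come from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: returns every URL found in the result paragraphs.
# The \s* on both sides keeps surrounding whitespace out of the capture.
sub extract_urls {
    my ($html) = @_;
    # In list context, m//g returns the captures from every match.
    return $html =~ m{class="PartialSearchResults-item-url">\s*(.*?)\s*</p>}gs;
}

# Sample data, made up for this sketch.
my $html = <<'HTML';
<p class="other">Cookie Policy</p>
<p class="PartialSearchResults-item-url"> maps.google.com </p>
<p class="PartialSearchResults-item-url">example.com</p>
HTML

my @urls = extract_urls($html);

# Open the output file once, in append mode, and check for failure.
open(my $out, '>>', 'askurls.txt') or die "Cannot open askurls.txt: $!";
print {$out} "$_\n" for @urls;
close($out) or die "Cannot close askurls.txt: $!";
```

Opening the file once outside the loop, with a lexical filehandle and three-argument open, avoids reopening askurls.txt for every paragraph as the code in the question does.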

If you are not sure whether there is whitespace around the domain inside the tag

(as here: <p class="PartialSearchResults-item-url">maps.google.com </p>)

you can use \s* like this:

m/class="PartialSearchResults-item-url">\s*(.*?)\s*<\/p>/g # here we are checking if there is space before and after the url
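
As a small illustration of the difference, the sketch below runs both patterns against a single padded tag. The variable names and the sample line are assumptions for the example.

```perl
use strict;
use warnings;

my $line = '<p class="PartialSearchResults-item-url"> maps.google.com </p>';

# Without \s*, the capture keeps the padding inside the tag.
my ($raw) = $line =~ m{class="PartialSearchResults-item-url">(.*?)</p>};

# With \s* on both sides, the padding is matched outside the capture group.
my ($trimmed) = $line =~ m{class="PartialSearchResults-item-url">\s*(.*?)\s*</p>};

print "raw:     [$raw]\n";      # prints: raw:     [ maps.google.com ]
print "trimmed: [$trimmed]\n";  # prints: trimmed: [maps.google.com]
```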

If you want to check whether the domain is valid, you can use is_domain() from the Data::Validate::Domain module

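# previous script
use Data::Validate::Domain qw(is_domain);

# \s* keeps surrounding whitespace out of the capture, which would otherwise fail validation
while ($text =~ m/class="PartialSearchResults-item-url">\s*(.*?)\s*<\/p>/g) {
   my $domain = $1; # copy $1 before calling other code that may run its own regexes
   if (is_domain($domain)) {
      print $domain."\n";
   }
}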
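
If you would rather keep HTML::TokeParser from the question instead of a regex, one possible approach is to inspect the attributes that get_tag returns and skip every p element whose class is not the one you want. This is only a sketch: the helper name result_urls and the sample HTML are made up, it assumes the HTML::Parser distribution is installed, and it assumes the class attribute contains exactly that one class name.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: returns the text of every <p> whose class is
# exactly "PartialSearchResults-item-url".
sub result_urls {
    my ($html) = @_;
    my $p = HTML::TokeParser->new(\$html) or die "Cannot parse HTML";
    my @urls;
    while (my $tag = $p->get_tag('p')) {
        # For a start tag, get_tag returns [$tagname, \%attr, \@attrseq, $text].
        my $class = $tag->[1]{class} // '';
        next unless $class eq 'PartialSearchResults-item-url';
        # get_trimmed_text strips leading and trailing whitespace.
        push @urls, $p->get_trimmed_text('/p');
    }
    return @urls;
}

# Sample data, made up for this sketch.
my $sample = <<'HTML';
<p class="PartialSearchResults-item-abstract">Meet your Google Assistant.</p>
<p class="PartialSearchResults-item-url">maps.google.com </p>
<p>Cookie Policy</p>
HTML

print "$_\n" for result_urls($sample);
```

The original loop printed the text of every p element because get_tag("p") matches all of them; checking the class attribute is what narrows the output to the URL lines.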