I want to extract URL information from ask.com.
This is the tag:
<p class="PartialSearchResults-item-url">maps.google.com </p>
This is the code I tried, but it extracts junk as well:
$p = HTML::TokeParser->new(\$rrs);
while ($p->get_tag("p")) {
my @link = $p->get_trimmed_text("/p");
foreach(@link) { print "$_\n"; }
open(OUT, ">>askurls.txt"); print OUT "@link\n"; close(OUT);
}
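I only want the domain URLs, for example maps.google.com.
Instead, it also extracts Source, Images and the text of every other <p> element, which fills askurls.txt with irrelevant information.
Added:
askurls.txt ends up filled with this: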
Videos
Change Settings
OK
Sites Google
Sites Google.com Br
Google
Cookie Policy
assistant.google.com
Meet your Google Assistant. Ask it questions. Tell it to do things. It's your own personal Google, always ready to help whenever you need it.
www.google.com/drive
Safely store and share your photos, videos, files and more in the cloud. Your first 15 GB of storage are free with a Google account.
translate.google.com
Google's free service instantly translates words, phrases, and web pages between English and over 100 other languages.
duo.google.com
Answer 0 (score: 4)
You can use a simple regular expression to pull out just the part you want:
use strict;
use warnings;
my $text = <<'HTML'; # we are creating example data using a heredoc
<p class="PartialSearchResults-item-url"> maps.google.com </p>
<p class="PartialSearchResults-item-url">example.com</p>
HTML
while ($text =~ m/class="PartialSearchResults-item-url">(.*?)<\/p>/g) { # with /g, the loop visits every match in $text
print $1."\n";
}
If you are not sure whether there is whitespace around the domain inside the tag (as here: <p class="PartialSearchResults-item-url">maps.google.com </p>), you can use \s* like this:
m/class="PartialSearchResults-item-url">\s*(.*?)\s*<\/p>/g # here we are checking if there is space before and after the url
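As a minimal, self-contained sketch of the whitespace-tolerant pattern above: the helper name extract_item_urls and the sample HTML are illustrative assumptions, not part of the original script. The /s flag is added here so that the match would still work if the tag's text were split across lines.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the text of every
# <p class="PartialSearchResults-item-url"> element. The \s* on either
# side of the capture group strips surrounding whitespace.
sub extract_item_urls {
    my ($html) = @_;
    my @urls;
    while ($html =~ m{class="PartialSearchResults-item-url">\s*(.*?)\s*</p>}gs) {
        push @urls, $1;
    }
    return @urls;
}

# Made-up sample input, including an unrelated <p> that should be ignored.
my $sample = <<'HTML';
<p class="PartialSearchResults-item-url"> maps.google.com </p>
<p class="other">Cookie Policy</p>
<p class="PartialSearchResults-item-url">example.com</p>
HTML

print "$_\n" for extract_item_urls($sample);
```

Because the pattern is anchored on the class attribute, paragraphs with any other class never reach the capture group.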
If you want to check that the domain is valid, you can use is_domain() from the Data::Validate::Domain module:
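# previous script
use Data::Validate::Domain qw(is_domain);
while ($text =~ m/class="PartialSearchResults-item-url">\s*(.*?)\s*<\/p>/g) { # \s* trims whitespace, which would otherwise make is_domain() fail
    my $domain = $1; # copy the capture before calling any other code
    if (is_domain($domain)) {
        print "$domain\n";
    }
}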
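If you would rather stay with HTML::TokeParser, as in the question, one possible approach is to inspect the class attribute that get_tag() returns and skip every <p> that is not a result URL. The following is only a sketch under assumptions: the helper name item_urls_from_html and the sample HTML are made up for illustration, and $rrs stands in for the page content fetched in the question.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: collect the trimmed text of each <p> whose class
# attribute is exactly the one shown in the question.
sub item_urls_from_html {
    my ($html) = @_;
    my $p = HTML::TokeParser->new(\$html) or die "Cannot parse HTML";
    my @urls;
    while (my $tag = $p->get_tag("p")) {
        # For a start tag, get_tag returns [$tag, $attr, $attrseq, $text];
        # element 1 is a hash ref of the tag's attributes.
        my $class = $tag->[1]{class} // '';
        next unless $class eq 'PartialSearchResults-item-url';
        push @urls, $p->get_trimmed_text("/p");
    }
    return @urls;
}

my $rrs = <<'HTML';   # made-up stand-in for the fetched page
<p class="PartialSearchResults-item-url"> maps.google.com </p>
<p>Cookie Policy</p>
<p class="PartialSearchResults-item-url">example.com</p>
HTML

my @urls = item_urls_from_html($rrs);

# Open the output file once, with a lexical handle and error checking,
# instead of reopening it for every paragraph.
open(my $out, '>>', 'askurls.txt') or die "Cannot open askurls.txt: $!";
print {$out} "$_\n" for @urls;
close($out) or die "Cannot close askurls.txt: $!";
```

The exact string comparison assumes the element carries only that one class; if the page adds further class names to the attribute, a regex test on $class would be needed instead.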