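I'm trying to parse an HTML page for the URL stored in the data-expanded-url attribute.
I only need to extract one URL, and the page contains only one data-expanded-url.
Here is the relevant part of the page, where everything I'm working with lives: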
<p class="js-tweet-text tweet-text">The Air <strong>Jordan</strong> 11 Retro
<strong>Low</strong> '<strong>Nightshade</strong>' is <strong>now available</strong>
<a href="http://tt.co/5w574TicgS" rel="nofollow" dir="ltr" data-expanded-url="http://swoo.sh/1fKCmCB" class="twitter-timeline-link" target="_blank" title="http://swoo.sh/1fKCmCB" >
<span class="tco-ellipsis"></span>
<span class="invisible">http://</span>
<span class="js-display-url">swoo.sh/1fKCmCB</span>
<span class="invisible"></span><span class="tco-ellipsis">
<span class="invisible"> </span></span>
</a>
<a href="http://tt.co/Ug4qjrW9DD" class="twitter-timeline-link" data-pre-embedded="true" dir="ltr" >pic.twitter.com/Ug4qjrW9DD</a>
</p>
This is the part that contains data-expanded-url:
<a href="http://tt.co/5w574TicgS" rel="nofollow" dir="ltr" data-expanded-url="http://swoo.sh/1fKCmCB"
How can I easily extract the data-expanded-url value using Mojo::DOM, HTML::Parser, or XPath?
Answer 0: (score: 0)
Here is the code you need:
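use strict;
use warnings;
use WWW::Mechanize;
use HTML::TreeBuilder::XPath;

# Fetch the search results page.
my $mech = WWW::Mechanize->new();
my $url = "https://twitter.com/search?q=from%3Anikestore%20%22jordan%22%20%22nightshade%22%20%22low%22%20%22now%20available%22%20since%3A2014-5-2&src=typd";
$mech->get($url);

# Parse the response and read data-expanded-url from the first link in the tweet text.
# The =~ regex match is an XML::XPathEngine extension, not standard XPath.
my $tree = HTML::TreeBuilder::XPath->new_from_content( $mech->content() );
my ($link) = $tree->findvalues('//p[ @class =~ /\btweet-text\b/ ]/a[1]/@data-expanded-url');
$tree->delete();    # free the parse tree
print $link, "\n";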
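The XPath approach can also be tried offline, without the network request. The following is a minimal sketch, assuming the markup from the question is held in a string: `$html` is a hypothetical, trimmed copy of the snippet, and the query selects by presence of the attribute rather than by position, which is a variation on the answer's expression and not the answer's own code.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical inline copy of the question's markup, trimmed for brevity.
my $html = <<'HTML';
<p class="js-tweet-text tweet-text">The Air Jordan 11 Retro Low is now available
<a href="http://tt.co/5w574TicgS" data-expanded-url="http://swoo.sh/1fKCmCB" class="twitter-timeline-link">swoo.sh/1fKCmCB</a>
<a href="http://tt.co/Ug4qjrW9DD" class="twitter-timeline-link">pic.twitter.com/Ug4qjrW9DD</a>
</p>
HTML

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# Match any <a> that carries the attribute, wherever it sits in the tweet.
my ($expanded) = $tree->findvalues('//a[@data-expanded-url]/@data-expanded-url');
$tree->delete();    # free the parse tree

# Guard against pages where the attribute is absent.
print defined $expanded ? "$expanded\n" : "no data-expanded-url found\n";
```

Selecting on `@data-expanded-url` directly avoids depending on the link being the first `a` inside the paragraph, which may matter if the page layout changes.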
Answer 1: (score: 0)
Here is a Mojo::DOM example:
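use strict;
use warnings;
use feature 'say';    # say is not enabled by default
use Mojo::DOM;

# Slurp the HTML from this script's __DATA__ section.
my $html = do { local $/; <DATA> };
my $dom = Mojo::DOM->new( $html );

# Collect data-expanded-url from each timeline link, skipping links without it.
say $dom
    ->find( 'a.twitter-timeline-link' )
    ->map( attr => 'data-expanded-url' )
    ->grep( sub { defined } )
    ->join( "\n" );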
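Since the question says only one data-expanded-url exists on the page, a single-element lookup may be enough. Here is a minimal sketch using Mojo::DOM's `at` with a CSS attribute selector; `$html` is a hypothetical, trimmed copy of the question's markup, and this is one possible variation, not the answer's own code.

```perl
use strict;
use warnings;
use feature 'say';
use Mojo::DOM;

# Hypothetical inline copy of the question's markup, trimmed for brevity.
my $html = <<'HTML';
<p class="js-tweet-text tweet-text">The Air Jordan 11 Retro Low is now available
<a href="http://tt.co/5w574TicgS" data-expanded-url="http://swoo.sh/1fKCmCB" class="twitter-timeline-link">swoo.sh/1fKCmCB</a>
<a href="http://tt.co/Ug4qjrW9DD" class="twitter-timeline-link">pic.twitter.com/Ug4qjrW9DD</a>
</p>
HTML

my $dom = Mojo::DOM->new($html);

# at() returns the first element matching the selector, or undef if none match.
my $link = $dom->at('a[data-expanded-url]');
my $url  = $link ? $link->attr('data-expanded-url') : undef;

say defined $url ? $url : 'no data-expanded-url found';
```

Compared with `find` plus `grep`, this stops at the first matching element and makes the "no match" case explicit.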