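Getting HTML tags with the Perl HTML::TreeBuilder::XPath module

Time: 2016-04-13 12:36:41

Tags: perl

I wrote the script below to fetch the content of a specific news article with the HTML::TreeBuilder::XPath module and build an XML file from it. However, when I use the findvalue method, I cannot get the HTML tags.

Code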

#!/usr/bin/perl

use strict;
use warnings;

use HTML::LinkExtor;
use LWP::Simple;
use HTML::TreeBuilder::XPath;
use Term::ProgressBar;

my $url = "http://www.totalpolitics.com/blog/463546/senior-medics-call-on-uk-to-stay-but-brexiteers-have-more-to-cheer.thtml";
my $content = get $url;
die "Could not fetch $url\n" unless defined $content;

my $tree = HTML::TreeBuilder::XPath->new_from_content($content);

# findvalue() returns the text content of the matched nodes
my $title = $tree->findvalue(q{//div[@id="article"]/h1});
my $body = $tree->findvalue(q{//div[@class="article-body"]});

# Take the text of the first <strong> in the article body as the author
my $author = $tree->findnodes(q{//div[@class="article-body"]/p/strong});
$author = $author->[0]->getValue;

# Remove that text from the body; \Q...\E makes the match literal
$body =~ s/\Q$author\E//;

my $xml = '<?xml version="1.0" encoding="UTF-8" ?>';
$xml .= '<nodes>';
$xml .= '<node>';
$xml .= '<url>';
$xml .= $url;
$xml .= '</url>';
$xml .= '<title>';
$xml .= $title;
$xml .= '</title>';
$xml .= '<description>';
$xml .= "<![CDATA[$body]]>";
$xml .= '</description>';
$xml .= '<author>';
$xml .= $author;
$xml .= '</author>';
$xml .= "</node>\n";
$xml .= "</nodes>";

print $xml;

What happens?

There are no "<p>" tags in the description field.

<?xml version="1.0" encoding="UTF-8"?>
<nodes>
   <node>
      <url>http://www.totalpolitics.com/blog/463546/senior-medics-call-on-uk-to-stay-but-brexiteers-have-more-to-cheer.thtml</url>
      <title>Senior medics call on UK to stay - but Brexiteers have more to cheer</title>
      <description><![CDATA[The referendum battle steps up today with the Remain camp offering up a consortium of medics – but other interventions suggest the Brexiteers may have more to cheer in the weeks to come.A group of 188 clinicians, academics and public health leaders have written to the Times claiming the NHS would be in jeopardy if the UK were to leave the EU, losing access to “finances, staffing and exchanges”. They added:“As health professionals and researchers we write to highlight the valuable benefits of continued EU membership to the NHS, medical innovation and UK public health. We have made enormous progress over decades in international health research, health services innovation and public health. Much of this is built around shared policies and capacity across the EU.”Britain Stronger In’s decision to use medics is reflective of recent polling by Ipsos Mori, which shows 89% trust doctors to tell the truth, making them the most trusted profession in the UK. Politicians languish at the bottom of the league table on 21%.However the intervention may pale into insignificance after Brexiteers won an important battle to reveal the true number of European migrants working in Britain. The number of migrants with active national insurance (NI) numbers will now be released just weeks before the referendum.According to existing data, about 800,000 EU migrants have moved to the UK in the past four years. However over the same period about 2m EU migrants have been issued with NI numbers.Campaigners have long been calling upon the government to release figures for the number of people with active NI numbers, which they say will provide a more accurate gauge of migration levels than existing official figures.Andrew Tyrie, chairman of the Treasury select committee, said: “This has been obtained as a result of a good deal of persistence … Late, but a good deal better than never. I recognise that HMRC may have encountered some difficulties. 
So I am glad that they have found a way of resolving them.”The decision comes as the Sun reports an extraordinary outburst from David Cameron as he returned from Washington this weekend.Asked whether the prime minister was so distracted by the referendum campaign that he had taken his eye off the ball, leading to a poorly-received budget and the crisis in the steel industry, he replied:“I think you all spend too much time looking at each other’s newspapers. The world hasn't stopped turning, the Government hasn't stopped operating. You all go around setting each others' hair on fire and getting very excited about this, but it's all a lot of processology.”But new polling out today suggests the Leave campaign has the more compelling set of arguments over Europe. The Fabian Society research shows how an initial four-point leave for Remain among likely voters turns into a two-point lead for Leave once people heard both sides of the story.Research found that the two most important issues to deciding how people vote in the referendum are immigration and controlling our laws – and on both issues the public are far more convinced that leaving the EU will help solve the problem.@Tom_SmithardTotal Politics has a free weekly Friday email bulletin. Follow this link to register.]]></description>
      <author>Photo: Lynne Cameron / PA Wire / Press Association Images</author>
   </node>
</nodes>

What do I expect to see in the XML file?

I want the "<p>" tags to be included in the description field.

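1 Answer:

Answer 0: (score: 1)

The findvalue() method returns the text of the nodes found by the XPath query. So yes, I would expect it to strip all the markup tags.

What you want is the findnodes_as_string() method.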

my $body = $tree->findnodes_as_string(q{//div[@class="article-body"]});
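
To see the difference between the two methods without fetching the live page, here is a minimal sketch that runs both on a small inline document. The HTML snippet and the variable names `$html`, `$xpath`, `$text` and `$markup` are made up for illustration; only the `article-body` class is taken from the question.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical stand-in for the downloaded article page.
my $html = <<'HTML';
<html><body>
<div class="article-body"><p>First paragraph.</p><p>Second paragraph.</p></div>
</body></html>
HTML

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

my $xpath = q{//div[@class="article-body"]};

# findvalue() returns only the text content of the matched nodes.
my $text = $tree->findvalue($xpath);

# findnodes_as_string() serialises the matched nodes, markup included.
my $markup = $tree->findnodes_as_string($xpath);

print "text:   $text\n";
print "markup: $markup\n";

# Free the tree once the values have been extracted.
$tree->delete;
```

Note that the serialised string should include the matched div element itself as well as the paragraphs inside it, so you may need to strip the wrapper if you only want the inner "<p>" elements in the description.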