Parsing an RSS news feed does not work

Date: 2016-12-23 10:48:55

Tags: php xml parsing dom web-scraping

I want to parse the news titles and links from the following RSS page:

http://www.londonstockexchange.com/exchange/CompanyNewsRSS.html?newsSource=RNS&companySymbol=LSE

I tried the following code, but it does not work:

<?php

$xml=("http://www.londonstockexchange.com/exchange/CompanyNewsRSS.html?newsSource=RNS&companySymbol=LSE");

$xmlDoc = new DOMDocument();
$xmlDoc->load($xml);
$x=$xmlDoc->getElementsByTagName('item');

for ($i=0; $i<=5; $i++) {
  $title=$x->item($i)->getElementsByTagName('title')
  ->item(0)->childNodes->item(0)->nodeValue;
  $link=$x->item($i)->getElementsByTagName('link')
  ->item(0)->childNodes->item(0)->nodeValue;

  echo $title;
  echo $link;

}
?>

However, the same code works for fetching titles and links from other RSS pages, for example:

<?php

$xml=("https://feeds.finance.yahoo.com/rss/2.0/headline?s=bcm.v&region=US&lang=en-US");

$xmlDoc = new DOMDocument();
$xmlDoc->load($xml);
$x=$xmlDoc->getElementsByTagName('item');

for ($i=0; $i<=5; $i++) {
  $title=$x->item($i)->getElementsByTagName('title')
  ->item(0)->childNodes->item(0)->nodeValue;
  $link=$x->item($i)->getElementsByTagName('link')
  ->item(0)->childNodes->item(0)->nodeValue;

  echo $title;
  echo $link;

}
?>

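Do you have any idea how to make it work?

Thanks in advance!

2 answers:

Answer 0: (score: 1)

The site has a protection in place that rejects requests with no user agent set, so you have to use curl and fake a user agent to fetch the XML content, for example: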

$url = "http://www.londonstockexchange.com/exchange/CompanyNewsRSS.html?newsSource=RNS&companySymbol=LSE";
$agent= 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.0.3705; .NET CLR 1.1.4322)';

$ch = curl_init();
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_VERBOSE, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, $agent);
curl_setopt($ch, CURLOPT_URL,$url);
$xml = curl_exec($ch);
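
The snippet above does not check whether the request actually succeeded. A minimal sketch of how that check might look follows; the helper name `fetchFeed` and the timeout values are assumptions made for illustration and are not part of the original answer.

```php
<?php
/**
 * Hypothetical helper (not part of the answer above): fetches a URL with a
 * custom User-Agent and returns the response body, or null on any failure.
 */
function fetchFeed(string $url, string $agent): ?string
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_USERAGENT, $agent);
    // Assumed timeouts, so a dead host does not block the script indefinitely.
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);

    $body = curl_exec($ch);
    if ($body === false) {
        // Transport-level failure: DNS error, refused connection, timeout, etc.
        return null;
    }

    // Treat any non-2xx status (for example a 403 response) as a failure.
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    if ($status < 200 || $status >= 300) {
        return null;
    }

    return $body;
}
```

Calling `fetchFeed($url, $agent)` with the values from the answer would then return either the XML string, ready to be passed to `DOMDocument::loadXML`, or `null` if the request was rejected or failed.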

Answer 1: (score: 1)

Downloading the remote document

The problem is that you are trying to download the remote document with DOMDocument::load. The method is capable of downloading remote files, but it does not set the User-Agent HTTP header unless one is specified via the user_agent INI setting. Some hosts are configured to reject HTTP requests that lack a User-Agent header, and the URL you pasted into the question returns 403 Forbidden if the header is missing.

So you should either set the user agent via the INI setting:

ini_set('user_agent', 'MyCrawler/1.0');
$url = 'http://www.londonstockexchange.com/exchange/CompanyNewsRSS.html?newsSource=RNS&companySymbol=LSE';
$doc = new DOMDocument();
$doc->load($url);

or download the document manually with the User-Agent header set, e.g.:

$url = 'http://www.londonstockexchange.com/exchange/CompanyNewsRSS.html?newsSource=RNS&companySymbol=LSE';
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_USERAGENT, 'MyCrawler/1.0');
$xml = curl_exec($ch);

$doc = new DOMDocument();
$doc->loadXML($xml);

Traversing the DOM

The next problem with your code is that you rely entirely on a specific DOM structure:

for ($i=0; $i<=5; $i++) {
  $title=$x->item($i)->getElementsByTagName('title')
    ->item(0)->childNodes->item(0)->nodeValue;

There are many cases in which your code will not work as expected: fewer than 5 items, missing elements, an empty document, and so on. Besides, the code is not very readable. You should always check that a node exists before going deeper into its structure, e.g.:

$channels = $doc->getElementsByTagName('channel');
foreach ($channels as $channel) {
  // Print channel properties
  foreach ($channel->childNodes as $child) {
    if ($child->nodeType !== XML_ELEMENT_NODE) {
      continue;
    }
    switch ($child->nodeName) {
      case 'title':
        echo "Title: ", $child->nodeValue, PHP_EOL;
        break;
      case 'description':
        echo "Description: ", $child->nodeValue, PHP_EOL;
        break;
    }
  }
}

You can parse the item elements in a similar way.
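
The answer's own code for the item elements did not survive in this copy, so the following is only a sketch of what such a traversal might look like, not the author's original. The inline sample feed and the helper name `firstText` are assumptions made so the example runs without network access.

```php
<?php
// Hypothetical sample feed, embedded so the sketch runs offline.
$xml = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Sample feed</title>
    <item>
      <title>First headline</title>
      <link>http://example.com/1</link>
    </item>
    <item>
      <title>No link here</title>
    </item>
  </channel>
</rss>
XML;

/**
 * Hypothetical helper: returns the trimmed text of the first descendant
 * element with the given tag name, or null if no such element exists.
 */
function firstText(DOMElement $parent, string $tag): ?string
{
    $nodes = $parent->getElementsByTagName($tag);
    return $nodes->length > 0 ? trim($nodes->item(0)->textContent) : null;
}

$doc = new DOMDocument();
$items = [];
if ($doc->loadXML($xml)) {
    // foreach visits only the items that exist, so no fixed count is assumed.
    foreach ($doc->getElementsByTagName('item') as $item) {
        $items[] = [
            'title' => firstText($item, 'title'),
            'link'  => firstText($item, 'link'),
        ];
    }
}

foreach ($items as $entry) {
    echo $entry['title'] ?? '(no title)', ' ', $entry['link'] ?? '(no link)', PHP_EOL;
}
```

Iterating the node list with `foreach` and checking `length` before calling `item(0)` sidesteps the two failure modes mentioned above: feeds with fewer entries than expected, and entries that lack a `title` or `link` element.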