I want to parse news titles and links from the following RSS page:
http://www.londonstockexchange.com/exchange/CompanyNewsRSS.html?newsSource=RNS&companySymbol=LSE
I tried the following code, but it does not work:
<?php
$xml=("http://www.londonstockexchange.com/exchange/CompanyNewsRSS.html?newsSource=RNS&companySymbol=LSE");
$xmlDoc = new DOMDocument();
$xmlDoc->load($xml);
$x=$xmlDoc->getElementsByTagName('item');
for ($i=0; $i<=5; $i++) {
$title=$x->item($i)->getElementsByTagName('title')
->item(0)->childNodes->item(0)->nodeValue;
$link=$x->item($i)->getElementsByTagName('link')
->item(0)->childNodes->item(0)->nodeValue;
echo $title;
echo $link;
}
?>
However, the same code does work for fetching titles and links from other RSS pages, for example:
<?php
$xml=("https://feeds.finance.yahoo.com/rss/2.0/headline?s=bcm.v&region=US&lang=en-US");
$xmlDoc = new DOMDocument();
$xmlDoc->load($xml);
$x=$xmlDoc->getElementsByTagName('item');
for ($i=0; $i<=5; $i++) {
$title=$x->item($i)->getElementsByTagName('title')
->item(0)->childNodes->item(0)->nodeValue;
$link=$x->item($i)->getElementsByTagName('link')
->item(0)->childNodes->item(0)->nodeValue;
echo $title;
echo $link;
}
?>
Do you have any idea how to make it work?
Thanks in advance!
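Answer 0 (score: 1)
The host has a security check that rejects requests when no user agent is set, so you have to use curl and fake a user agent to fetch the XML content, for example: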
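// URL of the feed from the question
$url = "http://www.londonstockexchange.com/exchange/CompanyNewsRSS.html?newsSource=RNS&companySymbol=LSE";
// Browser-like User-Agent string; the host rejects requests without one
$agent= 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.0.3705; .NET CLR 1.1.4322)';
$ch = curl_init();
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false); // disables TLS certificate checks; no effect on this plain-http URL
curl_setopt($ch, CURLOPT_VERBOSE, true); // write transfer details to STDERR
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
curl_setopt($ch, CURLOPT_USERAGENT, $agent);
curl_setopt($ch, CURLOPT_URL,$url);
// $xml holds the response body as a string, or false on failure
$xml = curl_exec($ch);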
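The snippet above stops once $xml holds the response body. As a minimal sketch of the remaining step, the body can be handed to DOMDocument::loadXML() with libxml's internal error handling switched on, so that a failed download or a malformed feed yields null rather than PHP warnings. The helper name parseFeed and the inline sample feed are assumptions made for illustration; they are not part of the answer.

```php
<?php
// Hypothetical helper: turn a downloaded feed body into a DOMDocument,
// or return null when the download failed or the XML is malformed.
function parseFeed($xml): ?DOMDocument
{
    // curl_exec() returns false on failure; an empty body cannot be parsed either.
    if (!is_string($xml) || trim($xml) === '') {
        return null;
    }

    // Collect parse errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $ok = $doc->loadXML($xml);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $ok ? $doc : null;
}

// Stand-in for the string returned by curl_exec($ch).
$sample = '<rss version="2.0"><channel><title>Demo</title>'
    . '<item><title>First</title><link>http://example.com/1</link></item>'
    . '</channel></rss>';

$doc = parseFeed($sample);
if ($doc === null) {
    echo "Could not parse feed", PHP_EOL;
} else {
    echo "Items found: ", $doc->getElementsByTagName('item')->length, PHP_EOL;
}
```

In real use the argument would be the $xml value from curl_exec() instead of the sample string.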
Answer 1 (score: 1)
The problem is that you are trying to download the remote document with DOMDocument::load. That method can download remote files, but it does not send a User-Agent HTTP header unless one is specified through the user_agent INI setting. Some hosts are configured to reject HTTP requests that lack a User-Agent header, and the URL you pasted into the question returns 403 Forbidden when the header is missing.
So you should either set the user agent through the INI setting:
ini_set('user_agent', 'MyCrawler/1.0');
$url = 'http://www.londonstockexchange.com/exchange/CompanyNewsRSS.html?newsSource=RNS&companySymbol=LSE';
$doc = new DOMDocument();
$doc->load($url);
or download the document manually with the User-Agent header set, for example:
$url = 'http://www.londonstockexchange.com/exchange/CompanyNewsRSS.html?newsSource=RNS&companySymbol=LSE';
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_USERAGENT, 'MyCrawler/1.0');
$xml = curl_exec($ch);
$doc = new DOMDocument();
$doc->loadXML($xml);
The next problem with your code is that it relies entirely on a specific DOM structure:
for ($i=0; $i<=5; $i++) {
$title=$x->item($i)->getElementsByTagName('title')
->item(0)->childNodes->item(0)->nodeValue;
There are many cases in which your code will not work as expected: fewer than six items (the loop reads indexes 0 through 5), missing elements, an empty document, and so on. The code is also not very readable. You should always check that a node exists before descending into its structure, for example:
$channels = $doc->getElementsByTagName('channel');
foreach ($channels as $channel) {
    // Print channel properties
    foreach ($channel->childNodes as $child) {
        if ($child->nodeType !== XML_ELEMENT_NODE) {
            continue;
        }
        switch ($child->nodeName) {
            case 'title':
                echo "Title: ", $child->nodeValue, PHP_EOL;
                break;
            case 'description':
                echo "Description: ", $child->nodeValue, PHP_EOL;
                break;
        }
    }
}
You can parse the item elements in a similar way.
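As a rough sketch of what parsing the item elements along the same lines could look like, the example below walks each item, reads its title and link only if they exist, and skips incomplete items instead of failing on a missing node. The helper names childText and collectItems, the limit of five, and the inline sample feed are assumptions for illustration, not something the answer prescribes.

```php
<?php
// Hypothetical helper: return the text of the first child element with the
// given name, or null when no such element exists.
function childText(DOMElement $parent, string $name): ?string
{
    foreach ($parent->childNodes as $child) {
        if ($child->nodeType === XML_ELEMENT_NODE && $child->nodeName === $name) {
            return trim($child->nodeValue);
        }
    }
    return null;
}

// Hypothetical helper: collect up to $limit items as title/link pairs.
function collectItems(DOMDocument $doc, int $limit = 5): array
{
    $result = [];
    foreach ($doc->getElementsByTagName('item') as $item) {
        if (count($result) >= $limit) {
            break;
        }
        $title = childText($item, 'title');
        $link = childText($item, 'link');
        // Skip items that lack either element instead of dereferencing null.
        if ($title === null || $link === null) {
            continue;
        }
        $result[] = ['title' => $title, 'link' => $link];
    }
    return $result;
}

// Inline sample feed (assumed for illustration); the second item has no link.
$doc = new DOMDocument();
$doc->loadXML(
    '<rss version="2.0"><channel><title>Demo</title>'
    . '<item><title>First</title><link>http://example.com/1</link></item>'
    . '<item><title>No link</title></item>'
    . '<item><title>Third</title><link>http://example.com/3</link></item>'
    . '</channel></rss>'
);

foreach (collectItems($doc) as $entry) {
    echo $entry['title'], ' - ', $entry['link'], PHP_EOL;
}
```

Iterating the node list with foreach rather than a fixed index range means a feed with fewer items than expected simply produces a shorter result.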