Using simplexml_load_string in PHP, I am trying to build my own RSS feed from http://uk.news.yahoo.com/rss (for learning purposes). I am stuck on reading the HTML tags inside the <description> tag.
My code so far looks like this:
$feed = file_get_contents('http://uk.news.yahoo.com/rss');
$rss = simplexml_load_string($feed);
//for each element in the feed
foreach ($rss->channel->item as $item) {
    echo '<h3>'. $item->title . '</h3>';
    foreach($item->description as $desc){
        //how to read the href from the a tag???
        //this does not work at all
        $tags = $item->xpath('//a');
        foreach ($tags as $tag) {
            echo $tag['href'];
        }
    }
}
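Any ideas on how to extract each HTML tag?
Thanks
Answer 0 (score: 3)
The description content has its special characters encoded, so it is not treated as nodes in the XML but only as a string. You can decode the special characters, then load the HTML into a DOMDocument and do whatever you want with it. For example: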
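foreach ($rss->channel->item as $item) {
    echo '<h3>' . $item->title . '</h3>';
    foreach ($item->description as $desc) {
        // Parse the description's HTML with a separate HTML parser
        $dom = new DOMDocument();
        $dom->loadHTML(htmlspecialchars_decode((string) $desc));
        $anchors = $dom->getElementsByTagName('a');
        // Not every description is guaranteed to contain a link
        if ($anchors->length > 0) {
            echo $anchors->item(0)->getAttribute('href');
        }
    }
}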
XPath can also be used with DOMDocument; see DOMXPath.
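As a minimal sketch of the DOMXPath route, the following collects every href from a description's HTML. The HTML fragment and the example.com URLs are made-up stand-ins for one decoded <description> value, not data from the actual feed.

```php
<?php
// Hypothetical sample: the HTML of a single <description> element.
$html = '<p><a href="http://example.com/story.html">'
      . '<img src="http://example.com/thumb.jpg" alt=""></a>Some summary text.</p>';

$dom = new DOMDocument();
// Collect parser warnings internally instead of printing them for imperfect HTML.
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

$hrefs = [];
// Select the href attribute node of every <a> element that has one.
foreach ($xpath->query('//a/@href') as $attr) {
    $hrefs[] = $attr->nodeValue;
}

echo implode("\n", $hrefs), "\n";
```

Here the XPath expression runs against the parsed HTML document, which is why '//a' finds the anchors, whereas the question's $item->xpath('//a') searched the RSS XML, where no <a> elements exist.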
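Answer 1 (score: 1)
The <description> element of the RSS feed contains HTML. As outlined in "How to parse CDATA HTML-content of XML using SimpleXML?", you need to obtain the node value of that element (the HTML) and parse it with an additional parser.
The accepted answer to the linked question already shows this in great detail. For SimpleXML it makes no difference here whether the RSS feed uses CDATA or just entities, as in your case.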
$feed = file_get_contents('http://uk.news.yahoo.com/rss');
$rss = simplexml_load_string($feed);
$dom = new DOMDocument(); // the HTML parser used for descriptions' HTML
foreach ($rss->channel->item as $item)
{
    echo '<h3>' . $item->title . '</h3>', "\n";
    foreach ($item->description as $desc)
    {
        // cast to string to get the description's HTML as plain text
        $dom->loadHTML((string) $desc);
        // convert the parsed HTML back into a SimpleXMLElement
        $html = simplexml_import_dom($dom)->body;
        echo $html->p->a['href'], "\n";
    }
}
Example output:
...
<h3>Chantal nears hurricane strength in Caribbean</h3>
http://uk.news.yahoo.com/chantal-nears-hurricane-strength-caribbean-220149771.html
<h3>Placido Domingo In Hospital With Blood Clot</h3>
http://uk.news.yahoo.com/placido-domingo-hospital-blood-clot-215427742.html
<h3>Berlusconi's final tax fraud appeal hearing set for July 30</h3>
http://uk.news.yahoo.com/berlusconis-final-tax-fraud-appeal-hearing-set-july-214714122.html
<h3>China: Men Rescued From River Amid Floods</h3>
http://uk.news.yahoo.com/china-men-rescued-river-amid-floods-213005159.html
<h3>Snowden has not yet accepted asylum in Venezuela - WikiLeaks</h3>
http://uk.news.yahoo.com/snowden-not-yet-accepted-asylum-venezuela-wikileaks-190332291.html
<h3>Three US kidnap victims break silence</h3>
http://uk.news.yahoo.com/three-us-kidnap-victims-release-thankyou-video-093832611.html
...
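Hope this helps. Contrary to the accepted answer, I see no reason to apply htmlspecialchars_decode; in fact, I am pretty sure it breaks things. In addition, my example shows how to keep using SimpleXML-style access to further child nodes, by converting the DOMNode back into a SimpleXMLElement after the HTML has been parsed.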
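The point about entities versus CDATA can be checked with a small sketch. The two inline feeds below are invented for illustration (they are not taken from the Yahoo feed) and carry the same description, once entity-encoded and once wrapped in CDATA.

```php
<?php
// Hypothetical feeds: identical description HTML, encoded in two different ways.
$withEntities = '<rss><channel><item><description>'
    . '&lt;p&gt;&lt;a href="http://example.com/a.html"&gt;A&lt;/a&gt;&lt;/p&gt;'
    . '</description></item></channel></rss>';
$withCdata = '<rss><channel><item><description>'
    . '<![CDATA[<p><a href="http://example.com/a.html">A</a></p>]]>'
    . '</description></item></channel></rss>';

// Casting the element to string yields its text content, already decoded by the XML parser.
$fromEntities = (string) simplexml_load_string($withEntities)->channel->item->description;
$fromCdata    = (string) simplexml_load_string($withCdata)->channel->item->description;

var_dump($fromEntities === $fromCdata);
echo $fromEntities, "\n";

// An extra decoding pass alters HTML that was meant to stay escaped.
$decodedAgain = htmlspecialchars_decode('<p>1 &lt; 2</p>');
echo $decodedAgain, "\n";
```

Because the XML parser has already resolved the entities, a second pass with htmlspecialchars_decode only touches escapes that belong to the HTML itself, which is one plausible way it could break things.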
Answer 2 (score: 0)
The best thing to do here is to use the var_dump() function on $item.
$feed = file_get_contents('http://uk.news.yahoo.com/rss');
$rss = simplexml_load_string($feed);
foreach ($rss->channel->item as $item) {
    // dump the first item only, then stop
    var_dump($item);
    exit;
}
Once you do that, you will see that the value you are after is called "link". So, to print out the URL, you would use the following code:
echo $item->link;
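For completeness, here is a small self-contained sketch of reading <link> for every item. The inline feed and its example.com URLs are made up so the snippet runs without network access; the real feed would be loaded with file_get_contents as above.

```php
<?php
// Hypothetical two-item feed standing in for the downloaded RSS document.
$feed = '<rss version="2.0"><channel><title>Sample</title>'
      . '<item><title>First story</title><link>http://example.com/first.html</link></item>'
      . '<item><title>Second story</title><link>http://example.com/second.html</link></item>'
      . '</channel></rss>';

$rss = simplexml_load_string($feed);

$links = [];
foreach ($rss->channel->item as $item) {
    // Cast to string so a plain URL is stored rather than a SimpleXMLElement.
    $links[] = (string) $item->link;
}

echo implode("\n", $links), "\n";
```

This gives the article URL of each item, which may or may not match the href inside the description's HTML, depending on how the feed is built.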