Parsing HTML tags inside XML in PHP

Date: 2013-07-09 14:36:07

Tags: php xml-parsing simplexml

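I'm trying to build my own RSS feed reader (for learning purposes) from http://uk.news.yahoo.com/rss, parsing it in PHP with simplexml_load_string. I'm stuck on reading the HTML tags inside the <description> tag.

My code so far looks like this: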

$feed = file_get_contents('http://uk.news.yahoo.com/rss');
$rss = simplexml_load_string($feed);

//for each element in the feed
foreach ($rss->channel->item as $item) {
    echo '<h3>'. $item->title . '</h3>'; 

        foreach($item->description as $desc){

             //how to read the href from the a tag???

             //this does not work at all
             $tags = $item->xpath('//a');
             foreach ($tags as $tag) {
                 echo $tag['href'];
             }
       }
}

Any ideas on how to extract each HTML tag?

Thanks

3 answers:

Answer 0 (score: 3)

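The description content has its special characters encoded, so it is not treated as nodes in the XML but as a plain string. You can decode the special characters, then load the HTML into a DOMDocument and do whatever you want with it. For example: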

foreach ($rss->channel->item as $item) {
    echo '<h3>'. $item->title . '</h3>'; 

        foreach($item->description as $desc){

            $dom = new DOMDocument();
            $dom->loadHTML(htmlspecialchars_decode((string)$desc));

            $anchors = $dom->getElementsByTagName('a');
            // item(0) is null when the description contains no <a> element
            if ($anchors->length > 0) {
                echo $anchors->item(0)->getAttribute('href');
            }
        }
}

XPath is also available for DOMDocument; see DOMXPath.
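A minimal sketch of the DOMXPath route, assuming the description HTML has already been obtained as a string. The `$html` sample below is hypothetical and only stands in for one item's description; the real feed's markup may differ.

```php
<?php
// Hypothetical sample standing in for the HTML of one <description> value.
$html = '<p><a href="http://example.com/story.html">'
      . '<img src="http://example.com/thumb.jpg" alt=""></a>Summary text.</p>';

// Collect libxml warnings internally instead of printing them,
// since real-world HTML fragments are often not perfectly valid.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

$hrefs = [];
// "//a/@href" selects the href attribute of every <a> in the parsed fragment.
foreach ($xpath->query('//a/@href') as $attr) {
    $hrefs[] = $attr->nodeValue;
}

echo implode("\n", $hrefs), "\n";
```

Querying the attribute nodes directly avoids calling getAttribute() on a list item that may not exist, so descriptions without a link simply produce an empty result.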

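Answer 1 (score: 1)

The <description> element of the RSS feed contains HTML. As outlined in "How to parse CDATA HTML-content of XML using SimpleXML?", you need to obtain the node value of that element (the HTML) and parse it with an additional parser.

The accepted answer to the linked question already shows this in considerable detail. For SimpleXML, it makes no difference here whether the RSS feed uses CDATA or just entities, as in your case.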

$feed = file_get_contents('http://uk.news.yahoo.com/rss');
$rss  = simplexml_load_string($feed);
$dom  = new DOMDocument(); // the HTML parser used for descriptions' HTML

foreach ($rss->channel->item as $item)
{
    echo '<h3>' . $item->title . '</h3>', "\n";

    foreach ($item->description as $desc)
    {
        $dom->loadHTML($desc);

        $html = simplexml_import_dom($dom)->body;

        echo $html->p->a['href'], "\n";
    }
}

Example output:

...
<h3>Chantal nears hurricane strength in Caribbean</h3>
http://uk.news.yahoo.com/chantal-nears-hurricane-strength-caribbean-220149771.html
<h3>Placido Domingo In Hospital With Blood Clot</h3>
http://uk.news.yahoo.com/placido-domingo-hospital-blood-clot-215427742.html
<h3>Berlusconi's final tax fraud appeal hearing set for July 30</h3>
http://uk.news.yahoo.com/berlusconis-final-tax-fraud-appeal-hearing-set-july-214714122.html
<h3>China: Men Rescued From River Amid Floods</h3>
http://uk.news.yahoo.com/china-men-rescued-river-amid-floods-213005159.html
<h3>Snowden has not yet accepted asylum in Venezuela - WikiLeaks</h3>
http://uk.news.yahoo.com/snowden-not-yet-accepted-asylum-venezuela-wikileaks-190332291.html
<h3>Three US kidnap victims break silence</h3>
http://uk.news.yahoo.com/three-us-kidnap-victims-release-thankyou-video-093832611.html
...

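Hope this helps. Contrary to the accepted answer, I see no reason to apply htmlspecialchars_decode; in fact, I'm fairly sure it breaks things. My example also shows how to keep SimpleXML-style access to further child nodes by converting the DOMNode back into a SimpleXMLElement after parsing the HTML.

Answer 2 (score: 0)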
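The point about htmlspecialchars_decode can be illustrated with a small sketch. Casting a SimpleXMLElement to a string already returns the text with its XML entities decoded, so calling htmlspecialchars_decode on top of that decodes a second time. The `$xml` sample and its URL are hypothetical, chosen so that the link contains an ampersand.

```php
<?php
// Hypothetical item whose description holds entity-encoded HTML,
// including an ampersand inside the link's query string.
$xml = '<item><description>'
     . '&lt;p&gt;&lt;a href="http://example.com/?a=1&amp;amp;b=2"&gt;AT&amp;amp;T&lt;/a&gt;&lt;/p&gt;'
     . '</description></item>';

$item = simplexml_load_string($xml);

// The string cast decodes the XML entities once, yielding valid HTML.
$once = (string) $item->description;

// A second decode also turns the HTML-level &amp; into a bare &.
$twice = htmlspecialchars_decode($once);

echo $once, "\n";
echo $twice, "\n";
```

After the single decode the ampersands are still escaped as `&amp;`, which is what an HTML parser expects. After the second decode they are bare `&` characters, which is the kind of breakage this answer warns about.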

The best thing to do here is to call var_dump() on $item.

$feed = file_get_contents('http://uk.news.yahoo.com/rss');
$rss = simplexml_load_string($feed);
foreach ($rss->channel->item as $item) {
    var_dump($item);
    exit;
}

Once you do that, you'll see that the value you're after is called "link". So, to print the URL, you would use the following code:

echo $item->link;
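As a rough sketch of the same inspection without a network request, the child elements of an item can also be listed by name. The one-item feed below is hypothetical and only mirrors the usual shape of an RSS 2.0 item.

```php
<?php
// Hypothetical one-item feed; element names follow a typical RSS 2.0 item.
$feed = <<<XML
<rss version="2.0"><channel>
  <item>
    <title>Sample headline</title>
    <link>http://example.com/sample-headline.html</link>
    <description>&lt;p&gt;Summary&lt;/p&gt;</description>
  </item>
</channel></rss>
XML;

$rss  = simplexml_load_string($feed);
$item = $rss->channel->item[0];

// List the names of the item's child elements, a compact
// alternative to reading the full var_dump() output.
$names = [];
foreach ($item->children() as $child) {
    $names[] = $child->getName();
}
echo implode(', ', $names), "\n";

// The article URL is the text content of the <link> child.
$link = (string) $item->link;
echo $link, "\n";
```

This yields the article URL from the feed's own <link> element, which is enough when only the story link is needed and not the other markup inside the description.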