How to parse HTML inside an XML tag

Time: 2017-08-14 07:24:08

Tags: php parsing rss domdocument

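I need help getting data out of the description tag, which contains <a>, <img> and some text. The XML I am trying to parse is this

I managed to get all the data I need except from the description tag, where I get the <a> tag along with the description text. What I need is the src of the img and the description text.

My code: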

$feed = array();

foreach ($rss->getElementsByTagName('item') as $node) {
    // Prefer the url attribute of a <content> element when the item has one.
    $nodes = $node->getElementsByTagName('content');

    if ($nodes->length > 0) {
        $linkthumb = $nodes->item(0)->getAttribute('url');
    } else {
        // Otherwise fall back to an <image> element: its text value first,
        // then its src attribute.
        $linkthumbNode = $node->getElementsByTagName('image');

        if ($linkthumbNode->length > 0) {
            $linkthumb = trim($linkthumbNode->item(0)->nodeValue);

            if ($linkthumb === '') {
                $linkthumb = $linkthumbNode->item(0)->getAttribute('src');
            }
        } else {
            $linkthumb = "NO IMAGE";
        }
    }

    $title = $node->getElementsByTagName('title')->item(0)->nodeValue;
    // This is where the <a> markup comes back together with the text.
    $desc = $node->getElementsByTagName('description')->item(0)->textContent;
    $link = $node->getElementsByTagName('link')->item(0)->nodeValue;
    $img = $linkthumb;

    $date = $node->getElementsByTagName('pubDate');
    if ($date->length > 0) {
        $date = $date->item(0)->nodeValue;
    } else {
        $date = "no date provided";
    }

    $item = array(
        'title' => $title,
        'desc' => $desc,
        'link' => $link,
        'img' => $img,
        'date' => $date,
    );
    array_push($feed, $item);
}

The XML description tag is:

<description>
<a href="http://timesofindia.indiatimes.com/life-style/health-fitness/diet/9-food-combos-to-make-you-lean/articleshow/20984744.cms"><img border="0" hspace="10" align="left" style="margin-top:3px;margin-right:5px;" src="http://timesofindia.indiatimes.com/photo/20984744.cms" /></a>Nine food combinations that will make staying healthy and looking fit easier
</description>

What I need is: http://timesofindia.indiatimes.com/photo/20984744.cms as the image, and Nine food combinations that will make staying healthy and looking fit easier as my description.

Can anyone help me? I'm not that great at PHP and parsing XML.

1 answer:

Answer 0: (score: 0)

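Maybe I'm late, but if you still need an answer, take a look at my solution. I use PHP DOMDocument together with regular expressions, because I haven't found a simple way to get the needed data using only the XML extension.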

$rss = file_get_contents('https://timesofindia.indiatimes.com/rssfeeds/2886704.cms');
$feed = new DOMDocument();
$feed->loadXML($rss);

$items = array();

foreach ($feed->getElementsByTagName('item') as $item) {
    $arr = array();
    foreach ($item->childNodes as $child) {
        if ($child->nodeName === 'title' || $child->nodeName === 'link') {
            $arr[$child->nodeName] = $child->nodeValue;
        }
        if ($child->nodeName === 'pubDate') {
            $arr['date'] = $child->nodeValue;
        }
        if ($child->nodeName === 'description') {
            // Capture the src value up to its own closing quote, so the match
            // cannot run on into a later attribute.
            if (preg_match('/src=[\'"]([^\'"]+)[\'"]/i', $child->nodeValue, $matches)) {
                $arr['img'] = $matches[1];
            }
            // The description text is whatever follows the last ">".
            if (preg_match('/[^>]+$/', $child->nodeValue, $matches)) {
                $arr['desc'] = trim($matches[0]);
            }
        }
    }
    array_push($items, $arr);
}
print_r($items);

The output looks like this, which seems to be what you need:

Array ( [0] => Array ( [title] => 5 reasons you get sore after sex [img] => https://timesofindia.indiatimes.com/photo/61101815.cms [desc] => Sometimes, a super-filmy, almost-perfect sex leaves you all euphoric but only to end with soreness later. So, what is it that is going wrong? Can it be remedied? [link] => https://timesofindia.indiatimes.com/life-style/health-fitness/health-news/5-reasons-you-get-sore-after-sex/life-style/health-fitness/health-news/5-reasons-you-get-sore-after-sex/photostory/61101724.cms [date] => Mon, 16 Oct 2017 10:21:27 GMT )...
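
As an alternative to the regular expressions, the description string can itself be parsed as HTML with a second DOMDocument. The following is only a minimal sketch, assuming the description value is an HTML fragment like the one in the question; `parseDescription` is a hypothetical helper name that does not appear in the original answer, and the prepended meta tag is just one way to hint UTF-8 to the HTML parser.

```php
<?php
// Hypothetical helper (not from the original answer): splits the HTML found in
// a feed item's <description> into the first image URL and the plain text.
function parseDescription(string $html): array
{
    $doc = new DOMDocument();

    // Feed markup is often sloppy, so collect parser warnings internally
    // instead of emitting them, then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    // The meta tag tells the HTML parser that the fragment is UTF-8.
    $doc->loadHTML(
        '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $html
    );
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // First <img> in the fragment, if any.
    $img = $doc->getElementsByTagName('img')->item(0);
    // The parser wraps the fragment in <html><body>; the body text is the
    // description without any tags.
    $body = $doc->getElementsByTagName('body')->item(0);

    return array(
        'img'  => $img !== null ? $img->getAttribute('src') : null,
        'desc' => $body !== null ? trim($body->textContent) : '',
    );
}
```

Inside the loop over the item's child nodes, this could be used as `$parts = parseDescription($child->nodeValue);` and the two values copied into `$arr['img']` and `$arr['desc']`. Letting the HTML parser do the splitting avoids depending on attribute order or quote style, at the cost of one extra parse per item.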