Extract all URLs from sitemap.xml using PHP cURL

Time: 2016-06-23 00:20:05

Tags: php loops curl sitemap

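I want to extract all URLs from a sitemap.xml using PHP and cURL. My code works with a content sitemap (for example: http://www.phanmemtoday.com/sitemap.xml?page=1) but not with a sitemap index (for example: http://www.phanmemtoday.com/sitemap.xml). Please help. Thanks!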

<?php
$sUrl="http://domain.com/sitemap.xml";

$aXmlLinks = array($sUrl);
$aOtherLinks = array();
while (!empty($aXmlLinks)) {
    foreach ($aXmlLinks as $i =>$sTmpUrl){
        unset($aXmlLinks[$i]);
        $aTmp = getlinkfromxmlsitemap($sTmpUrl);
        echo "Array temp link:<br>";
        print_r($aTmp);
        foreach ($aTmp as $sTmpUrl2) {
            if (strpos($sTmpUrl2, '.xml') !== false) {
                array_push($aXmlLinks,$sTmpUrl2);
            } else {
                array_push($aOtherLinks,$sTmpUrl2);
            }
        }
    }
    echo "<br>Array xml link:<br>";
    print_r($aXmlLinks);
    echo "<br>Array product link:<br>";
    print_r($aOtherLinks);
    echo "<br>-----------------------------------------<br>";
}
print_r($aOtherLinks);


function getlinkfromxmlsitemap($sUrl) {
    // echo "Get link from: $sUrl<br>";
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch,CURLOPT_USERAGENT, "Mozilla/5.0 (Windows NT 6.3; WOW64; rv:47.0) Gecko/20100101 Firefox/47.0");
    curl_setopt($ch, CURLOPT_URL, $sUrl);
    $data = curl_exec($ch);
    curl_close($ch);
    $links = array();
    $count = preg_match_all('@<loc>(.+?)<\/loc>@', $data, $matches);
    for ($i = 0; $i < $count; ++$i) {
        $links[] = $matches[0][$i];
    }
    return $links;  
}
?>

1 Answer:

Answer 0 (score: 0)

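Your code works fine, but there are a few things you could improve. See the following example, which returns a nested array containing the links you are looking for: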

<?php
$sUrl1="http://www.phanmemtoday.com/sitemap.xml?page=1";
$sUrl2="http://www.phanmemtoday.com/sitemap.xml";

$aXmlLinks = array($sUrl1,$sUrl2);
$aOtherLinks = array();
while (!empty($aXmlLinks)) {
    foreach ($aXmlLinks as $i =>$sTmpUrl){
        unset($aXmlLinks[$i]);
        $aTmp = getlinkfromxmlsitemap($sTmpUrl);        
        array_push($aOtherLinks,$aTmp);        
    }
    echo "<br>Array xml link:<br>";
    print_r($aXmlLinks);
    echo "<br>Array product link:<br>";
    print_r($aOtherLinks);
    echo "<br>-----------------------------------------<br>";
}
print_r($aOtherLinks);


function getlinkfromxmlsitemap($sUrl) {
    // echo "Get link from: $sUrl<br>";
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch,CURLOPT_USERAGENT, "Mozilla/5.0 (Windows NT 6.3; WOW64; rv:47.0) Gecko/20100101 Firefox/47.0");
    curl_setopt($ch, CURLOPT_URL, $sUrl);
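    $data = curl_exec($ch);
    $error = curl_error($ch);
    curl_close($ch);
    $links = array();
    if ($data === false) {
        // The request failed: report why and return no links.
        echo "cURL error for $sUrl: $error<br>";
        return $links;
    }
    // $matches[0] holds the full match including the <loc> tags;
    // $matches[1] holds only the captured URL, which is what we want.
    $count = preg_match_all('@<loc>(.+?)<\/loc>@', $data, $matches);
    for ($i = 0; $i < $count; ++$i) {
        $links[] = $matches[1][$i];
    }
    return $links;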
}
?>
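
For reference, here is a minimal sketch of how a sitemap index could be followed down to the page URLs. It is not taken from the thread. With `preg_match_all`, `$matches[0]` holds each full match including the `<loc>` tags and `$matches[1]` holds only the captured URL; the question's code reads `$matches[0]`, so a likely (but unconfirmed) reason the index is not followed is that the sub-sitemap URLs are still wrapped in tags when they are passed back to cURL. The sketch also assumes an index file can be recognised by its `<sitemapindex>` root element rather than by `.xml` in the URL. The helper names `extractLocs`, `fetchWithCurl` and `collectPageUrls` are hypothetical, and the fetch step is passed in as a callback so the crawl logic can be tried without network access.

```php
<?php
// Sketch only: all function names below are hypothetical helpers.

/** Return the text inside every <loc>...</loc> element of a sitemap document. */
function extractLocs(string $xml): array
{
    // $m[0] would hold the full matches with tags; $m[1] holds only the URLs.
    preg_match_all('@<loc>\s*(.+?)\s*</loc>@s', $xml, $m);
    // Sitemaps escape "&" as "&amp;", so decode entities before reusing a URL.
    return array_map('html_entity_decode', $m[1]);
}

/** Download one URL with cURL; returns the body, or null if the request failed. */
function fetchWithCurl(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow HTTP redirects
        CURLOPT_TIMEOUT        => 30,   // assumed timeout in seconds
    ]);
    $body  = curl_exec($ch);
    $error = curl_error($ch);
    if ($body === false) {
        error_log("cURL error for $url: $error");
        return null;
    }
    return $body;
}

/**
 * Start at one sitemap URL, follow sitemap indexes, and return the page URLs.
 * $fetch is any callable that takes a URL and returns the XML string or null.
 */
function collectPageUrls(string $startUrl, callable $fetch): array
{
    $queue = [$startUrl];          // sitemap files still to download
    $seen  = [$startUrl => true];  // sitemap URLs already queued (avoids loops)
    $pages = [];                   // page URLs found so far

    while ($queue) {
        $sitemapUrl = array_shift($queue);
        $xml = $fetch($sitemapUrl);
        if (!is_string($xml)) {
            continue; // download failed, skip this sitemap
        }
        // Assumption: an index file has a <sitemapindex> root element.
        $isIndex = stripos($xml, '<sitemapindex') !== false;
        foreach (extractLocs($xml) as $loc) {
            if (!$isIndex) {
                $pages[] = $loc;           // ordinary sitemap: a page URL
            } elseif (!isset($seen[$loc])) {
                $seen[$loc] = true;        // index: another sitemap to crawl
                $queue[] = $loc;
            }
        }
    }
    // Drop duplicates that appear in more than one sitemap.
    return array_values(array_unique($pages));
}

// Possible usage (performs real HTTP requests, so it is left commented out):
// print_r(collectPageUrls('http://www.phanmemtoday.com/sitemap.xml', 'fetchWithCurl'));
```

Unlike the nested array above, this returns one flat, de-duplicated list of page URLs, which may or may not be what you need.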