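I want to extract all URLs from sitemap.xml using PHP and cURL. My code works with a content sitemap (for example: http://www.phanmemtoday.com/sitemap.xml?page=1) but not with a sitemap index (for example: http://www.phanmemtoday.com/sitemap.xml). Please help me. Thanks!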
<?php
$sUrl = "http://domain.com/sitemap.xml";
$aXmlLinks = array($sUrl);
$aOtherLinks = array();
while (!empty($aXmlLinks)) {
    foreach ($aXmlLinks as $i => $sTmpUrl) {
        unset($aXmlLinks[$i]);
        $aTmp = getlinkfromxmlsitemap($sTmpUrl);
        echo "Array temp link:<br>";
        print_r($aTmp);
        foreach ($aTmp as $sTmpUrl2) {
            if (strpos($sTmpUrl2, '.xml') !== false) {
                array_push($aXmlLinks, $sTmpUrl2);
            } else {
                array_push($aOtherLinks, $sTmpUrl2);
            }
        }
    }
    echo "<br>Array xml link:<br>";
    print_r($aXmlLinks);
    echo "<br>Array product link:<br>";
    print_r($aOtherLinks);
    echo "<br>-----------------------------------------<br>";
}
print_r($aOtherLinks);
function getlinkfromxmlsitemap($sUrl) {
    // echo "Get link from: $sUrl<br>";
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows NT 6.3; WOW64; rv:47.0) Gecko/20100101 Firefox/47.0");
    curl_setopt($ch, CURLOPT_URL, $sUrl);
    $data = curl_exec($ch);
    curl_close($ch);
    $links = array();
    $count = preg_match_all('@<loc>(.+?)<\/loc>@', $data, $matches);
    for ($i = 0; $i < $count; ++$i) {
        $links[] = $matches[0][$i];
    }
    return $links;
}
?>
Answer 0 (score: 0)
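Your code works well, but there are a few things you can improve. Check the example below, which returns a nested array containing the links you are looking for: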
<?php
$sUrl1 = "http://www.phanmemtoday.com/sitemap.xml?page=1";
$sUrl2 = "http://www.phanmemtoday.com/sitemap.xml";
$aXmlLinks = array($sUrl1, $sUrl2);
$aOtherLinks = array();
while (!empty($aXmlLinks)) {
    foreach ($aXmlLinks as $i => $sTmpUrl) {
        unset($aXmlLinks[$i]);
        // One array of links per sitemap, so $aOtherLinks ends up nested.
        $aTmp = getlinkfromxmlsitemap($sTmpUrl);
        array_push($aOtherLinks, $aTmp);
    }
    echo "<br>Array xml link:<br>";
    print_r($aXmlLinks);
    echo "<br>Array product link:<br>";
    print_r($aOtherLinks);
    echo "<br>-----------------------------------------<br>";
}
print_r($aOtherLinks);

function getlinkfromxmlsitemap($sUrl) {
    // echo "Get link from: $sUrl<br>";
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows NT 6.3; WOW64; rv:47.0) Gecko/20100101 Firefox/47.0");
    curl_setopt($ch, CURLOPT_URL, $sUrl);
    $data = curl_exec($ch);
    $error = curl_error($ch);
    curl_close($ch);
    if ($data === false) {
        // Report the cURL failure instead of silently parsing nothing.
        echo "cURL error for $sUrl: $error<br>";
        return array();
    }
    $links = array();
    $count = preg_match_all('@<loc>(.+?)<\/loc>@', $data, $matches);
    for ($i = 0; $i < $count; ++$i) {
        // $matches[1] is the captured URL; $matches[0] would keep the <loc> tags.
        $links[] = $matches[1][$i];
    }
    return $links;
}
?>
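One likely reason the sitemap index case fails in the question's code is that `$matches[0]` holds the whole match, `<loc>` tags included, so the string handed back to cURL for the child sitemap is not a valid URL. The sketch below is one possible way to follow an index down to its page URLs. It is not the original poster's code: the names `parseSitemap`, `collectPageUrls`, `fetchWithCurl` and the `$maxSitemaps` limit are hypothetical, and it assumes PHP 7 or later and that an index document contains a `<sitemapindex` root tag.

```php
<?php
// Hypothetical helper: split the <loc> values of one sitemap document into
// child sitemaps (document is a <sitemapindex>) or page URLs (anything else).
function parseSitemap(string $xml): array
{
    $result = ['sitemaps' => [], 'urls' => []];
    // Capture group 1 is the URL only; group 0 would include the <loc> tags.
    if (!preg_match_all('~<loc>\s*(.+?)\s*</loc>~s', $xml, $matches)) {
        return $result;
    }
    // Assumption: an index is recognised by its <sitemapindex> root tag.
    $key = (stripos($xml, '<sitemapindex') !== false) ? 'sitemaps' : 'urls';
    foreach ($matches[1] as $loc) {
        // Sitemap URLs are entity-escaped (&amp;), so decode them.
        $result[$key][] = html_entity_decode($loc, ENT_QUOTES | ENT_XML1, 'UTF-8');
    }
    return $result;
}

// Hypothetical helper: walk from a start sitemap through any indexes and
// return the unique page URLs. $fetch is any callable mapping URL => XML string.
function collectPageUrls(string $startUrl, callable $fetch, int $maxSitemaps = 50): array
{
    $queue = [$startUrl];
    $seen = [];
    $pages = [];
    while ($queue && count($seen) < $maxSitemaps) {
        $url = array_shift($queue);
        if (isset($seen[$url])) {
            continue; // never download the same sitemap twice
        }
        $seen[$url] = true;
        $xml = $fetch($url);
        if (!is_string($xml) || $xml === '') {
            continue; // download failed or returned nothing
        }
        $parsed = parseSitemap($xml);
        $queue = array_merge($queue, $parsed['sitemaps']);
        $pages = array_merge($pages, $parsed['urls']);
    }
    return array_values(array_unique($pages));
}

// Hypothetical cURL fetcher with the same options as the question's code.
function fetchWithCurl(string $url): string
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows NT 6.3; WOW64; rv:47.0) Gecko/20100101 Firefox/47.0");
    curl_setopt($ch, CURLOPT_URL, $url);
    $data = curl_exec($ch);
    curl_close($ch);
    return is_string($data) ? $data : '';
}

// Usage (performs network requests, so it is left commented out):
// print_r(collectPageUrls("http://www.phanmemtoday.com/sitemap.xml", 'fetchWithCurl'));
```

Passing the fetcher in as a callable keeps the parsing logic testable without a network connection, and deciding by the root tag rather than by looking for `.xml` in the URL avoids misclassifying page URLs that happen to contain `.xml`.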