How do I get meta tags in PHP?

Time: 2018-02-14 11:36:31

Tags: php curl

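I am trying to extract the meta tags of the URL below, but it does not work and produces the following result: Warning: get_meta_tags(https://www.washingtonpost.com/politics/white-house-reels-as-fbi-director-contradicts-official-claims-about-alleged-abuser/2018/02/13/f010f256-10d9-11e8-9570-29c9830535e5_story.html?tid=pm_pop): failed to open stream: Redirection limit reached, aborting. Any ideas?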

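1 answer:

Answer 0: (score: -1)

First, you need to make a request to the first page (the site's home page) so that the cookies get set; otherwise it will not work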

$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, false);
// Note: this turns off TLS certificate verification; leave it out in production.
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_URL, "https://www.washingtonpost.com");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/525.13 (KHTML, like Gecko) Chrome/0.A.B.C Safari/525.13");
// Name the cookie file after the visitor's PHP session ID, if there is one.
$cookieName = "";
if (isset($_COOKIE['PHPSESSID'])) {
    $cookieName = $_COOKIE['PHPSESSID'];
}
// Both requests must read and write the same cookie file.
$cookieFile = $_SERVER['DOCUMENT_ROOT'] . '/logs/' . $cookieName . '.txt';
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieFile);  // cookies are saved here
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieFile); // cookies are loaded from here
curl_exec($ch);
curl_close($ch);

Then make a second request to fetch the actual page

$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, false);
// Note: this turns off TLS certificate verification; leave it out in production.
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_URL, "https://www.washingtonpost.com/politics/white-house-reels-as-fbi-director-contradicts-official-claims-about-alleged-abuser/2018/02/13/f010f256-10d9-11e8-9570-29c9830535e5_story.html?tid=pm_pop");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/525.13 (KHTML, like Gecko) Chrome/0.A.B.C Safari/525.13");
$cookieName = "";
if (isset($_COOKIE['PHPSESSID'])) {
    $cookieName = $_COOKIE['PHPSESSID'];
}
// This must be the same cookie file that the first request wrote to.
$cookieFile = $_SERVER['DOCUMENT_ROOT'] . '/logs/' . $cookieName . '.txt';
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieFile);
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieFile);
$page = curl_exec($ch); // HTML of the article, or false on failure
curl_close($ch);

Finally, we parse the DOM tree with DOMDocument

libxml_use_internal_errors(true); // suppress warnings about malformed HTML
$siteData = new DOMDocument();
$siteData->loadHTML($page);

$metaElements = $siteData->getElementsByTagName("meta");
if ($metaElements->length === 0) {
    echo "ERROR";
}

// Collect the attributes of every meta tag as [name, value] pairs.
$meta = array();
for ($i = 0; $i < $metaElements->length; $i++) {
    $meta[$i] = array();
    $attributes = $metaElements->item($i)->attributes;
    for ($j = 0; $j < $attributes->length; $j++) {
        $meta[$i][$j] = array($attributes->item($j)->name, $attributes->item($j)->value);
    }
}
print_r($meta);

The meta tags are stored in the $meta array
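The $meta array holds raw attribute name/value pairs. If a lookup in the style of get_meta_tags() is more convenient, the same DOM walk can build a map from each tag's name (or property) attribute to its content. A minimal sketch, assuming a helper called extractMetaMap (a hypothetical name, not part of PHP) and an inline sample HTML string standing in for $page:

```php
<?php
// Hypothetical helper: map each <meta> tag's name/property to its content.
function extractMetaMap(string $html): array
{
    $map = [];
    if ($html === '') {
        return $map; // nothing to parse
    }

    // Collect libxml warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('meta') as $tag) {
        // Open Graph tags use "property" instead of "name".
        $key = $tag->getAttribute('name') ?: $tag->getAttribute('property');
        if ($key !== '' && $tag->hasAttribute('content')) {
            $map[strtolower($key)] = $tag->getAttribute('content');
        }
    }
    return $map;
}

// Sample input standing in for the $page string fetched with cURL.
$sample = '<html><head>'
    . '<meta name="description" content="Example page">'
    . '<meta property="og:title" content="Example title">'
    . '</head><body></body></html>';

$metaMap = extractMetaMap($sample);
print_r($metaMap);
```

With the real page, extractMetaMap($page) could be called once $page has been checked for false. Tags without a content attribute (such as a charset declaration) are skipped in this sketch.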

You can improve this code by organizing the cURL calls.
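One possible way to organize the cURL code is to keep the shared options in one place and reuse them for both requests. This is only a sketch under stated assumptions: the function names buildCurlOptions and fetchPage are hypothetical, the cookie file is placed in the system temp directory, the user agent string is a placeholder, redirects are followed with a cap of 10 (the snippets in this answer do not set CURLOPT_FOLLOWLOCATION), and certificate verification is left at cURL's default instead of being disabled.

```php
<?php
// Hypothetical helper: the options shared by both requests.
function buildCurlOptions(string $url, string $cookieFile): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_HEADER         => false,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true, // assumption: follow redirects
        CURLOPT_MAXREDIRS      => 10,   // assumption: cap on redirect hops
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (compatible; meta-fetcher)', // placeholder
        CURLOPT_COOKIEJAR      => $cookieFile, // cookies are saved here when the handle is released
        CURLOPT_COOKIEFILE     => $cookieFile, // cookies are loaded from here
    ];
}

// Hypothetical helper: run one request; returns the body, or null on failure.
function fetchPage(string $url, string $cookieFile): ?string
{
    $ch = curl_init();
    curl_setopt_array($ch, buildCurlOptions($url, $cookieFile));
    $body = curl_exec($ch);
    // $ch is released when the function returns, which flushes the cookie jar.
    return $body === false ? null : $body;
}

// Assumed location for the shared cookie file.
$cookieFile = sys_get_temp_dir() . '/meta_cookies.txt';
$options = buildCurlOptions('https://www.washingtonpost.com', $cookieFile);

// Usage (needs network access, so it is left commented out):
// fetchPage('https://www.washingtonpost.com', $cookieFile); // first request sets the cookies
// $page = fetchPage($articleUrl, $cookieFile);              // $articleUrl: the article URL from the question
```

Because both calls go through the same function with the same $cookieFile, the cookie path cannot drift apart between the first and second request.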