I wrote the following code:
$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
$data = curl_exec($ch);
curl_close($ch);
$html = $data;
//parsing begins here:
$doc = new \DOMDocument();
@$doc->loadHTML($html);
$metas = $doc->getElementsByTagName('meta');
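This code currently works, but some URLs block requests from PHP scripts to prevent scraping. How can I get around this?
Answer 0 (score: 3)
Add a User-Agent header; many sites reject requests that do not send one, and with it the request should go through: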
curl_setopt($ch,CURLOPT_USERAGENT,'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
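Putting this together with the code from the question, a minimal sketch might look like the following. The helper names `fetchHtml` and `extractMetaTags` are hypothetical (they do not appear in the original post), and the sketch assumes the site only checks for the presence of a browser-like User-Agent; sites that use other anti-scraping measures may still refuse the request.

```php
<?php
// Hypothetical helper: fetch a page while sending a browser-like User-Agent.
// Returns null if the transfer fails.
function fetchHtml(string $url, string $userAgent): ?string
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_HEADER, false);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
    $data = curl_exec($ch);

    return $data === false ? null : $data;
}

// Hypothetical helper: collect <meta name="..." content="..."> pairs
// from an HTML string, keyed by the lowercased name attribute.
function extractMetaTags(string $html): array
{
    if ($html === '') {
        return [];
    }

    // Collect libxml parse warnings internally instead of silencing them with @.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $result = [];
    foreach ($doc->getElementsByTagName('meta') as $meta) {
        $name = $meta->getAttribute('name');
        if ($name !== '') {
            $result[strtolower($name)] = $meta->getAttribute('content');
        }
    }

    return $result;
}
```

Usage would be along the lines of `$html = fetchHtml($url, $userAgent);` followed by `extractMetaTags($html)` when `$html` is not null, with `$userAgent` set to a string such as the one shown in this answer.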
Answer 1 (score: 2)
You can extract all meta tags with the built-in get_meta_tags() function:
$tags = get_meta_tags('http://www.example.com/');
// Notice how the keys are all lowercase now, and
// how . was replaced by _ in the key.
echo $tags['author']; // name
echo $tags['keywords']; // php documentation
echo $tags['description']; // a php manual
echo $tags['geo_position']; // 49.33;-86.59
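Note that get_meta_tags() fetches remote URLs through PHP's HTTP stream wrapper, which sends no User-Agent header unless the `user_agent` ini setting is configured, so a site that blocks scripts may reject it as well. A minimal sketch of one way to handle that follows; the wrapper name `readMetaTags` and the User-Agent string are illustrative assumptions, not part of the original answer.

```php
<?php
// Assumption: the target site only rejects requests that lack a User-Agent.
// The user_agent ini setting is used by PHP's HTTP stream wrapper,
// which get_meta_tags() relies on when given a URL.
ini_set('user_agent', 'Mozilla/5.0 (compatible; ExampleFetcher/1.0)');

// Hypothetical wrapper: return an empty array instead of false on failure.
function readMetaTags(string $source): array
{
    // @ hides the warning emitted when the source cannot be opened;
    // the false return value is handled explicitly below.
    $tags = @get_meta_tags($source);

    return $tags === false ? [] : $tags;
}
```

Because get_meta_tags() also accepts a local file path, the same wrapper can be used on a page that was already downloaded and saved to disk.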