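I am currently trying to use "PHP Simple HTML DOM Parser" (http://simplehtmldom.sourceforge.net/) to do some basic web scraping.
It seems to work perfectly with some websites but not with others. For example, it works for Google.com but not for a JobServe.com search.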
echo file_get_html('https://www.jobserve.com/gb/en/JobListing.aspx?shid=BB2D6366D16054EF')->plaintext;
echo file_get_html('http://www.google.com/')->plaintext;
Error:
Notice: Trying to get property of non-object in
C:\wamp\www\PHP_SCRAPER\_jobs\jobs_dom.php on line 11
Call Stack
# Time Memory Function Location
1 0.0003 365808 {main}( ) ..\jobs_dom.php:0
What is stopping the DOM parser from reading the site? Do I need to save a local copy and clean up the headers?
Answer 0 (score: 0)
You may want to use curl to fetch the page instead, since it gives you more options for error checking. For example:
// Assumes the library file is available under its usual name
require_once 'simple_html_dom.php';
// Create a curl handle for the page to fetch
$ch = curl_init('http://google.com');
// Treat HTTP status codes >= 400 as failures
curl_setopt($ch, CURLOPT_FAILONERROR, true);
// Return the response as a string instead of printing it to the browser
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// Set a connection timeout (in seconds)
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 20);
// Follow redirects
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_ENCODING, 'identity');
// Mimic a browser user agent
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.5) Gecko/20041107 Firefox/1.0');
// Execute the request once and reuse the result
$response = curl_exec($ch);
if ($response === false) {
    // Error handling: report why the request failed
    echo 'Request failed: ' . curl_error($ch);
} else {
    $html = new simple_html_dom();
    // Load HTML from a string
    $html->load($response);
    echo $html;
}
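As for the notice in the question itself: "Trying to get property of non-object" suggests that file_get_html() did not return a DOM object for the JobServe URL (as far as I know it returns false when the page cannot be fetched or is rejected by the library), so ->plaintext was read from a non-object. A minimal sketch of a guard is below. The helper name plaintext_or_null is hypothetical and not part of the library.

```php
<?php
// Hypothetical helper (not part of PHP Simple HTML DOM Parser):
// returns the plain text of a parsed document, or null when no
// usable DOM object was produced.
function plaintext_or_null($dom)
{
    // On failure the parser hands back a non-object (assumed to be false),
    // so check before dereferencing ->plaintext.
    if (!is_object($dom)) {
        return null;
    }
    return $dom->plaintext;
}

// Usage, assuming simple_html_dom.php has been included and $url is set:
// $text = plaintext_or_null(file_get_html($url));
// if ($text === null) { echo 'Could not load ' . $url; }
```

With a check like this the script can report the failure instead of raising the notice, which makes it easier to tell a fetch problem from a parsing problem.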