Question

I'm currently trying to use the "PHP Simple HTML DOM Parser" (http://simplehtmldom.sourceforge.net/) to do some basic web scraping.

It seems to work perfectly with some websites but not with others. For example, it works with Google.com but not with a JobServe.com search.

echo file_get_html('https://www.jobserve.com/gb/en/JobListing.aspx?shid=BB2D6366D16054EF')->plaintext; 
echo file_get_html('http://www.google.com/')->plaintext;

Error:

Notice: Trying to get property of non-object in C:\wamp\www\PHP_SCRAPER\_jobs\jobs_dom.php on line 11
Call Stack
#   Time    Memory  Function    Location
1   0.0003  365808  {main}( )   ..\jobs_dom.php:0

What is preventing the DOM parser from reading the site? Do I need to save a local copy and clean up the headers?

Answer 1

You may want to use curl to fetch the page, since it gives you more error-checking options. For example:

// Assumes simple_html_dom.php has already been included
// Create a curl handle for the page to fetch
$ch = curl_init('http://google.com');

// Treat HTTP status codes >= 400 as failures
curl_setopt($ch, CURLOPT_FAILONERROR, true);
// Return the response as a string instead of printing it to the browser
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// Set a connection timeout (in seconds)
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 20);
// Follow redirects
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_ENCODING, 'identity');
// Mimic a browser user agent
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.5) Gecko/20041107 Firefox/1.0');

// Run the request once and keep the result
$response = curl_exec($ch);
if ($response === false) {
    echo 'page not found'; // error handling
} else {
    $html = new simple_html_dom();
    // Load HTML from a string
    $html->load($response);
    echo $html;
}
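
To make the "more error-checking options" point concrete, here is a minimal sketch that reports why a request failed rather than only printing "page not found". The helper name `fetch_page` and the shape of its return array are hypothetical, chosen for illustration; only standard cURL functions from the PHP manual are used, and the parser is not involved at all.

```php
<?php
// Hypothetical helper (not from the original answer): fetch a URL with cURL
// and return the body together with diagnostic details.
function fetch_page($url, $timeout = 20)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);      // return the body as a string
    curl_setopt($ch, CURLOPT_FAILONERROR, true);         // treat HTTP status >= 400 as a failure
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);      // follow redirects
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);  // connection timeout in seconds

    $body = curl_exec($ch);                              // perform the request exactly once

    return array(
        'ok'        => $body !== false,
        'body'      => $body === false ? null : $body,
        'errno'     => curl_errno($ch),                  // 0 when no cURL error occurred
        'error'     => curl_error($ch),                  // human-readable error message
        'http_code' => curl_getinfo($ch, CURLINFO_HTTP_CODE),
    );
}
```

A caller could check `$result['ok']` and, on failure, print `$result['error']` and `$result['http_code']` to see whether the problem is a timeout, a TLS issue, or an HTTP error status, before passing `$result['body']` to `$html->load()`.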
