I have a script with a list of URLs from which I scrape information such as name, city, department, and so on.
Here are some of my functions:
function getCity($url)
{
    $url = curl_get_contents($url);
    $html_object = str_get_html($url);
    return $ret = $html_object->find('td', 86)->plaintext;
}
function getDepartment($url)
{
    $url = curl_get_contents($url);
    $html_object = str_get_html($url);
    return $ret = $html_object->find('td', 90)->plaintext;
}
function getSalary($url)
{
    $url = curl_get_contents($url);
    $html_object = str_get_html($url);
    $ret = $html_object->find('td', 94)->plaintext;
    return trim($ret);
}
Here is my cURL code:
function curl_get_contents($url)
{
    $curl_moteur = curl_init();
    curl_setopt($curl_moteur, CURLOPT_URL, $url);
    curl_setopt($curl_moteur, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($curl_moteur, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
    curl_setopt($curl_moteur, CURLOPT_FOLLOWLOCATION, 1);
    $web = curl_exec($curl_moteur);
    curl_close($curl_moteur);
    return $web;
}
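As you can see, I make a separate request for each field, which is very inefficient. I would like to implement a cache so that each URL is requested only once and all of its fields are extracted from that single response.
Thanks in advance.
Answer 0 (score: 0)
You could create a class with these functions: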
class Scrapper
{
    public $page_content;
    public $html_object;

    public function __construct($url)
    {
        // Fetch the page once; keep the raw HTML in case it is needed later.
        $this->page_content = $this->curl_get_contents($url);
        // Parse once with the global str_get_html() used in the question
        // (presumably from the Simple HTML DOM library, not SimpleXML).
        $this->html_object = str_get_html($this->page_content);
    }

    public function getCity()
    {
        return $this->html_object->find('td', 86)->plaintext;
    }

    public function getDepartment()
    {
        return $this->html_object->find('td', 90)->plaintext;
    }

    public function getSalary()
    {
        $ret = $this->html_object->find('td', 94)->plaintext;
        return trim($ret);
    }

    public function curl_get_contents($url)
    {
        $curl_moteur = curl_init();
        curl_setopt($curl_moteur, CURLOPT_URL, $url);
        curl_setopt($curl_moteur, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($curl_moteur, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
        curl_setopt($curl_moteur, CURLOPT_FOLLOWLOCATION, 1);
        $web = curl_exec($curl_moteur);
        curl_close($curl_moteur);
        return $web;
    }
}
$scrapper = new Scrapper($your_url);
echo $scrapper->getCity();
echo $scrapper->getDepartment();
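Note that the code is untested.
This way, the URL is requested only once, when the class is instantiated, and every getter reads from the already parsed page.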
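Since the question mentions a list of URLs, here is a minimal sketch of looping over that list with one object per URL. The function name `scrape_all` and the `$makeScraper` factory are hypothetical, introduced only for illustration; the factory is injected so the loop can be run without network access.

```php
<?php
/**
 * Collects all fields for each URL using one scraper object per URL.
 * $makeScraper is a hypothetical factory that returns an object exposing
 * getCity(), getDepartment() and getSalary(), for example:
 *     function ($url) { return new Scrapper($url); }
 */
function scrape_all(array $urls, callable $makeScraper): array
{
    $rows = [];
    foreach ($urls as $url) {
        // With the class above, the page is fetched and parsed once here.
        $scraper = $makeScraper($url);
        $rows[$url] = [
            'city'       => trim($scraper->getCity()),
            'department' => trim($scraper->getDepartment()),
            'salary'     => $scraper->getSalary(),
        ];
    }
    return $rows;
}
```

The result is an array keyed by URL, so each page costs one request no matter how many fields are read from it.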
Alternatively, if you would rather not use objects, a static variable gives a simple fix:
function curl_get_contents($url)
{
    // Cache responses per URL, so each page is downloaded at most once
    // per script run and different URLs do not share one cached result.
    static $cache = array();
    if (array_key_exists($url, $cache)) {
        return $cache[$url];
    }
    $curl_moteur = curl_init();
    curl_setopt($curl_moteur, CURLOPT_URL, $url);
    curl_setopt($curl_moteur, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($curl_moteur, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
    curl_setopt($curl_moteur, CURLOPT_FOLLOWLOCATION, 1);
    $cache[$url] = curl_exec($curl_moteur);
    curl_close($curl_moteur);
    return $cache[$url];
}
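Caching only the downloaded HTML still leaves str_get_html() parsing the page again for every field. A minimal sketch of caching the parsed result per URL instead might look like the following; `memoize_by_key` is a hypothetical helper name, not part of PHP or of the parsing library, and it assumes the key is a string or integer such as a URL.

```php
<?php
/**
 * Wraps a one-argument function so that each distinct argument is
 * computed only once; later calls return the stored result.
 * Hypothetical helper for illustration; keys must be strings or integers.
 */
function memoize_by_key(callable $fn): callable
{
    $cache = [];
    return function ($key) use (&$cache, $fn) {
        if (!array_key_exists($key, $cache)) {
            $cache[$key] = $fn($key);
        }
        return $cache[$key];
    };
}

// Possible usage, assuming curl_get_contents() and str_get_html()
// from the question are loaded:
//
//     $load_page = memoize_by_key(function ($url) {
//         return str_get_html(curl_get_contents($url));
//     });
//     $city   = $load_page($url)->find('td', 86)->plaintext;
//     $salary = trim($load_page($url)->find('td', 94)->plaintext);
```

Each call to `memoize_by_key` creates its own cache, so the fetch and parse steps could also be wrapped separately if the raw HTML is needed elsewhere.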