PHP - Web scraping - How do I cache with cURL?

Time: 2014-01-31 12:30:36

Tags: php curl web-scraping

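I have a script with a list of URLs, and from each URL I extract information such as name, city, department, and so on.

Here are some of my functions: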

function getCity($url)
{
    $html = curl_get_contents($url);
    $html_object = str_get_html($html);
    return $html_object->find('td', 86)->plaintext;
}

function getDepartment($url)
{
    $html = curl_get_contents($url);
    $html_object = str_get_html($html);
    return $html_object->find('td', 90)->plaintext;
}

function getSalary($url)
{
    $html = curl_get_contents($url);
    $html_object = str_get_html($html);
    $ret = $html_object->find('td', 94)->plaintext;
    return trim($ret);
}

Here is my cURL code:

function curl_get_contents($url)
{
  $curl_moteur = curl_init();
  curl_setopt($curl_moteur, CURLOPT_URL, $url);
  curl_setopt($curl_moteur, CURLOPT_RETURNTRANSFER, 1);

  curl_setopt($curl_moteur,CURLOPT_USERAGENT,'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');

  curl_setopt($curl_moteur, CURLOPT_FOLLOWLOCATION, 1);
  $web = curl_exec($curl_moteur);
  curl_close($curl_moteur);
  return $web;
}

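As you can see, I make a separate request for every field, which is very inefficient. I would like to implement a cache so that each URL is requested only once and all of its fields are extracted from that single response.

Thanks in advance.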

1 answer:

Answer 0 (score: 0)

You can create a class with the following methods:

  class Scrapper
{
    public $page_content;

    public $html_object;

    public function __construct($url)
    {
        $this->page_content = $this->curl_get_contents($url); // keep the raw page content in case it is needed later
        $this->html_object  = $this->str_get_html($this->page_content); // parse the HTML once and reuse the resulting object
    }

    public function getCity()
    {
        return $this->html_object->find('td', 86)->plaintext;
    }

    public function getDepartment()
    {
        return $this->html_object->find('td', 90)->plaintext;
    }

    public function getSalary()
    {

        $ret = $this->html_object->find('td', 94)->plaintext;
        return trim($ret);
    }

    public function curl_get_contents($url)
    {
        $curl_moteur = curl_init();
        curl_setopt($curl_moteur, CURLOPT_URL, $url);
        curl_setopt($curl_moteur, CURLOPT_RETURNTRANSFER, 1);

        curl_setopt($curl_moteur,CURLOPT_USERAGENT,'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');

        curl_setopt($curl_moteur, CURLOPT_FOLLOWLOCATION, 1);
        $web = curl_exec($curl_moteur);
        curl_close($curl_moteur);
        return $web;
    }

    public function str_get_html($html)
    {
        // Delegates to the global str_get_html() used in the question
        // (presumably from the PHP Simple HTML DOM Parser library).
        return str_get_html($html);
    }
}

$scrapper = new Scrapper($your_url);

echo $scrapper->getCity();
echo $scrapper->getDepartment();

Note that this code is untested.

This way, the URL is requested only once, when the class is instantiated.
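
To illustrate the idea of downloading once and reading several fields from the same parsed document, here is a minimal sketch. It uses PHP's built-in DOMDocument instead of the str_get_html() library from the question, only so that it runs without extra dependencies. The function name extract_fields, the field map, and the sample HTML are assumptions made for illustration; on the real page the indexes would be 86, 90, and 94 as in the question.

```php
<?php
// Hypothetical helper: parse the HTML once and read every wanted <td> by index.
function extract_fields(string $html, array $field_indexes): array
{
    $doc = new DOMDocument();
    // Collect parser warnings internally; real-world HTML rarely validates.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $cells  = $doc->getElementsByTagName('td');
    $fields = [];
    foreach ($field_indexes as $name => $index) {
        $cell = $cells->item($index); // null when the index does not exist
        $fields[$name] = $cell === null ? null : trim($cell->textContent);
    }
    return $fields;
}

// Sample markup standing in for a downloaded page (assumed structure).
$html   = '<table><tr><td> Paris </td><td>Sales</td><td> 42000 </td></tr></table>';
$fields = extract_fields($html, ['city' => 0, 'department' => 1, 'salary' => 2]);
```

With this shape, the page content is fetched and parsed a single time per URL, and adding another field only means adding another entry to the index map.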

Or, if you do not want to use objects, you can fix this easily with a static variable:

function curl_get_contents($url)
{
  // Cache responses per URL so that each page is downloaded at most once per script run.
  static $cache = array();
  if (isset($cache[$url])) {
     return $cache[$url];
  }

  $curl_moteur = curl_init();
  curl_setopt($curl_moteur, CURLOPT_URL, $url);
  curl_setopt($curl_moteur, CURLOPT_RETURNTRANSFER, 1);

  curl_setopt($curl_moteur,CURLOPT_USERAGENT,'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');

  curl_setopt($curl_moteur, CURLOPT_FOLLOWLOCATION, 1);
  $web = curl_exec($curl_moteur);
  curl_close($curl_moteur);
  $cache[$url] = $web; // remember the response for later calls with the same URL
  return $web;
}
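
A static variable only lives for one script run. If the pages should also be reused between runs, one possible extension is a small file cache. The following is a rough sketch under stated assumptions, not part of the original answer: the function name cached_fetch, the $fetch callable, the cache directory, and the one-hour lifetime are all hypothetical choices. The demonstration uses a stand-in fetcher so that it runs without network access.

```php
<?php
// Hypothetical helper: return the body for $url, reusing a copy on disk when
// it is younger than $ttl seconds. $fetch is any callable that downloads a URL.
function cached_fetch(string $url, callable $fetch, string $cache_dir, int $ttl = 3600): string
{
    // One cache file per URL; sha1() gives a filesystem-safe name.
    $path = $cache_dir . '/' . sha1($url) . '.html';

    if (is_file($path) && time() - filemtime($path) < $ttl) {
        $cached = file_get_contents($path);
        if ($cached !== false) {
            return $cached;
        }
    }

    $body = $fetch($url);
    // Only store successful, non-empty responses.
    if (is_string($body) && $body !== '') {
        file_put_contents($path, $body, LOCK_EX);
    }
    return (string) $body;
}

// Demonstration with a stand-in fetcher that counts how often it is called.
$cache_dir = sys_get_temp_dir() . '/scrape_cache_' . uniqid();
mkdir($cache_dir);
$calls = 0;
$fake_fetch = function (string $url) use (&$calls): string {
    $calls++;
    return "<html>content of $url</html>";
};

$first  = cached_fetch('http://example.com/a', $fake_fetch, $cache_dir);
$second = cached_fetch('http://example.com/a', $fake_fetch, $cache_dir);
```

In a real script, the name of the existing download function could be passed as the callable, for example cached_fetch($url, 'curl_get_contents', $cache_dir).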