What is the equivalent of cURL in Scrapy?

Time: 2016-04-21 11:09:47

Tags: php python-2.7 scrapy web-crawler scrapy-spider

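I want to use Scrapy to crawl a website that uses AJAX pagination. I already scrape this site from PHP with cURL. I monitor the network traffic with Firebug, which offers a "Copy as cURL" option for the POST request. My question is how to do the same thing with Scrapy.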

My PHP function:

    function forCurl($url, $refer, $jsessionid) {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:34.0) Gecko/20100101 Firefox/34.0');
        // POST body from the copied cURL command (--data); it had been
        // pasted into the Cache-Control header string by mistake.
        curl_setopt($ch, CURLOPT_POST, 1);
        curl_setopt($ch, CURLOPT_POSTFIELDS, 't%3Azoneid=forceAjax');
        // Request headers, one entry per header line.
        $header = array();
        $header[] = "Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
        $header[] = "Cache-Control: no-cache";
        $header[] = "Connection: keep-alive";
        $header[] = "Accept-Language: fr,fr-fr;q=0.8,en-us;q=0.5,en;q=0.3";
        $header[] = "Pragma: no-cache";
        $header[] = "X-Requested-With: XMLHttpRequest";
        $header[] = "Keep-Alive: 700";
  $cookie = "JSESSIONID=" . $jsessionid. '; langueFront=fr; tc_cj_v2=%5Ecl_%5Dny%5B%5D%5D_mmZZZZZZKNLLMQOMROKJRZZZ%5D777_rn_lh%5BfyfcheZZZ%7B%7E%28%24%29H/*+%7E-%241%20H%21-ZZZKNLLMQOSQMMRNZZZ%5D777%5Ecl_%5Dny%5B%5D%5D_mmZZZZZZKNLLNNJJKNRRMZZZ%5D777_rn_lh%5BfyfcheZZZ%7B%7E%28%24%29H/*+%7E-%241%20H%21-ZZZKNLLNNKNJOJSKZZZ%5D777%5Ecl_%5Dny%5B%5D%5D_mmZZZZZZKNLLNNMLSNSKLZZZ%5D777_rn_lh%5BfyfcheZZZ222H%7B0%7D%23%7B%29H%21-ZZZKNLLNNMMLMJNJZZZ%5D777%5Ecl_%5Dny%5B%5D%5D_mmZZZZZZKNLLNOOJSKRKMZZZ%5D777_rn_lh%5BfyfcheZZZ%7B%7E%28%24%29H/*+%7E-%241%20H%21-ZZZKNLLNOOLSOMPNZZZ%5D777%5Ecl_%5Dny%5B%5D%5D_mmZZZZZZKNLLNOPJMROQLZZZ%5D777_rn_lh%5BfyfcheZZZ%7B%7E%28%24%29H/*+%7E-%241%20H%21-ZZZKNLLNOPMQSKNOZZZ%5D; _ga=GA1.2.487921595.1421941922; aurol=GA1.2.865695137.1421941922; __utma=239562643.487921595.1421941922.1422452658.1422454606.14; __utmz=239562643.1422443324.10.2.utmcsr=Sphere_myWebSite|utmccn=myWebSitefr_logo|utmcmd=Interne; kameleoonVisitIdentifier=rj1hnzh5ux1n2gxr/4; myWebSiteCook=\"869|\"; revelationDriveWin=2; myWebSite.hamon=1; __utmv=239562643.|1=visite_myWebSitedrive=239562643.487921595.1421941922.1422452658.1422454606.14=1; tosend=%7B%22p%22%3A%7B%22tracker%22%3A%22myWebSitedrive%22%2C%20%22url%22%3A%22rayon%22%2C%20%22mtime%22%3A1422455760000%2C%20%22ref%22%3A%22http%3A%2F%2Fwww.myWebSitedrive.fr%2Fdrive%2Frecherche%2Fbio%22%2C%20%22dest%22%3A%22http%3A%2F%2Fwww.myWebSitedrive.fr%2Fdrive%2FNice-Cote-dAzur-869%2FSurgeles-R41355%2FViandes-Volailles-41478%2F%22%7D%2C%22d%22%3A%7B%22dv%22%3A%22NA%22%7D%2C%20%22t%22%3A%7B%22iplobserverstart%22%3A%221422455762613%22%2C%22jsinit%22%3A%221422455763871%22%2C%22domload%22%3A%221422455764728%22%2C%22clicklink%22%3A%221422455817128%22%2C%22unload%22%3A%221422455817521%22%7D%7D; kameleoonExperiment-14570=86018/1422452656881/false; __utmc=239562643; rdmvalidation=1; layerDrivePromos=2; __utmb=239562643.19.10.1422454606; _gat=1; _gat_myWebSiteRollup=1; __utmt=1; __utmt_secondTracker=1; __utmli=toPage_14b30fac8d4_0';
        curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
        curl_setopt($ch, CURLOPT_REFERER, $refer);
        curl_setopt($ch, CURLOPT_COOKIE, $cookie);
        $content = curl_exec($ch);
        curl_close($ch);
        return $content;
    }

I would like to know how to send the same parameters with Scrapy. Also, is Scrapy a good choice for crawling a site that uses AJAX pagination?

I tried this:

    yield Request(sousUrl, headers={'Referer': '%s' % url}, callback=self.parse_page)

1 answer:

Answer 0: (score: 0)

In Python, you can use PycURL.

PycURL is a Python interface to libcurl.
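
If you would rather stay inside Scrapy, the same POST can probably be expressed through the arguments that `scrapy.Request` accepts (`url`, `method`, `headers`, `cookies`, `body`). Below is a minimal sketch, assuming Python 3. The helper names `parse_cookie_header` and `build_ajax_request_kwargs`, the example URLs, and the shortened cookie string are all hypothetical and not taken from the question. Only the form field `t:zoneid=forceAjax` comes from the copied cURL data.

```python
from urllib.parse import urlencode


def parse_cookie_header(cookie_header):
    """Split a 'name=value; name2=value2' string into a dict."""
    cookies = {}
    for pair in cookie_header.split(";"):
        name, sep, value = pair.strip().partition("=")
        if sep and name:
            cookies[name] = value
    return cookies


def build_ajax_request_kwargs(url, referer, cookie_header, form_fields):
    """Hypothetical helper: build keyword arguments for a POST request.

    The keys mirror parameters of scrapy.Request, so the result can be
    unpacked into it together with a callback.
    """
    headers = {
        "Referer": referer,
        "X-Requested-With": "XMLHttpRequest",
        # curl --data sends this content type; a plain Request does not
        # add it on its own, so it is set explicitly here.
        "Content-Type": "application/x-www-form-urlencoded",
    }
    return {
        "url": url,
        "method": "POST",
        "headers": headers,
        "cookies": parse_cookie_header(cookie_header),
        "body": urlencode(form_fields),
    }


kwargs = build_ajax_request_kwargs(
    url="http://www.example.com/drive/rayon",    # placeholder URL
    referer="http://www.example.com/drive/",     # placeholder referer
    cookie_header="JSESSIONID=abc123; langueFront=fr",  # shortened example
    form_fields={"t:zoneid": "forceAjax"},
)
print(kwargs["body"])
```

Inside a spider callback this could be used as `yield scrapy.Request(callback=self.parse_page, **kwargs)`. `scrapy.FormRequest` with a `formdata` dict is another way to send the form body, and recent Scrapy releases also offer `Request.from_curl()`, which takes a copied cURL command string directly.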