Using Scrapy

Time: 2015-11-26 12:52:15

Tags: javascript python scrapy dopostback

The following page opens its product detail views through JavaScript requests: http://www.ooshop.com/ContentNavigation.aspx?TO_NOEUD_IDMO=N000000013143&FROM_NOEUD_IDMO=N000000013131&TO_NOEUD_IDFO=81080&NOEUD_NIVEAU=2&UNIVERS_INDEX=3

Each product has an element like this:

<a id="ctl00_cphC_pn3T1_ctl01_rp_ctl00_ctl00_lbVisu" class="prodimg" href="javascript:__doPostBack('ctl00$cphC$pn3T1$ctl01$rp$ctl00$ctl00$lbVisu','')"><img id="ctl00_cphC_pn3T1_ctl01_rp_ctl00_ctl00_iVisu" title="Visualiser la fiche détail" class="image" onerror="this.src='/Media/images/null.gif';" src="Media/ProdImages/Produit/Vignettes/3270190199359.gif" alt="Dés de jambon" style="height:70px;width:70px;border-width:0px;margin-top:15px"></a>
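As a small illustration of what that href carries: the first argument of `__doPostBack` is the event target and the second is the event argument. The sketch below pulls both out with a regular expression instead of fixed string offsets. The names `POSTBACK_RE` and `parse_postback` are made up for this example and are not part of Scrapy.

```python
import re

# Captures both arguments of javascript:__doPostBack('<target>','<argument>')
POSTBACK_RE = re.compile(r"__doPostBack\('([^']*)'\s*,\s*'([^']*)'\)")


def parse_postback(href):
    """Return (event_target, event_argument), or None if href is not a postback link."""
    match = POSTBACK_RE.search(href)
    return match.groups() if match else None


href = "javascript:__doPostBack('ctl00$cphC$pn3T1$ctl01$rp$ctl00$ctl00$lbVisu','')"
print(parse_postback(href))
# ('ctl00$cphC$pn3T1$ctl01$rp$ctl00$ctl00$lbVisu', '')
```

A regex is less fragile here than slicing by position, since it keeps working if the prefix or the argument list changes length.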

I tried to scrape these pages with FormRequest from the Scrapy library, but it does not seem to work:

import scrapy
from scrapy.http import FormRequest
from JStest.items import JstestItem


class ooshoptest2(scrapy.Spider):
    name = "ooshoptest2"
    allowed_domains = ["ooshop.com"]
    start_urls = ["http://www.ooshop.com/courses-en-ligne/ContentNavigation.aspx?TO_NOEUD_IDMO=N000000013143&FROM_NOEUD_IDMO=N000000013131&TO_NOEUD_IDFO=81080&NOEUD_NIVEAU=2&UNIVERS_INDEX=3"]

    def parse(self, response):
        URL = response.url
        path = '//div[@class="blockInside"]//ul/li/a'
        for balise in response.xpath(path):
            # href looks like: javascript:__doPostBack('<target>','')
            jsrequest = response.urljoin(balise.xpath('@href').extract()[0])
            # Slice out <target> and wrap it in single quotes
            js = "'" + jsrequest[25:-5] + "'"
            data = {'__EVENTTARGET': js, '__EVENTARGUMENT': ''}

            # POST the postback fields back to the listing page
            yield FormRequest(url=URL,
                              method='POST',
                              callback=self.parse_level1,
                              formdata=data,
                              dont_filter=True)

    def parse_level1(self, response):
        path = '//div[@class="popContent"]'
        test = response.xpath(path)[0].extract()
        print(test)
        item = JstestItem()

        yield item

Does anyone know how to make this work? Thanks a lot!

0 answers:

No answers yet