Avoid redundancy in python / scrapy

Date: 2017-12-01 21:57:25

Tags: python scrapy response extract

I am very new to python and scrapy. I wrote a working script using scrapy, and it needs some improvement to avoid redundancy.

In the parse_article_page function I run into two possibilities: either the article has variants (more pages to scrape) or it does not. Can you help me avoid duplicating the code between the else branch and the parse_data function?

I tried a second request, but that does not seem to work. The log says "DEBUG: Filtered duplicate request" or nothing at all.

def parse_article_page(self, response):
    #Check for variants
    variants = response.xpath('//div[@class="variants"]/select/option[not(@disabled)]/@variant_href').extract()
    if len(variants) > 1:
        for variant in variants:
            variant_url = response.urljoin(variant) 
            #Request article variants:
            yield scrapy.Request(variant_url, callback=self.parse_data) 
    else:
        #yield scrapy.Request(response.url, callback=self.parse_data) #Does not work: filtered as a duplicate request
        item = ShopItem()
        item['desc'] = response.css(description_selector).extract()
        item['price'] = response.css(price_selector).extract()
        item['itno'] = response.css(no_selector).extract()
        item['url'] = response.url
        yield item

def parse_data(self, response):
    item = ShopItem()
    item['desc'] = response.css(description_selector).extract()
    item['price'] = response.css(price_selector).extract()
    item['itno'] = response.css(no_selector).extract()
    item['url'] = response.url
    yield item
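
(For reference: the "Filtered duplicate request" message comes from Scrapy's built-in duplicate filter, which drops a second request for a URL it has already crawled — here, response.url itself. If a second request to the same page were genuinely needed, the filter can be bypassed with dont_filter=True, as in this minimal sketch; the answer below shows a cleaner approach that avoids the extra request entirely.)

    else:
        #Sketch: dont_filter=True bypasses Scrapy's duplicate-request filter,
        #so re-requesting the already-crawled response.url is not dropped
        yield scrapy.Request(response.url, callback=self.parse_data, dont_filter=True)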

1 Answer:

Answer 0 (score: 1)

Calling self.parse_data(response) in the else branch will not work by itself, because you still need to yield the items that method generates for scrapy to collect them. You have to do something like this:

def parse_article_page(self, response):
    #Check for variants
    variants = response.xpath('//div[@class="variants"]/select/option[not(@disabled)]/@variant_href').extract()
    if len(variants) > 1:
        for variant in variants:
            variant_url = response.urljoin(variant) 
            #Request article variants:
            yield scrapy.Request(variant_url, callback=self.parse_data) 
    else:
        for item in self.parse_data(response):
            yield item

def parse_data(self, response):
    item = ShopItem()
    item['desc'] = response.css(description_selector).extract()
    item['price'] = response.css(price_selector).extract()
    item['itno'] = response.css(no_selector).extract()
    item['url'] = response.url
    yield item
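
On Python 3.3+ the delegation loop in the else branch can be shortened with yield from, which forwards every item the parse_data generator yields:

    else:
        #Equivalent on Python 3.3+: delegate directly to the generator
        yield from self.parse_data(response)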