I'm very new to Python and Scrapy. I wrote a working script using Scrapy, but it needs some refactoring to avoid redundancy.
In the parse_article_page function there are two possibilities: either the article has variants (more pages to scrape) or it does not. Can you help me avoid duplicating the code between the else branch and the parse_data function?
I tried issuing a second request, but that does not seem to work. The log says "DEBUG: Filtered duplicate request" or nothing at all.
def parse_article_page(self, response):
    # Check for variants
    variants = response.xpath('//div[@class="variants"]/select/option[not(@disabled)]/@variant_href').extract()
    if len(variants) > 1:
        for variant in variants:
            variant_url = response.urljoin(variant)
            # Request article variants:
            yield scrapy.Request(variant_url, callback=self.parse_data)
    else:
        # yield scrapy.Request(response.url, callback=self.parse_data)  # Does not work
        item = ShopItem()
        item['desc'] = response.css(description_selector).extract()
        item['price'] = response.css(price_selector).extract()
        item['itno'] = response.css(no_selector).extract()
        item['url'] = response.url
        yield item
def parse_data(self, response):
    item = ShopItem()
    item['desc'] = response.css(description_selector).extract()
    item['price'] = response.css(price_selector).extract()
    item['itno'] = response.css(no_selector).extract()
    item['url'] = response.url
    yield item
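(For context on the "Filtered duplicate request" log line: Scrapy's dupefilter drops a second Request to a URL it has already scheduled during the crawl, which is why re-requesting response.url is silently discarded; passing dont_filter=True to scrapy.Request bypasses that check. Conceptually the filter behaves like a seen-set. The sketch below is a simplified plain-Python model of that behavior, not Scrapy's actual implementation.)

```python
# Simplified model of Scrapy's duplicate-request filter (illustrative sketch only).
class DupeFilter:
    def __init__(self):
        self.seen = set()  # URLs that have already been scheduled

    def should_schedule(self, url, dont_filter=False):
        """Return True if a request for this URL should be scheduled."""
        if dont_filter:
            return True  # caller explicitly opted out of deduplication
        if url in self.seen:
            return False  # duplicate: this is where Scrapy logs "Filtered duplicate request"
        self.seen.add(url)
        return True

f = DupeFilter()
print(f.should_schedule("https://example.com/article"))                    # True
print(f.should_schedule("https://example.com/article"))                    # False
print(f.should_schedule("https://example.com/article", dont_filter=True))  # True
```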
Answer 0 (score: 1)
Calling self.parse_data(response) directly in the else branch
won't work as-is, because parse_data is a generator: you still need to yield the items it produces so that Scrapy can collect them. You have to do something like this:
def parse_article_page(self, response):
    # Check for variants
    variants = response.xpath('//div[@class="variants"]/select/option[not(@disabled)]/@variant_href').extract()
    if len(variants) > 1:
        for variant in variants:
            variant_url = response.urljoin(variant)
            # Request article variants:
            yield scrapy.Request(variant_url, callback=self.parse_data)
    else:
        for item in self.parse_data(response):
            yield item

def parse_data(self, response):
    item = ShopItem()
    item['desc'] = response.css(description_selector).extract()
    item['price'] = response.css(price_selector).extract()
    item['itno'] = response.css(no_selector).extract()
    item['url'] = response.url
    yield item
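On Python 3 the delegating loop in the else branch can also be written as yield from self.parse_data(response). The pattern is plain generator delegation and works independently of Scrapy; the sketch below demonstrates it with hypothetical stand-in functions (no real Scrapy objects involved):

```python
# Generator delegation: the idea behind "for item in self.parse_data(...): yield item".
def parse_data(page):
    # Hypothetical stand-in for the Scrapy callback; yields one item dict.
    yield {'url': page}

def parse_article_page(page, variants):
    if len(variants) > 1:
        for v in variants:
            yield ('request', v)     # stands in for yielding scrapy.Request(...)
    else:
        yield from parse_data(page)  # Python 3 shorthand for the delegating for-loop

print(list(parse_article_page("/article", [])))      # [{'url': '/article'}]
print(list(parse_article_page("/a", ["v1", "v2"])))  # [('request', 'v1'), ('request', 'v2')]
```

Both spellings are equivalent here; yield from simply forwards every value the inner generator yields.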