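I'm building a simple(ish) parser in Scrapy, and I'm blissfully ignorant when it comes to Scrapy and Python :-) In the items.py file I have a definition of thisItem(), which I assign to item in the code below. Everything was going swimmingly, with parse using a callback to get to parse_dir_contents, until I realised I needed to scrape an extra bit of data and created another function, parse_other_content. How do I get what is already in item into parse_other_content?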
    import scrapy
    from this-site.items import *
    import re
    import json

    class DmozSpider(scrapy.Spider):
        name = "ABB"
        allowed_domains = ["this-site.com.au"]
        start_urls = [
            "https://www.this-site.com.au?page=1",
            "https://www.this-site.com.au?page=2",
        ]

        def parse(self, response):
            for href in response.xpath('//h3/a/@href'):
                url = response.urljoin(href.extract())
                yield scrapy.Request(url, callback=self.parse_dir_contents)

        def parse_dir_contents(self, response):
            for sel in response.xpath('//h1[@itemprop="name"]'):
                item = thisItem()
                item['title'] = sel.xpath('text()').extract()
                item['rate'] = response.xpath('//div[@class="rate"]/div/span/text()').extract()
                so = re.search( r'\d+', response.url)
                propID = so.group()
                item['propid'] = propID
                item['link'] = response.url
                yield scrapy.Request("https://www.this-site.com.au/something?listing_id="+propID,callback=self.parse_other_content)
                #yield item

        def parse_other_content(self, reponse):
            sel = json.loads(reponse.body)
            item['rate_detail'] = sel["this"][0]["that"]
            yield item
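I know I'm missing something simple here, but I can't seem to figure it out.

Answer 0 (score: 2)

According to the Scrapy documentation (http://doc.scrapy.org/en/1.0/topics/request-response.html#topics-request-response-ref-request-callback-arguments):

"In some cases you may be interested in passing arguments to those callback functions so you can receive the arguments later, in the second callback. You can use the Request.meta attribute for that."

In your case I would do something like this: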
    def parse_dir_contents(self, response):
        for sel in response.xpath('//h1[@itemprop="name"]'):
            item = thisItem()
            ...
            request = scrapy.Request("https://www.this-site.com.au/something?listing_id="+propID,callback=self.parse_other_content)
            request.meta['item'] = item
            yield request

    def parse_other_content(self, response):
        item = response.meta['item']
        # do something with the item
        return item
As Steve points out (see the comments), you can also pass the meta dictionary as a keyword argument to the Request constructor, like so:
    def parse_dir_contents(self, response):
        for sel in response.xpath('//h1[@itemprop="name"]'):
            item = thisItem()
            ...
            request = scrapy.Request("https://www.this-site.com.au/something?listing_id="+propID,callback=self.parse_other_content, meta={'item':item})
            yield request
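
To see the hand-off end to end without running a crawl, here is a rough sketch in plain Python. It does not import Scrapy: FakeRequest, FakeResponse and run are hypothetical stand-ins I made up to imitate how the meta dictionary travels from a request to the response its callback receives, and the listing URL and JSON body are invented sample data shaped like the ones in the question.

```python
import json
import re


class FakeRequest:
    """Stand-in for scrapy.Request: a URL, a callback and a meta dict."""

    def __init__(self, url, callback, meta=None):
        self.url = url
        self.callback = callback
        self.meta = dict(meta or {})


class FakeResponse:
    """Stand-in for a response: exposes the meta of the request behind it."""

    def __init__(self, request, body):
        self.url = request.url
        self.body = body
        self.meta = request.meta


def parse_dir_contents(page_url):
    # Build the partial item, then attach it to the follow-up request.
    item = {"link": page_url, "propid": re.search(r"\d+", page_url).group()}
    url = "https://www.this-site.com.au/something?listing_id=" + item["propid"]
    yield FakeRequest(url, callback=parse_other_content, meta={"item": item})


def parse_other_content(response):
    # Recover the item that travelled with the request and complete it.
    item = response.meta["item"]
    item["rate_detail"] = json.loads(response.body)["this"][0]["that"]
    yield item


def run(page_url, fake_body):
    """Drive both callbacks by hand, roughly as the engine would."""
    results = []
    for request in parse_dir_contents(page_url):
        response = FakeResponse(request, fake_body)
        results.extend(request.callback(response))
    return results


items = run(
    "https://www.this-site.com.au/listing/12345",  # invented sample URL
    b'{"this": [{"that": "120 per night"}]}',      # invented sample body
)
```

The point of the pattern is that each request carries its own item, so the second callback never has to guess which item it belongs to.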
Answer 1 (score: 0)

You can either make item visible to parse_other_content() by changing it to self.item, or send it to the function as a parameter. (The first might be easier.)
For the first solution, just add self. to every reference to the item variable. This makes it visible to the whole class.

    def parse_dir_contents(self, response):
        for sel in response.xpath('//h1[@itemprop="name"]'):
            self.item = thisItem()
            self.item['title'] = sel.xpath('text()').extract()
            self.item['rate'] = response.xpath('//div[@class="rate"]/div/span/text()').extract()
            so = re.search( r'\d+', response.url)
            propID = so.group()
            self.item['propid'] = propID
            self.item['link'] = response.url
            yield scrapy.Request("https://www.this-site.com.au/something?listing_id="+propID,callback=self.parse_other_content)
            #yield item

    def parse_other_content(self, reponse):
        sel = json.loads(reponse.body)
        self.item['rate_detail'] = sel["this"][0]["that"]
        yield self.item
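
One thing worth checking before relying on self.item: Scrapy keeps several requests in flight at once, so parse_dir_contents may run for a second page before parse_other_content has run for the first, and a single shared attribute would then be overwritten. The sketch below imitates that ordering in plain Python with no Scrapy involved; SharedStateSpider, first_callback and second_callback are hypothetical names made up for the illustration, and the property IDs are invented.

```python
class SharedStateSpider:
    """Toy model of a spider that keeps the current item on self."""

    def first_callback(self, propid):
        # Mirrors parse_dir_contents: start a new item on the instance.
        self.item = {"propid": propid}
        return propid  # stands in for the follow-up request

    def second_callback(self, propid, detail):
        # Mirrors parse_other_content: finish whatever self.item is now.
        self.item["rate_detail"] = detail
        return dict(self.item)


spider = SharedStateSpider()

# Both first-stage callbacks run before either second-stage callback,
# which is an ordering that concurrent requests can produce.
pending = [spider.first_callback(p) for p in ("111", "222")]
results = [spider.second_callback(p, "detail for " + p) for p in pending]

# The item started for "111" is gone: both results carry propid "222".
mismatched = [r for r in results if r["rate_detail"] != "detail for " + r["propid"]]
```

In this toy run the detail fetched for property 111 ends up on the item for property 222, which is the kind of mix-up that passing the item along with each request avoids.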