I am using Scrapy (1.1.2) to build a spider that scrapes products. I got it working and it collects enough data, but now I want each item to issue a new request to its product page and scrape additional data there, such as the product description.
First, here is my latest working code.
spider.py (excerpt)
import scrapy
from scrapy import Spider

# ProductLoader and ProductMetaLoader are item loaders defined elsewhere in the
# project (e.g. in items.py); their import is omitted from this excerpt.


class ProductScrapSpider(Spider):
    name = "dmoz"
    allowed_domains = ["example.com"]
    start_urls = [
        "http://www.example.com/index.php?id_category=24",
        # ...
    ]

    def parse(self, response):
        for sel in response.xpath("a long string"):
            mainloader = ProductLoader(selector=sel)
            mainloader.add_value('category', 'Category Name')
            mainloader.add_value('meta', self.get_meta(sel))
            # more data
            yield mainloader.load_item()

        # Follows the pagination
        next_page = response.css("li#pagination_next a::attr('href')")
        if next_page:
            url = response.urljoin(next_page[0].extract())
            yield scrapy.Request(url, self.parse)

    def get_meta(self, response):
        metaloader = ProductMetaLoader(selector=response)
        metaloader.add_value('store', "Store name")
        # more data
        yield metaloader.load_item()
Output
[
    {
        "category": "Category Name",
        "price": 220000,
        "meta": {
            "baseURL": "",
            "name": "",
            "store": "Store Name"
        },
        "reference": "100XXX100"
    },
    ...
]
After reading the documentation and some answers here, I changed the get_meta method and added a get_product_page callback for the new request:
new_spider.py (excerpt)
def get_meta(self, response):
    metaloader = ProductMetaLoader(selector=response)
    metaloader.add_value('store', "Store name")
    # more data
    items = metaloader.load_item()
    new_request = scrapy.Request(items['url'], callback=self.get_product_page)
    # Passing the metadata
    new_request.meta['item'] = items
    # The source of the problem
    yield new_request

def get_product_page(self, response):
    sel = response.selector.css('.product_description')
    items = response.meta['item']
    new_meta = items
    new_meta.update({'product_page': sel[0].extract()})
    return new_meta
Expected output
[
    {
        "category": "Category Name",
        "price": 220000,
        "meta": {
            "baseURL": "",
            "name": "",
            "store": "Store Name",
            "product_page": "<div> [...] </div>"
        },
        "reference": "100XXX100"
    },
    ...
]
Error
TypeError: 'Request' object is not iterable
I couldn't find anything about this error, so please help me figure it out.
Thanks a lot.
Answer 0 (score: 1)
The error you're getting (TypeError: 'Request' object is not iterable) happens because a Request instance ends up inside one of the item's fields (via the updated get_meta method), and the feed exporter cannot serialize it.
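Concretely (assuming the parse method from your first spider was left unchanged), this is the line that feeds whatever get_meta produces straight into the item, so the Request built inside the new get_meta is stored in the 'meta' field instead of being scheduled as a request:

# From the question's parse() loop: with the new get_meta(), the generator it
# returns yields a Request, which ends up inside the item's 'meta' field and
# cannot be serialized when the item is exported.
mainloader.add_value('meta', self.get_meta(sel))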
You need to hand the get-meta request back to Scrapy and pass the partially parsed item along through the request's meta parameter. Below is an example of the updated parse method and a new parse_get_meta method:
def parse(self, response):
    for sel in response.xpath("a long string"):
        mainloader = ProductLoader(selector=sel)
        mainloader.add_value('category', 'Category Name')
        #mainloader.add_value('meta', self.get_meta(sel))
        # more data
        item = mainloader.load_item()

        # get_meta() must now build and *return* the Request (not yield it)
        get_meta_req = self.get_meta(sel)
        # Pass the partially filled item along with the request
        get_meta_req.meta['item'] = item
        yield get_meta_req.replace(callback=self.parse_get_meta)

def parse_get_meta(self, response):
    """Parses a get meta response"""
    item = response.meta['item']
    # Parse the response and load the data here, e.g. item['foo'] = bar

    # Finally return the item
    return item
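For completeness, here is how the remaining pieces could look — a minimal sketch, not tested against your site. It assumes get_meta keeps building the request from the loaded meta item (using the 'url' field from your new_spider.py) but returns it instead of yielding it, and that the description still sits under .product_description as in your get_product_page. The meta_item variable and the 'meta_item' request key are purely illustrative names:

def get_meta(self, response):
    metaloader = ProductMetaLoader(selector=response)
    metaloader.add_value('store', "Store name")
    # more data
    meta_item = metaloader.load_item()
    # Return (not yield) the request; parse() attaches the main item and the callback
    request = scrapy.Request(meta_item['url'])
    request.meta['meta_item'] = meta_item
    return request

def parse_get_meta(self, response):
    """Merges the product page HTML into the item's 'meta' field."""
    item = response.meta['item']
    meta_item = dict(response.meta['meta_item'])
    meta_item['product_page'] = response.css('.product_description').extract_first()
    item['meta'] = meta_item
    return item

With this, parse() ends up yielding one item per product whose meta field matches your expected output, once Scrapy has visited the corresponding product page.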