I have the following situation:
I want to scrape product details from a specific product detail page (page A) that describes the product. This page contains a link to a page listing this product's sellers (page B), and each seller links to yet another page containing the seller's details (page C). Here is an example of the pattern:
Page A:
Page B:
Page C:
This is the JSON I would like to end up with after scraping:
{
    "product_name": "product1",
    "sellers": [
        {
            "seller_name": "seller1",
            "seller_price": 100,
            "seller_address": "address1"
        },
        (...)
    ]
}
What I have tried: passing the product information from the parse method to a second parse method through the meta object. That works fine across 2 levels, but I have 3, and I want a single item.
Is this possible in scrapy?
EDIT:
As requested, here is a scaled-down example of what I am trying to do. I know it won't work as expected, but I can't figure out how to make it return just 1 combined object:
import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'examplespider'
    allowed_domains = ["example.com"]
    start_urls = [
        'http://example.com/products/product1'
    ]

    def parse(self, response):
        # assume this object was obtained after
        # some xpath processing
        product_name = 'product1'
        link_to_sellers = 'http://example.com/products/product1/sellers'

        yield scrapy.Request(link_to_sellers, callback=self.parse_sellers, meta={
            'product': {
                'product_name': product_name,
                'sellers': []
            }
        })

    def parse_sellers(self, response):
        product = response.meta['product']

        # assume this object was obtained after
        # some xpath processing
        sellers = [
            {
                'seller_name': 'seller1',
                'seller_price': 100,
                'seller_detail_url': 'http://example.com/sellers/seller1',
            },
            {
                'seller_name': 'seller2',
                'seller_price': 100,
                'seller_detail_url': 'http://example.com/sellers/seller2',
            },
            {
                'seller_name': 'seller3',
                'seller_price': 100,
                'seller_detail_url': 'http://example.com/sellers/seller3',
            }
        ]

        for seller in sellers:
            product['sellers'].append(seller)
            yield scrapy.Request(seller['seller_detail_url'], callback=self.parse_seller, meta={'seller': seller})

    def parse_seller(self, response):
        seller = response.meta['seller']

        # assume this object was obtained after
        # some xpath processing
        seller_address = 'seller_address1'
        seller['seller_address'] = seller_address

        yield seller
Answer 0 (score: 0)
You need to change the logic a bit, so that you query one seller's address at a time, and only once that finishes do you query the next seller.
def parse_sellers(self, response):
    meta = response.meta

    # assume this object was obtained after
    # some xpath processing
    sellers = [
        {
            'seller_name': 'seller1',
            'seller_price': 100,
            'seller_detail_url': 'http://example.com/sellers/seller1',
        },
        {
            'seller_name': 'seller2',
            'seller_price': 100,
            'seller_detail_url': 'http://example.com/sellers/seller2',
        },
        {
            'seller_name': 'seller3',
            'seller_price': 100,
            'seller_detail_url': 'http://example.com/sellers/seller3',
        }
    ]

    if sellers:
        current_seller = sellers.pop()
        meta['pending_sellers'] = sellers
        meta['current_seller'] = current_seller
        yield scrapy.Request(current_seller['seller_detail_url'], callback=self.parse_seller, meta=meta)
    else:
        yield meta['product']

    # for seller in sellers:
    #     product['sellers'].append(seller)
    #     yield scrapy.Request(seller['seller_detail_url'], callback=self.parse_seller, meta={'seller': seller})

def parse_seller(self, response):
    meta = response.meta
    current_seller = meta['current_seller']
    sellers = meta['pending_sellers']

    # assume this object was obtained after
    # some xpath processing
    seller_address = 'seller_address1'

    current_seller['seller_address'] = seller_address
    meta['product']['sellers'].append(current_seller)

    if sellers:
        current_seller = sellers.pop()
        meta['pending_sellers'] = sellers
        meta['current_seller'] = current_seller
        yield scrapy.Request(current_seller['seller_detail_url'], callback=self.parse_seller, meta=meta)
    else:
        yield meta['product']
But this is not a good approach, because a seller may sell multiple items. So when you reach the same seller again through another product, your request for the seller's address will be rejected by the duplicate filter. You can work around that by adding dont_filter=True to the request, but it would mean a lot of unnecessary hits on the website.
So you need to add DB handling directly in your code, to check whether you already have a seller's details: if you do, use them; if not, fetch them.
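A minimal sketch of that check, using a plain in-memory dict on the spider instead of a real database (the seller_cache attribute and the request_seller helper are hypothetical names, not from the code above):

import scrapy


class CachingSellerSpider(scrapy.Spider):
    name = 'cachingsellerspider'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # hypothetical cache mapping seller_detail_url -> seller details;
        # a real setup would back this with a database so it persists
        self.seller_cache = {}

    def request_seller(self, seller, meta):
        """Return a Request for page C, or None if the details are cached."""
        cached = self.seller_cache.get(seller['seller_detail_url'])
        if cached is not None:
            # reuse the stored details instead of hitting the site again
            seller['seller_address'] = cached['seller_address']
            meta['product']['sellers'].append(seller)
            return None
        return scrapy.Request(seller['seller_detail_url'],
                              callback=self.parse_seller, meta=meta)

    def parse_seller(self, response):
        seller = response.meta['current_seller']
        seller['seller_address'] = 'seller_address1'  # via xpath in reality
        # remember the details so later products skip this request
        self.seller_cache[response.url] = seller
        response.meta['product']['sellers'].append(seller)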
Answer 1 (score: 0)
I think a pipeline can help here.
Assume the yielded seller takes the following format (this can be done with some trivial modifications to your code):
seller = {
    'product_name': 'product1',
    'seller': {
        'seller_name': 'seller1',
        'seller_price': 100,
        'seller_address': 'address1',
    }
}
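A sketch of one such modification to parse_seller (assuming product_name was forwarded in meta by the earlier callbacks; that key is not in the original code):

def parse_seller(self, response):
    seller = response.meta['seller']

    # assume this was obtained after some xpath processing
    seller['seller_address'] = 'seller_address1'

    # wrap the seller in the shape the pipeline expects;
    # 'product_name' is assumed to have been passed along in meta
    yield {
        'product_name': response.meta['product_name'],
        'seller': {
            'seller_name': seller['seller_name'],
            'seller_price': seller['seller_price'],
            'seller_address': seller['seller_address'],
        }
    }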
The following pipeline collects sellers by their product_name and exports them to a file named 'items.jl' once the crawl finishes (note that this is only a sketch of the idea, so it is not guaranteed to work):
import json


class CollectorPipeline(object):
    def __init__(self):
        self.collection = {}

    def open_spider(self, spider):
        self.collection = {}

    def close_spider(self, spider):
        with open("items.jl", "w") as fp:
            for _, product in self.collection.items():
                fp.write(json.dumps(product))
                fp.write("\n")

    def process_item(self, item, spider):
        product = self.collection.get(item["product_name"], dict())
        product["product_name"] = item["product_name"]
        sellers = product.get("sellers", list())
        sellers.append(item["seller"])
        product["sellers"] = sellers
        # store back so close_spider sees the accumulated sellers
        self.collection[item["product_name"]] = product
        return item
By the way, you need to modify settings.py to make the pipeline take effect, as described in the scrapy documentation.
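For example, assuming the pipeline class lives in myproject/pipelines.py (the project name is a placeholder):

# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.CollectorPipeline': 300,
}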