Suppose I have a product with the following JSON structure, i.e. a single product with several links to scrape.
[
  {
    "id": "888",
    "suppliers": {
      "shop1": {
        "url": "http://www.example1.com./item1",
        "price": "19.99"
      },
      "shop2": {
        "url": "http://www.example2.com./item2",
        "price": "29.95"
      }
    }
  }
]
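I am using Scrapy to crawl both sites and update the prices. Everything works, except that Scrapy returns two separate results.
How can I "combine" the results from the two links, i.e. form a single object like the JSON structure above, in one row?
Here is the code I am currently using. Any help would be appreciated.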
from scrapy import Spider

# ProductItem is the asker's own Item class (its definition is not shown).

class ProductSpider(Spider):
    name = "productspider"
    allowed_domains = ['example1.com', 'example2.com']
    start_urls = ['http://www.example1.com./item1', 'http://www.example2.com./item2']

    def parse(self, response):
        item = ProductItem()
        item['id'] = '888'
        item['suppliers'] = {'shop1': '', 'shop2': ''}
        if response.meta['download_slot'] == 'www.example1.com':
            parse_example1_page()  # and assign it to item shop1
        if response.meta['download_slot'] == 'www.example2.com':
            parse_example2_page()  # and assign it to item shop2
        yield item
Answer 0 (score: 0)
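The output you want is a reorganization of your data. Trying to combine the extraction and the processing steps would be brittle and probably hard to follow. The fetched data may even be useful in its raw form (you could combine different crawls, apply different processing, and so on).
Consider splitting the task in two: crawl to fetch the data, then process it to reformat it. You already have the scraping part; here is an example of the post-processing. I used a simple one-record-per-line JSON format, which has the advantage that the whole (raw) dataset does not need to be loaded into memory. You can use whatever intermediate storage you prefer.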
import json
from collections import defaultdict

# the (fake) fetching
scrapy_data = [
    {"id": "888", "url": "blah.com/888", "shop": "shop1", "price": 99.2},
    {"id": "3", "url": "blah.com/3", "shop": "shop1", "price": 33.1},
    {"id": "888", "url": "foo.com/888", "shop": "shop2", "price": 423.0},
    {"id": "42", "url": "foo.com/42", "shop": "shop2", "price": 1.20},
]

with open('records.json', 'w') as fh:
    # pretend the data items are coming from scrapy
    for item in scrapy_data:
        json.dump(item, fh)
        fh.write("\n")

# the (real) processing: read one record per line and group by product id
products = defaultdict(dict)
with open('records.json') as fh:
    for line in fh:
        item = json.loads(line)
        pid, url, shop, price = item["id"], item["url"], item["shop"], item["price"]
        products[pid][shop] = {"url": url, "price": price}

# build one object per product id
collated = [{"id": key, "suppliers": val} for key, val in products.items()]
print(json.dumps(collated, sort_keys=True, indent=2))
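The output looks like this (the order of the products in the list may differ depending on the Python version):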
[
  {
    "id": "888",
    "suppliers": {
      "shop1": {
        "price": 99.2,
        "url": "blah.com/888"
      },
      "shop2": {
        "price": 423.0,
        "url": "foo.com/888"
      }
    }
  },
  {
    "id": "3",
    "suppliers": {
      "shop1": {
        "price": 33.1,
        "url": "blah.com/3"
      }
    }
  },
  {
    "id": "42",
    "suppliers": {
      "shop2": {
        "price": 1.2,
        "url": "foo.com/42"
      }
    }
  }
]
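To tie this back to the question, here is a minimal sketch that wraps the collation step in a function and applies it to records shaped like the question's product 888. The function name `collate` and the flat record layout (`id`, `shop`, `url`, `price`) are assumptions carried over from the example above, not something Scrapy provides; each shop's parse callback would have to yield items in that shape.

```python
import io
import json
from collections import defaultdict


def collate(lines):
    """Group flat JSON-lines records into one object per product id.

    `lines` is any iterable of JSON strings (an open file works), so the
    raw records are read one at a time rather than loaded all at once.
    """
    products = defaultdict(dict)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        item = json.loads(line)
        products[item["id"]][item["shop"]] = {
            "url": item["url"],
            "price": item["price"],
        }
    return [{"id": pid, "suppliers": shops} for pid, shops in products.items()]


# Hypothetical records, shaped as if each shop's page had been scraped separately.
raw = io.StringIO(
    '{"id": "888", "shop": "shop1", "url": "http://www.example1.com./item1", "price": "19.99"}\n'
    '{"id": "888", "shop": "shop2", "url": "http://www.example2.com./item2", "price": "29.95"}\n'
)

collated = collate(raw)
for product in collated:
    # One combined object per output line.
    print(json.dumps(product, sort_keys=True))
```

Because `collate` accepts any iterable of lines, the same function can be pointed at the `records.json` file written above by passing the open file handle instead of the in-memory `raw` buffer.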