How to combine parse() results in Scrapy

Date: 2015-02-10 01:03:45

Tags: python scrapy

Suppose a product has the following JSON structure, i.e. one product with multiple links to scrape.

[
  {
    "id": "888",
    "suppliers": {
      "shop1": {
        "url": "http://www.example1.com./item1",
        "price": "19.99"
      },
      "shop2": {
        "url": "http://www.example2.com./item2",
        "price": "29.95"
      }
    }
  }
]

I am using Scrapy to crawl both sites and update the prices. Everything works, except that Scrapy returns the two results separately.

How can I "combine" the results from the two links, i.e. produce a single object in one record with the JSON structure shown above?

This is the code snippet I am currently using. Any help would be appreciated.

class ProductSpider(Spider):
    name = "productspider"
    allowed_domains = ['example1.com', 'example2.com']
    start_urls = ['http://www.example1.com./item1', 'http://www.example2.com./item2']

    def parse(self, response):
        item = ProductItem()
        item['id'] = '888'
        item['suppliers'] = {'shop1': '', 'shop2': ''}

        if response.meta['download_slot'] == 'www.example1.com':
            parse_example1_page() # and assign it to item shop1

        if response.meta['download_slot'] == 'www.example2.com':
            parse_example2_page() # and assign it to item shop2

        yield item

1 Answer:

Answer 0 (score: 0)

The output you want is simply your scraped data reorganized. Trying to combine the extraction and the processing steps would be fragile and hard to follow. The raw scraped data may even be useful in its original form (you can combine different crawls, run different processing over it, and so on). Consider splitting the task in two: scrape to collect the data, then process it into the shape you want. You already have the scraping part; here is an example of the post-processing. I used a simple one-JSON-record-per-line format, which has the advantage that the whole (raw) dataset never needs to be loaded into memory at once. You can use whatever intermediate storage you prefer.

import json
from collections import defaultdict

# the (fake) fetching
scrapy_data = [
    {"id": "888", "url": "blah.com/888", "shop": "shop1", "price": 99.2},
    {"id": "3",   "url": "blah.com/3",   "shop": "shop1", "price": 33.1},
    {"id": "888", "url": "foo.com/888",  "shop": "shop2", "price": 423.0},
    {"id": "42",  "url": "foo.com/42",   "shop": "shop2", "price": 1.20},
]

with open('records.json','w') as fh:
    # pretend the data items are coming from scrapy
    for item in scrapy_data:
        json.dump(item, fh)
        fh.write("\n")


# the (real) processing
products = defaultdict(dict)

with open('records.json') as fh:
    for line in fh:
        item = json.loads(line)
        pid, url, shop, price = item["id"], item["url"], item["shop"], item["price"]
        products[pid][shop] = {"url": url, "price":price}

# sorted() makes the output order deterministic; use .items() (Python 3)
collated = [{"id": key, "suppliers": val} for key, val in sorted(products.items())]

print(json.dumps(collated, sort_keys=True, indent=2))

The output looks like this:

[
  {
    "id": "3",
    "suppliers": {
      "shop1": {
        "price": 33.1,
        "url": "blah.com/3"
      }
    }
  },
  {
    "id": "42",
    "suppliers": {
      "shop2": {
        "price": 1.2,
        "url": "foo.com/42"
      }
    }
  },
  {
    "id": "888",
    "suppliers": {
      "shop1": {
        "price": 99.2,
        "url": "blah.com/888"
      },
      "shop2": {
        "price": 423.0,
        "url": "foo.com/888"
      }
    }
  }
]
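
On the Scrapy side, the "fake fetching" block above can be replaced by an item pipeline that appends each scraped item to the intermediate file as one JSON object per line. Below is a minimal sketch (the class name and the `records.json` path are my own choices, not part of the original code); it assumes items are dicts or dict-like Scrapy `Item` objects:

```python
import json

class JsonLinesWriterPipeline:
    """Append each scraped item to records.json, one JSON object per line."""

    def open_spider(self, spider):
        self.file = open('records.json', 'w')

    def process_item(self, item, spider):
        # dict() works for plain dicts and for Scrapy Item objects alike
        self.file.write(json.dumps(dict(item)) + "\n")
        return item

    def close_spider(self, spider):
        self.file.close()
```

Enable it via the `ITEM_PIPELINES` setting in `settings.py`. Note that Scrapy's built-in feed exports can also write JSON Lines directly (e.g. `scrapy crawl productspider -o records.jl`), in which case no custom pipeline is needed.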