Question

I have a spider that crawls a list of URLs, for example:

import scrapy

# MyItem is assumed to be a scrapy.Item subclass defined in the project's items module.

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/1.html',
        'http://www.example.com/2.html',
        'http://www.example.com/3.html',
    ]

    def parse(self, response):
        # Collect one item per <li> in the feed list of this page.
        items = []
        records = response.xpath('//*[@id="feed-main-list"]/li')
        for rec in records:
            item = MyItem()
            item['spiderUrl'] = response.request.url
            item['url'] = rec.xpath('.//*[@class="feed-block-title"]/a/@href').extract_first().strip()
            item['title'] = rec.xpath('string(.//*[@class="feed-block-title"])').extract_first().strip()
            item['lastUpdate'] = 'success'
            items.append(item)
        return items

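For each URL, I need to process its items together (analyse the data, send an email when something happens) and do so as soon as possible. I chose a pipeline for this, but a pipeline only receives the items one at a time. So I tried packing the items into a single container item in the spider. In the spider: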

container = ContainerItem()
container['url'] = response.request.url
container['itemList'] = items  # same key the pipeline reads
return [container]

And in the pipeline:

def process_item(self, item, spider):
    n = len(item['itemList'])
    for i in item['itemList']:
        record = dict(i)  # separate name so the container item is not shadowed
        ...

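So, my questions are:
1. Is this a good way to implement my requirement?
2. Packing a list of items into a container item looks very ugly. Is there a more Scrapy-style way to do it?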

Thanks!

Answer 1

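I think the most logical solution is to combine all the items into one. Nesting inside dictionaries is a very common pattern. It may look complicated and messy, but as long as you don't end up something like 10 levels deep, it really is the best and easiest option.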

To do that, just wrap the items list in a dictionary, for example:

def parse(self, response):
    items = []
    records = response.xpath('//*[@id="feed-main-list"]/li')
    for rec in records:
        item = MyItem()
        item['spiderUrl'] = response.request.url
        item['url']     = rec.xpath('.//*[@class="feed-block-title"]/a/@href').extract_first().strip()
        item['title']   = rec.xpath('string(.//*[@class="feed-block-title"])').extract_first().strip()
        item['lastUpdate'] = 'success'
        items.append(item)
    return {'items': items}

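Now your pipeline will receive all the items from a page as a single item, which you can unpack, sort, and do whatever you want with.
This approach is quite common in Scrapy, and it even works with ItemLoader, if you use loaders instead of plain scrapy.Item objects, which, to clarify, are just slightly modified Python dictionaries!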
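
For the receiving side, a pipeline could unpack the nested list roughly as follows. This is only a minimal sketch: the class name `BatchAnalysisPipeline`, the `alerts` attribute, and the rule "alert when a record's `lastUpdate` is not `'success'`" are assumptions made for illustration, and the actual email sending mentioned in the question is left as a placeholder.

```python
class BatchAnalysisPipeline:
    """Hypothetical pipeline that handles all records from one page together."""

    def __init__(self):
        # (page_url, number_of_flagged_records) pairs; a stand-in for real alerting.
        self.alerts = []

    def process_item(self, item, spider):
        # The spider returns {'items': [...]}, so unpack the nested list here.
        records = [dict(rec) for rec in item.get("items", [])]
        if not records:
            return item

        # Every record from one response carries the same page URL.
        page_url = records[0].get("spiderUrl")

        # Illustrative analysis step (an assumption, not from the question):
        # flag records whose status is anything other than 'success'.
        flagged = [r for r in records if r.get("lastUpdate") != "success"]
        if flagged:
            # Placeholder: this is where an email could be sent.
            self.alerts.append((page_url, len(flagged)))

        # Return the item so any later pipelines still receive it.
        return item
```

Like any item pipeline, a class of this shape would be enabled through the `ITEM_PIPELINES` setting. Because each response yields exactly one container item, the analysis runs once per URL as soon as that page has been parsed.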

Looking for a better way to process all items from one URL
