Question

我正在尝试研究如何让scrapy返回嵌套数据结构，因为我可以找到的唯一例子涉及扁平结构。

我正在尝试搜索一个由一系列主题组成的论坛，每个帖子都有一个帖子列表。

我可以成功地抓取线程列表和帖子列表，但我不知道如何将所有帖子附加到帖子中，而不是全部混在一起。

最后，我的目标是输出：

<thread id="1">
    <post>Post 1</post>
    <post>Post 2</post>
</thread>
<thread id="2">
    <post>Post A</post>
    <post>Post B</post>
</thread>

如果我这样做：

def parse(self, response):
    # For each thread on this page
    yield scrapy.Request(thread_url, self.parse_posts)

def parse_posts(self, response):
    # For each post on this page
    yield {'content': ... }

然后我只获得所有帖子的列表，而不将它们排列成线程。这样的事情当然不起作用：

def parse(self, response):
    # For each thread on this page
    yield {
        'id': ...,
        'posts': scrapy.Request(thread_url, self.parse_posts)
    }

所以我不确定如何让“孩子”请求进入“父”对象。

Answer 1

就像JimmyZhang所说的那样获得这种关联，这正是meta的用途。在产生请求之前从线程列表页面中解析ID，通过meta关键字将该线程ID传递给请求，然后在处理帖子时访问ID。

def parse(self, response):
    # For each thread on this page
    thread_id = sel.xpath('thread_id_getter_xpath').extract()
    yield scrapy.Request(thread_url, callback=self.parse_posts,
                         meta={'thread_id': thread_id})

def parse_posts(self, response):
    # For each post on this page
    thread_id = response.meta['thread_id'])
    yield {'thread_id': thread_id, 'content': ... }

此时，项目已关联。如何将数据编译为分层格式完全取决于您，并且取决于您的需求。例如，您可以编写一个管道来在字典中编译它并在爬网结束时输出它。

def process_item(self, item, spider)
    # Assume self.forum is an empty dict at initialization 
    self.forum.setdefault(item.thread_id, [])
    self.forum[item.thread_id].append(['post': item.post_id, 
                                       'content': item.content])

def close_spider(self, spider)
    # Do something with self.forum, like output it as XML or JSON
    # ... or just print it to the stdout.
    print self.forum

或者您可以逐步编译XML树。或者将每个项目序列化为JSON字符串并逐行转储到文件中。或者随时添加项目到数据库。或者你需要的其他任何东西。

Answer 2

您可以使用元数据。

首先：

yield : scrapy.Request(thread_url, self.parse_posts,'meta'={'thread_id' : id})

其次，定义一个线程项：

class thread_item(Item):
    thread_id = Field()
    posts = Field()

第三，在parse_posts中获取thread_id：

thread_id = response.meta['thread_id']
# parse posts content, construct thread item
yield item

第四，编写管道，并输出线程项。

从蜘蛛返回嵌套结构

2 个答案: