How do I merge results from a target page into the current page in Scrapy?

Time: 2011-12-11 21:38:08

Tags: web-scraping scrapy

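In Scrapy, how do I get a link from one page, follow that link, get more information from the linked page, and merge it with some data from the first page?

Thanks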

4 answers:

Answer 0 (score: 14)

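Partially fill in your item on the first page and put it in the meta of your request. When the callback for the next page is called, it can take the partially filled item out of the meta, put more data into it, and then return it.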
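A minimal sketch of that hand-off, written with only the standard library so it runs without Scrapy installed. `FakeRequest`, `FakeResponse`, `parse_first_page`, `parse_second_page`, and `run` are hypothetical stand-ins made up for illustration; they are not Scrapy's API, and a plain dict plays the role of the item. The URLs are the example ones used in the other answers.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict


@dataclass
class FakeRequest:
    """Hypothetical stand-in for a request: a URL, a callback, and a meta dict."""
    url: str
    callback: Callable[["FakeResponse"], Any]
    meta: Dict[str, Any] = field(default_factory=dict)


@dataclass
class FakeResponse:
    """Hypothetical stand-in for a response; it carries the request's meta."""
    url: str
    meta: Dict[str, Any]


def parse_first_page(response):
    # Fill in only what the first page can provide.
    item = {"main_url": response.url}
    # Attach the partial item to the follow-up request.
    return FakeRequest(
        url="http://www.example.com/some_page.html",
        callback=parse_second_page,
        meta={"item": item},
    )


def parse_second_page(response):
    # Pick the partial item back up and complete it.
    item = response.meta["item"]
    item["other_url"] = response.url
    return item


def run(start_url):
    """Toy driver: keep following requests until a callback returns an item."""
    result = parse_first_page(FakeResponse(url=start_url, meta={}))
    while isinstance(result, FakeRequest):
        result = result.callback(FakeResponse(url=result.url, meta=result.meta))
    return result


merged = run("http://www.example.com/main_page.html")
print(merged)
```

The point of the sketch is only the data flow: the first callback never returns the item itself, it returns a request that carries the item, and the item is emitted once by the last callback in the chain.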

Answer 1 (score: 7)

For more information on passing meta data with Request objects, see this section of the documentation:

http://readthedocs.org/docs/scrapy/en/latest/topics/request-response.html#passing-additional-data-to-callback-functions

This question is also related to: Scrapy: Follow link to get additional Item data?

Answer 2 (score: 4)

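An example from the Scrapy documentation:

# Both functions are methods of a spider class; MyItem is an Item subclass defined elsewhere.
def parse_page1(self, response):
    item = MyItem()
    item['main_url'] = response.url
    request = scrapy.Request("http://www.example.com/some_page.html",
                             callback=self.parse_page2)
    # Carry the partially filled item over to the next callback.
    request.meta['item'] = item
    return request

def parse_page2(self, response):
    # Retrieve the item started in parse_page1 and finish it.
    item = response.meta['item']
    item['other_url'] = response.url
    return item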

Answer 3 (score: 2)

A little explanation of the Scrapy documentation code:

def start_requests(self):
    yield scrapy.Request("http://www.example.com/main_page.html", callback=self.parse_page1)

def parse_page1(self, response):
    item = MyItem()
    item['main_url'] = response.url  # response.url is http://www.example.com/main_page.html
    request = scrapy.Request("http://www.example.com/some_page.html", callback=self.parse_page2)
    request.meta['my_meta_item'] = item  # pass the item along in the meta dictionary
    # Alternatively, pass meta directly to the constructor:
    # request = scrapy.Request("http://www.example.com/some_page.html", meta={'my_meta_item': item}, callback=self.parse_page2)
    return request

def parse_page2(self, response):
    item = response.meta['my_meta_item']
    item['other_url'] = response.url  # response.url is http://www.example.com/some_page.html
    return item