Scrapy: connecting different items when yielding

Asked: 2017-01-28 19:48:42

Tags: scrapy, scrapy-pipeline

I am scraping a news site. Each news article has its content plus many comments, so I have two items: one for the article content and one for the comments. The problem is that the content and the comments come from different requests, so their items are yielded separately. I want an article's content and its comments to be yielded or returned together as one unit. The timing or order in the pipelines does not matter to me.

In the items file:

class NewsPageItem(scrapy.Item):
    title = scrapy.Field()
    date = scrapy.Field()
    hour = scrapy.Field()
    image = scrapy.Field()
    image_url = scrapy.Field()
    top_content = scrapy.Field()
    parag = scrapy.Field()
    #comments = scrapy.Field()
    comments_count = scrapy.Field()

class CommentsItem(scrapy.Item):
    id_ = scrapy.Field()
    username = scrapy.Field()
    firstname = scrapy.Field()
    lastname = scrapy.Field()
    email = scrapy.Field()
    ip = scrapy.Field()
    userid = scrapy.Field()
    date = scrapy.Field()
    comment_text = scrapy.Field()
    comment_type_id = scrapy.Field()
    object_id = scrapy.Field()
    yes = scrapy.Field()
    no = scrapy.Field()

In the spider, the article content and its comments end up unrelated to each other:

import json

import scrapy


class NewsSpider(scrapy.Spider):
    ...

    def parse(self, response):
        for nl in news_links:
            # The article page and its comments are fetched by two independent
            # requests, so the resulting items end up disconnected.
            yield scrapy.Request(url=nl, callback=self.new_parse)
            yield scrapy.Request(url=url, callback=self.comment_parse)  # url: the comments API endpoint for this article

    def new_parse(self, response):
        item = NewsPageItem()
        item['title'] = response.xpath(...).extract()
        ...
        yield item

    def comment_parse(self, response):
        data = json.loads(response.body.decode('utf8'))

        for comment in data.get('data', []):
            item = CommentsItem()
            item['id_'] = comment.get('Id')
            ...
            yield item

The pipelines:

class NewsPagePipeline(object):
    def process_item(self, item, spider):
        return item

class CommentsPipeline(object):
    def process_item(self, item, spider):
        return item

How can I connect the items, or nest one inside the other when yielding?

1 Answer:

Answer 0 (score: 0)

It is better to chain the requests and pass the news item between callbacks via meta, filling in the comments along the way:

import json

import scrapy


class NewsSpider(scrapy.Spider):
    ...

    def parse(self, response):
        for nl in news_links:
            # Carry the comments URL along so the chain can continue from the article page.
            yield scrapy.Request(url=nl, callback=self.new_parse,
                                 meta={'comments_url': url})  # url: the comments API endpoint for this article

    def new_parse(self, response):
        item = NewsPageItem()
        item['title'] = response.xpath(...).extract()
        item['comments'] = []  # needs a `comments` field declared on the item
        ...
        # Do not yield the item yet: chain the comments request and pass the item in meta.
        yield scrapy.Request(response.meta['comments_url'], callback=self.comment_parse,
                             meta={'item': item})

    def comment_parse(self, response):
        data = json.loads(response.body.decode('utf8'))
        item = response.meta['item']
        for comment in data.get('data', []):
            c_item = CommentsItem()
            c_item['id_'] = comment.get('Id')
            ...
            item['comments'].append(c_item)
        # The article is yielded only after all of its comments have been attached.
        yield item
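
For this to work, the news item needs a field that can hold the comments, i.e. the `comments` field that is commented out in NewsPageItem has to be declared. Below is a minimal sketch of the adjusted item and a pipeline that receives the combined result; the comments_count computation is only an illustration of what the pipeline can now do, not part of the original code:

import scrapy


class NewsPageItem(scrapy.Item):
    title = scrapy.Field()
    date = scrapy.Field()
    hour = scrapy.Field()
    image = scrapy.Field()
    image_url = scrapy.Field()
    top_content = scrapy.Field()
    parag = scrapy.Field()
    comments = scrapy.Field()        # list of CommentsItem objects for this article
    comments_count = scrapy.Field()


class NewsPagePipeline(object):
    # By the time the item reaches the pipeline, item['comments'] already
    # contains every CommentsItem collected by comment_parse.
    def process_item(self, item, spider):
        item['comments_count'] = len(item.get('comments', []))
        return item

Because the comments are appended before the article item is yielded, a single pipeline (and a single feed export) now sees the article and its comments as one object; the separate CommentsPipeline is only needed if comments are also yielded on their own.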