How do I add scraped items to a set and fire a callback once a condition is met?

Asked: 2017-03-23 17:17:26

Tags: python scrapy

This code is supposed to add each extracted reviewId to a set (in order to skip duplicates). Then there is a check: when the set length reaches 100, a callback is executed and one long URL string containing all the IDs is passed to the main extraction function.

How can I, with built-in tools or the code I have, save all the IDs extracted in different callbacks into the same set and use it further on? The problem right now is that the length check is never triggered. Update: I believe there are two options - pass the set as meta to each callback, or somehow use an Item to achieve this - but I don't know how to do either (a sketch of the meta option follows my code below).

import scrapy
from scrapy.shell import inspect_response



class QuotesSpider(scrapy.Spider):
    name = "tripad"
    # class-level set shared by every callback, used to deduplicate review ids
    list = set()

    def start_requests(self):
        url = "https://www.tripadvisor.com/Hotel_Review-g60763-d122005-Reviews-or{}-The_New_Yorker_A_Wyndham_Hotel-New_York_City_New_York.html#REVIEWS"

        for i in range(0,500,5):
            yield scrapy.Request(url=url.format(i), callback=self.parse)

    def parse(self, response):

        for result in response.xpath('//div[contains(@id,"review_")]/@id').extract():
            if "review" in result[:8]:
                QuotesSpider.list.add(result[7:] +"%2C")
            # once 100 unique ids have been collected, build one long url from them
            if len(QuotesSpider.list) == 100:
                url = "https://www.tripadvisor.com/OverlayWidgetAjax?Mode=EXPANDED_HOTEL_REVIEWS&metaReferer=Hotel_Review&reviews="

                for i in QuotesSpider.list:
                    url += i
                yield scrapy.Request(url=url, callback=self.parse_page)

    def parse_page(self, response):
        # main extraction of the expanded reviews happens here (body omitted)
        pass
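
For reference, a minimal sketch of the first option (passing the set as meta). It assumes the pages are fetched one after another in a single request chain rather than in parallel, since meta only travels along a chain of requests; the spider name and the offset bookkeeping below are illustrative, not part of my current code:

import scrapy


class MetaQuotesSpider(scrapy.Spider):
    # hypothetical name for this sketch
    name = "tripad_meta"
    base = ("https://www.tripadvisor.com/Hotel_Review-g60763-d122005-Reviews-or{}"
            "-The_New_Yorker_A_Wyndham_Hotel-New_York_City_New_York.html#REVIEWS")

    def start_requests(self):
        # start a single chain so the set in meta reaches every callback
        yield scrapy.Request(self.base.format(0), callback=self.parse,
                             meta={'ids': set(), 'offset': 0})

    def parse(self, response):
        ids = response.meta['ids']
        for result in response.xpath('//div[contains(@id,"review_")]/@id').extract():
            ids.add(result[7:])
        offset = response.meta['offset'] + 5
        if offset < 500:
            # chain the next page, carrying the accumulated ids along
            yield scrapy.Request(self.base.format(offset), callback=self.parse,
                                 meta={'ids': ids, 'offset': offset})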

1 Answer:

Answer 0 (score: 1)

There are several ways to do this, but I would recommend splitting the spider into two parts:

A spider that collects the review IDs:

from scrapy import Spider


class CollectorSpider(Spider):
    name = 'collect_reviews'

    def parse(self, response):
        # xpath taken from the question; ids look like "review_123456"
        review_ids = [i[len('review_'):] for i in
                      response.xpath('//div[contains(@id,"review_")]/@id').extract()]
        for review_id in review_ids:
            yield {'review_id': review_id}
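
Assuming this spider is pointed at the paginated review pages (via start_urls or a start_requests like the one in the question), Scrapy's built-in feed export can write the yielded items to a file:

scrapy crawl collect_reviews -o reviews.json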

A spider that uses the collected review IDs to fetch the review contents:

import json

from scrapy import Spider, Request


class ConsumerSpider(Spider):
    name = 'consume_reviews'

    def start_requests(self):
        # self.file is a spider argument, e.g. -a file=reviews.json
        with open(self.file, 'r') as f:
            data = json.loads(f.read())
        # take the collected ids 100 at a time
        for i in range(0, len(data), 100):
            ids = [item['review_id'] for item in data[i:i + 100]]
            # make url from ids (the expanded-reviews endpoint from the question)
            url = ('https://www.tripadvisor.com/OverlayWidgetAjax'
                   '?Mode=EXPANDED_HOTEL_REVIEWS&metaReferer=Hotel_Review'
                   '&reviews=' + '%2C'.join(ids))
            yield Request(url)

    def parse(self, response):
        # crawl the batch of up to 100 reviews here
        pass
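
The consumer then reads that file through a spider argument, which Scrapy exposes as an attribute (self.file above):

scrapy crawl consume_reviews -a file=reviews.json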