This code is supposed to add each extracted reviewId to a set (in order to drop duplicates). Then there is a check: once the set's length reaches 100, a callback is executed and one long URL string containing all the IDs is passed to the main extraction function.

How can I, with built-in tools or the code I have, save all the IDs extracted in the different callbacks into one and the same set, and then use it further? The problem right now is that the length check is never triggered.

Update: I believe there are two options - pass the set as meta into every callback, or somehow achieve this with an Item - but I don't know how to do either.
import scrapy
from scrapy.shell import inspect_response


class QuotesSpider(scrapy.Spider):
    name = "tripad"
    list = set()

    def start_requests(self):
        url = "https://www.tripadvisor.com/Hotel_Review-g60763-d122005-Reviews-or{}-The_New_Yorker_A_Wyndham_Hotel-New_York_City_New_York.html#REVIEWS"
        for i in range(0, 500, 5):
            yield scrapy.Request(url=url.format(i), callback=self.parse)

    def parse(self, response):
        for result in response.xpath('//div[contains(@id,"review_")]/@id').extract():
            if "review" in result[:8]:
                QuotesSpider.list.add(result[7:] + "%2C")
        if len(QuotesSpider.list) == 100:
            url = "https://www.tripadvisor.com/OverlayWidgetAjax?Mode=EXPANDED_HOTEL_REVIEWS&metaReferer=Hotel_Review&reviews="
            for i in QuotesSpider.list:
                url += i
            yield scrapy.Request(url=url, callback=self.parse_page)
Answer (score: 1):
There are several ways to do this, but I would suggest splitting the spider into two parts:

1. A spider that collects the review IDs:
from scrapy import Spider


class CollectorSpider(Spider):
    name = 'collect_reviews'
    # paginated listing pages, reused from the question's spider
    start_urls = ['https://www.tripadvisor.com/Hotel_Review-g60763-d122005-Reviews-or{}-The_New_Yorker_A_Wyndham_Hotel-New_York_City_New_York.html#REVIEWS'.format(i) for i in range(0, 500, 5)]

    def parse(self, response):
        # extraction taken from the question's parse()
        review_ids = response.xpath('//div[contains(@id,"review_")]/@id').extract()
        for review_id in review_ids:
            yield {'review_id': review_id[7:]}  # strip the "review_" prefix
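Run the collector with Scrapy's built-in feed export so the IDs end up in a JSON file the second spider can read, e.g. scrapy crawl collect_reviews -o reviews.json.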
2. A spider that uses the collected review IDs to fetch the review content:
import json

from scrapy import Spider, Request


class ConsumerSpider(Spider):
    name = 'consume_reviews'

    def start_requests(self):
        # self.file is the path of the collector's output, passed via -a file=...
        with open(self.file, 'r') as f:
            data = json.loads(f.read())
        # batch the collected IDs 100 at a time
        for i in range(0, len(data), 100):
            ids = [item['review_id'] for item in data[i:i + 100]]
            # build the overlay URL from the question, joining the IDs with "%2C"
            url = ('https://www.tripadvisor.com/OverlayWidgetAjax?Mode=EXPANDED_HOTEL_REVIEWS'
                   '&metaReferer=Hotel_Review&reviews=' + '%2C'.join(ids))
            yield Request(url)

    def parse(self, response):
        # parse the batch of up to 100 expanded reviews here
        ...
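Kick it off with the file produced by the first spider, e.g. scrapy crawl consume_reviews -a file=reviews.json; Scrapy exposes -a arguments as spider attributes, which is where self.file above comes from. What parse() should extract from the EXPANDED_HOTEL_REVIEWS response depends on the overlay markup, so its body is left as a stub.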