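I am building a spider with Scrapy to scrape details from rottentomatoes.com. Because the search page is rendered dynamically, I use the Rotten Tomatoes API, e.g. https://www.rottentomatoes.com/api/private/v2.0/search?q=inception, to get the search results and URLs. By following those URLs with Scrapy I can extract the Tomatometer score, audience score, director, cast, and so on. However, I also want to extract all the audience reviews. The problem is that the audience reviews page (https://www.rottentomatoes.com/m/inception/reviews?type=user) uses pagination, and I cannot extract data from the next page, nor can I find a way to get the details through the API. Can anyone help me with this?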
def parseRottenDetail(self, response):
    print("Reached Tomato Parser")
    try:
        if MoviecrawlSpider.current_parse <= MoviecrawlSpider.total_results:
            items = TomatoCrawlerItem()
            MoviecrawlSpider.parse_rotten_list[MoviecrawlSpider.current_parse]['tomatometerScore'] = response.css(
                '.mop-ratings-wrap__row .mop-ratings-wrap__half .mop-ratings-wrap__percentage::text').get().strip()
            MoviecrawlSpider.parse_rotten_list[MoviecrawlSpider.current_parse][
                'tomatoAudienceScore'] = response.css(
                '.mop-ratings-wrap__row .mop-ratings-wrap__half.audience-score .mop-ratings-wrap__percentage::text').get().strip()
            MoviecrawlSpider.parse_rotten_list[MoviecrawlSpider.current_parse][
                'tomatoCriticConsensus'] = response.css('p.mop-ratings-wrap__text--concensus::text').get()
            if MoviecrawlSpider.parse_rotten_list[MoviecrawlSpider.current_parse]["type"] == "Movie":
                MoviecrawlSpider.parse_rotten_list[MoviecrawlSpider.current_parse]['Director'] = response.xpath(
                    "//ul[@class='content-meta info']/li[@class='meta-row clearfix']/div[contains(text(),'Directed By')]/../div[@class='meta-value']/a/text()").get()
            else:
                MoviecrawlSpider.parse_rotten_list[MoviecrawlSpider.current_parse]['Director'] = response.xpath(
                    "//div[@class='tv-series__series-info-castCrew']/div/span[contains(text(),'Creator')]/../a/text()").get()
            reviews_page = response.css('div.mop-audience-reviews__view-all a[href*="reviews"]::attr(href)').get()
            # .get() returns None when nothing matches, so test truthiness instead of len()
            if reviews_page:
                yield response.follow(reviews_page, callback=self.parseRottenReviews)
            else:
                for key in MoviecrawlSpider.parse_rotten_list[MoviecrawlSpider.current_parse].keys():
                    if "pageURL" not in key and "type" not in key:
                        items[key] = MoviecrawlSpider.parse_rotten_list[MoviecrawlSpider.current_parse][key]
                yield items
                if MoviecrawlSpider.current_parse <= MoviecrawlSpider.total_results:
                    MoviecrawlSpider.current_parse += 1
                    print("Parse Values are Current Parse " + str(
                        MoviecrawlSpider.current_parse) + "and Total Results " + str(MoviecrawlSpider.total_results))
                    yield response.follow(MoviecrawlSpider.parse_rotten_list[MoviecrawlSpider.current_parse]["pageURL"],
                                          callback=self.parseRottenDetail)
    except Exception as e:
        exc_type, exc_obj, exc_tb = sys.exc_info()
        print(e)
        print(exc_tb.tb_lineno)
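After this code runs I reach the reviews page, e.g. https://www.rottentomatoes.com/m/inception/reviews?type=user. From there, a Next button loads the following page through pagination. What is the right approach for extracting all the reviews?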
def parseRottenReviews(self, response):
    print("Reached Rotten Review Parser")
    items = TomatoCrawlerItem()
Answer 0 (score: 1)
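When you go to the next page, you will notice that the request uses the end cursor value of the previous page. On the first iteration you can set endCursor to an empty string. Also note that you need the movieId to request reviews; this ID can be extracted from the JSON embedded in the page's JavaScript: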
import requests
import re
import json

r = requests.get("https://www.rottentomatoes.com/m/inception/reviews?type=user")
# Raw string so that \s reaches the regex engine unchanged
data = json.loads(re.search(r'movieReview\s=\s(.*);', r.text).group(1))
movieId = data["movieId"]

def getReviews(endCursor):
    # Fetch one page of user reviews that follows the given cursor
    r = requests.get(f"https://www.rottentomatoes.com/napi/movie/{movieId}/reviews/user",
        params = {
            "direction": "next",
            "endCursor": endCursor,
            "startCursor": ""
        })
    return r.json()

reviews = []
result = {}
for i in range(0, 5):
    print(f"[{i}] request review")
    # The first request uses an empty cursor; later ones reuse the previous endCursor
    result = getReviews(result["pageInfo"]["endCursor"] if i != 0 else "")
    reviews.extend(result["reviews"])

print(reviews)
print(f"got {len(reviews)} reviews")
Note that for the first iteration you could also scrape the HTML instead.
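The loop above stops after a fixed five pages. The following is a minimal sketch of following the cursor until the API reports no further page. It assumes the `pageInfo` object carries a `hasNextPage` flag next to `endCursor`, which the snippet above does not show. The names `collect_reviews`, `fetch_page`, and `max_pages` are hypothetical; `fetch_page` stands in for `getReviews`.

```python
def collect_reviews(fetch_page, max_pages=1000):
    """Follow endCursor from page to page until no next page is reported.

    fetch_page is any callable that takes an endCursor string and returns
    the decoded JSON of one page (for example getReviews above).
    """
    reviews = []
    end_cursor = ""  # an empty cursor requests the first page
    for _ in range(max_pages):  # hard cap guards against an endless loop
        page = fetch_page(end_cursor)
        reviews.extend(page.get("reviews", []))
        page_info = page.get("pageInfo", {})
        # Assumption: hasNextPage and endCursor both live under pageInfo.
        if not page_info.get("hasNextPage") or not page_info.get("endCursor"):
            break
        end_cursor = page_info["endCursor"]
    return reviews
```

With the code above in scope, `collect_reviews(getReviews)` would gather every page. Passing the fetcher in as an argument keeps the loop easy to test without network access.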
Answer 1 (score: 0)
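Since I am using Scrapy, I was looking for a way to do this without the requests module. The approach is the same, but I found that the page https://www.rottentomatoes.com/m/inception has an object, root.RottenTomatoes.context.fandangoData, inside a <script> tag. Its "emsId" key holds the movie's ID, which is what gets passed to the API to fetch the details. Watching the Network tab on each pagination event, I realized that they use startCursor and endCursor to filter the results for each page.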
pattern = r'\broot.RottenTomatoes.context.fandangoData\s*=\s*(\{.*?\})\s*;\s*\n'
json_data = response.css('script::text').re_first(pattern)
movie_id = json.loads(json_data)["emsId"]
{SpiderClass}.movieId = movie_id
next_page = "https://www.rottentomatoes.com/napi/movie/" + movie_id + "/reviews/user?direction=next&endCursor=&startCursor="
yield response.follow(next_page, callback=self.parseRottenReviews)
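The non-greedy regex above depends on the object being followed by `;` and a newline. As an alternative sketch (a swapped-in technique, not what this answer uses), the standard library's `json.JSONDecoder.raw_decode` can parse the object starting at the assignment and ignore whatever follows it. The helper name `extract_js_object` is hypothetical, and the sketch assumes the embedded value is valid JSON.

```python
import json
import re


def extract_js_object(script_text, variable_name):
    """Return the JSON value assigned to variable_name, or None if absent."""
    # Locate "<variable_name> = " and start decoding right after it.
    match = re.search(re.escape(variable_name) + r"\s*=\s*", script_text)
    if match is None:
        return None
    try:
        # raw_decode parses one JSON value and tolerates trailing text.
        obj, _end = json.JSONDecoder().raw_decode(script_text, match.end())
    except json.JSONDecodeError:
        return None
    return obj
```

In the spider, the script text could come from `response.css('script::text').getall()`, trying each entry until one returns a value.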
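For the first iteration you can leave the startCursor and endCursor parameters empty. Now you are in the parse function. You can take the startCursor and endCursor parameters for the next page from the current response. Repeat this until the hasNextPage attribute is false.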
def parseRottenReviews(self, response):
    print("Reached Rotten Review Parser")
    current_result = json.loads(response.text)
    for review in current_result["reviews"]:
        # Spider class member, so that it can be shared among iterations
        {SpiderClass}.reviews.append(review)
    if current_result["pageInfo"]["hasNextPage"] is True:
        next_page = "https://www.rottentomatoes.com/napi/movie/" + \
                    str({SpiderClass}.movieId) + "/reviews/user?direction=next&endCursor=" + str(
            current_result["pageInfo"][
                "endCursor"]) + "&startCursor=" + str(current_result["pageInfo"]["startCursor"])
        yield response.follow(next_page, callback=self.parseRottenReviews)
The {SpiderClass}.reviews array will now hold the reviews.
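Cursor values pasted into a URL by string concatenation are not percent-encoded, which may matter if a cursor contains characters such as `=` or `+`. Below is a minimal sketch of building the same URL with `urllib.parse.urlencode`. The helper names `reviews_url` and `next_reviews_url` are hypothetical, and the endpoint and parameter names are taken from the snippets above.

```python
from urllib.parse import urlencode

API_ROOT = "https://www.rottentomatoes.com/napi/movie"


def reviews_url(movie_id, end_cursor="", start_cursor=""):
    """Build the user-reviews URL; empty cursors request the first page."""
    query = urlencode({
        "direction": "next",
        "endCursor": end_cursor,
        "startCursor": start_cursor,
    })
    return f"{API_ROOT}/{movie_id}/reviews/user?{query}"


def next_reviews_url(movie_id, page):
    """Return the URL of the following page, or None on the last page."""
    info = page.get("pageInfo", {})
    if not info.get("hasNextPage"):
        return None
    return reviews_url(movie_id, info.get("endCursor", ""), info.get("startCursor", ""))
```

Inside `parseRottenReviews`, the result of `next_reviews_url` could be passed to `response.follow` whenever it is not None. Passing the movie ID along through the request's `cb_kwargs` (available in recent Scrapy versions) might also replace the shared class attribute, though that is a design suggestion rather than part of the approach above.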