How do I scrape RottenTomatoes audience reviews using Python?

Date: 2020-06-15 10:54:39

Tags: python python-3.x web-scraping scrapy

I am writing a spider with Scrapy to scrape details from rottentomatoes.com. Since the search page is rendered dynamically, I use the Rotten Tomatoes API, e.g. https://www.rottentomatoes.com/api/private/v2.0/search?q=inception, to get the search results and URLs. By following those URLs with Scrapy I am able to extract the Tomatometer score, audience score, director, cast and so on. However, I also want to extract all of the audience reviews. The problem is that the audience reviews page (https://www.rottentomatoes.com/m/inception/reviews?type=user) uses pagination, and I am not able to extract data from the next pages; I also couldn't find a way to pull these details through an API. Can anyone help me with this?
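For reference, the search step can be sketched roughly as below. The method name parseSearchResults is just illustrative, and the "movies"/"url" keys of the search payload are assumptions about that private endpoint and may need adjusting. The detail parser that these URLs feed into is shown next.

    def start_requests(self):
        # Query the private search endpoint mentioned above (undocumented, may change)
        yield scrapy.Request("https://www.rottentomatoes.com/api/private/v2.0/search?q=inception",
                             callback=self.parseSearchResults)

    def parseSearchResults(self, response):
        results = json.loads(response.text)
        # Assumption: the payload contains a "movies" list whose entries carry a relative
        # "url" such as "/m/inception"; adjust the keys if the actual response differs
        for movie in results.get("movies", []):
            yield response.follow(movie["url"], callback=self.parseRottenDetail)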

    def parseRottenDetail(self, response):
        print("Reached Tomato Parser")
        try:
            if MoviecrawlSpider.current_parse <= MoviecrawlSpider.total_results:
                items = TomatoCrawlerItem()
                # Scores and critic consensus from the movie page
                MoviecrawlSpider.parse_rotten_list[MoviecrawlSpider.current_parse]['tomatometerScore'] = response.css(
                    '.mop-ratings-wrap__row .mop-ratings-wrap__half .mop-ratings-wrap__percentage::text').get().strip()
                MoviecrawlSpider.parse_rotten_list[MoviecrawlSpider.current_parse]['tomatoAudienceScore'] = response.css(
                    '.mop-ratings-wrap__row .mop-ratings-wrap__half.audience-score .mop-ratings-wrap__percentage::text').get().strip()
                MoviecrawlSpider.parse_rotten_list[MoviecrawlSpider.current_parse]['tomatoCriticConsensus'] = response.css(
                    'p.mop-ratings-wrap__text--concensus::text').get()
                if MoviecrawlSpider.parse_rotten_list[MoviecrawlSpider.current_parse]["type"] == "Movie":
                    MoviecrawlSpider.parse_rotten_list[MoviecrawlSpider.current_parse]['Director'] = response.xpath(
                        "//ul[@class='content-meta info']/li[@class='meta-row clearfix']/div[contains(text(),'Directed By')]/../div[@class='meta-value']/a/text()").get()
                else:
                    MoviecrawlSpider.parse_rotten_list[MoviecrawlSpider.current_parse]['Director'] = response.xpath(
                        "//div[@class='tv-series__series-info-castCrew']/div/span[contains(text(),'Creator')]/../a/text()").get()
                # Follow the audience reviews page if the movie has one, otherwise emit the item
                reviews_page = response.css('div.mop-audience-reviews__view-all a[href*="reviews"]::attr(href)').get()
                if len(reviews_page) != 0:
                    yield response.follow(reviews_page, callback=self.parseRottenReviews)
                else:
                    for key in MoviecrawlSpider.parse_rotten_list[MoviecrawlSpider.current_parse].keys():
                        if "pageURL" not in key and "type" not in key:
                            items[key] = MoviecrawlSpider.parse_rotten_list[MoviecrawlSpider.current_parse][key]
                    yield items
                    if MoviecrawlSpider.current_parse <= MoviecrawlSpider.total_results:
                        MoviecrawlSpider.current_parse += 1
                        print("Parse Values are Current Parse " + str(
                            MoviecrawlSpider.current_parse) + " and Total Results " + str(MoviecrawlSpider.total_results))
                        yield response.follow(MoviecrawlSpider.parse_rotten_list[MoviecrawlSpider.current_parse]["pageURL"],
                                              callback=self.parseRottenDetail)
        except Exception as e:
            exc_type, exc_obj, exc_tb = sys.exc_info()
            print(e)
            print(exc_tb.tb_lineno)

After this code executes I reach the reviews page, e.g. https://www.rottentomatoes.com/m/inception/reviews?type=user. From there, there is a Next button and the following page is loaded via pagination. So what should the approach be to extract all the reviews?

    def parseRottenReviews(self, response):
        print("Reached Rotten Review Parser")
        items = TomatoCrawlerItem()

2 Answers:

Answer 0 (score: 1)

When you go to the next page, you will notice that it uses the end-cursor value of the previous page. You can set endCursor to an empty string for the first iteration. Also note that you need the movieId in order to fetch the reviews; this id can be extracted from the JSON embedded in the page's JavaScript:

import requests
import re
import json

r = requests.get("https://www.rottentomatoes.com/m/inception/reviews?type=user")
data = json.loads(re.search('movieReview\s=\s(.*);', r.text).group(1))

movieId = data["movieId"]

def getReviews(endCursor):
    r = requests.get(
        f"https://www.rottentomatoes.com/napi/movie/{movieId}/reviews/user",
        params={
            "direction": "next",
            "endCursor": endCursor,
            "startCursor": ""
        })
    return r.json()

reviews = []
result = {}
for i in range(0, 5):
    print(f"[{i}] request review")
    result = getReviews(result["pageInfo"]["endCursor"] if i != 0 else "")
    reviews.extend([t for t in result["reviews"]])

print(reviews)
print(f"got {len(reviews)} reviews")

Note that you can also scrape the HTML for the first iteration.

Answer 1 (score: 0)

Since I am using Scrapy, I was looking for a way to do this without the requests module. The approach is the same, but I found that the page https://www.rottentomatoes.com/m/inception has an object root.RottenTomatoes.context.fandangoData inside a <script> tag, and its "emsId" key holds the id of the movie, which is what gets passed to the API to fetch the details. Going through the Network tab for each pagination event, I realized they use startCursor and endCursor to filter the results of each page.

pattern = r'\broot.RottenTomatoes.context.fandangoData\s*=\s*(\{.*?\})\s*;\s*\n'
json_data = response.css('script::text').re_first(pattern)
movie_id = json.loads(json_data)["emsId"]
{SpiderClass}.movieId = movie_id
next_page = "https://www.rottentomatoes.com/napi/movie/" + movie_id + "/reviews/user?direction=next&endCursor=&startCursor="
yield response.follow(next_page, callback=self.parseRottenReviews)

For the first iteration you can leave the startCursor and endCursor parameters empty. Now you get to the parse function. You can obtain the startCursor and endCursor parameters for the next page from the current response. Repeat this until the hasNextPage attribute is false.

def parseRottenReviews(self, response):
    print("Reached Rotten Review Parser")
    current_result = json.loads(response.text)
    for review in current_result["reviews"]:
        {SpiderClass}.reviews.append(review)  # spider-class member, so it is shared across iterations
    if current_result["pageInfo"]["hasNextPage"] is True:
        next_page = "https://www.rottentomatoes.com/napi/movie/" + \
                    str({SpiderClass}.movieId) + "/reviews/user?direction=next&endCursor=" + \
                    str(current_result["pageInfo"]["endCursor"]) + "&startCursor=" + \
                    str(current_result["pageInfo"]["startCursor"])
        yield response.follow(next_page, callback=self.parseRottenReviews)

Now the {SpiderClass}.reviews array will hold the reviews.
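Once hasNextPage is false, the collected list can be attached to the item and the spider can move on, roughly as in the sketch below. It would sit in the else branch of the hasNextPage check above; the audienceReviews field name is only a placeholder (assuming TomatoCrawlerItem declares such a field), and {SpiderClass} is the same placeholder used in the snippets above:

    else:
        # No more pages: copy the previously scraped fields for this movie into the item
        items = TomatoCrawlerItem()
        for key in {SpiderClass}.parse_rotten_list[{SpiderClass}.current_parse].keys():
            if "pageURL" not in key and "type" not in key:
                items[key] = {SpiderClass}.parse_rotten_list[{SpiderClass}.current_parse][key]
        items['audienceReviews'] = list({SpiderClass}.reviews)  # placeholder item field
        {SpiderClass}.reviews.clear()  # reset the shared list before the next movie
        yield items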