Scrapy: cannot get a value from another page

Time: 2020-12-26 16:49:19

Tags: python scrapy

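I started using Scrapy yesterday, working from this modified Scrapy project, https://github.com/prncc/steam-scraper, to collect Steam review data. The existing code keeps scrolling until there are no more reviews to scrape. However, I need to modify it slightly to get a value from another page. More specifically, on a page such as https://steamcommunity.com/app/416600/reviews, I want the number of reviews each reviewer has written, which is shown only on that reviewer's own reviews page (for example https://steamcommunity.com/profiles/76561197993023168/recommended/, which has 14 reviews).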

The original code is as follows:

import scrapy
from scrapy.http import FormRequest, Request

# get_page, get_product_id and load_review are helper functions from the
# steam-scraper project linked above; they are not shown here.


class ReviewSpider(scrapy.Spider):
    name = 'reviews'
    test_urls = [
        # Full Metal Furies
        'http://steamcommunity.com/app/416600/reviews/?browsefilter=mostrecent&p=1',
    ]

    def __init__(self, url_file=None, steam_id=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.url_file = url_file
        self.steam_id = steam_id

    def read_urls(self):
        # One start URL per non-empty line of the file.
        with open(self.url_file, 'r') as f:
            for url in f:
                url = url.strip()
                if url:
                    yield scrapy.Request(url, callback=self.parse)

    def start_requests(self):
        if self.steam_id:
            url = (
                f'http://steamcommunity.com/app/{self.steam_id}/reviews/'
                '?browsefilter=mostrecent&p=1'
            )
            yield Request(url, callback=self.parse)
        elif self.url_file:
            yield from self.read_urls()
        else:
            for url in self.test_urls:
                yield Request(url, callback=self.parse)

    def parse(self, response):
        page = get_page(response)
        product_id = get_product_id(response)

        # Load all reviews on current page.
        reviews = response.css('div .apphub_Card')
        for i, review in enumerate(reviews):
            yield load_review(review, product_id, page, i)

        # Navigate to next page.
        form = response.xpath('//form[contains(@id, "MoreContentForm")]')
        if form:
            yield self.process_pagination_form(form, page, product_id)

    def process_pagination_form(self, form, page=None, product_id=None):
        # Rebuild the hidden "load more" form as a GET request.
        action = form.xpath('@action').extract_first()
        names = form.xpath('input/@name').extract()
        values = form.xpath('input/@value').extract()

        formdata = dict(zip(names, values))
        meta = dict(prev_page=page, product_id=product_id)

        return FormRequest(
            url=action,
            method='GET',
            formdata=formdata,
            callback=self.parse,
            meta=meta
        )

What I tried was adding the following to the parse function, just to get the number of reviews for a given user:

def parse(self, response):
    page = get_page(response)
    product_id = get_product_id(response)

    # Load all reviews on current page.
    reviews = response.css('div .apphub_Card')
    for i, review in enumerate(reviews):
        yield load_review(review, product_id, page, i)
        Reviewers = response.xpath("/html/body/div[1]/div[5]/div[5]/div/div[1]/div/div/a[1]") #Get the path for each reviewer
        for IndividualReview in Reviewers:
            num_reviews = IndividualReview.xpath(".//@href").get()
            yield {
                'num_reviews': num_reviews
            }


    # Navigate to next page.
    form = response.xpath('//form[contains(@id, "MoreContentForm")]')
    if form:
        yield self.process_pagination_form(form, page, product_id)

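But it doesn't work. The main problem is that I'm not familiar with XPath in general, and I don't really understand how Scrapy is supposed to go to another page, get the required information, and then come back, repeating this for every review of a given game. How can I solve this?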

0 answers:

No answers yet.