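I started using Scrapy yesterday, working from this Scrapy-based scraper: https://github.com/prncc/steam-scraper, to collect Steam review data. The existing code keeps following the infinite-scroll pagination until there are no more reviews to scrape. However, I need to modify it slightly to pull a value from another page. More specifically, for a page such as https://steamcommunity.com/app/416600/reviews, I want to get the number of reviews each reviewer has written, which is only shown on the reviewer's own reviews page (for example https://steamcommunity.com/profiles/76561197993023168/recommended/, which has 14 reviews).
The original code is as follows: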
import scrapy
from scrapy.http import FormRequest, Request

# get_page, get_product_id and load_review are helpers from the
# steam-scraper repository and are not shown here.


class ReviewSpider(scrapy.Spider):
    name = 'reviews'
    test_urls = [
        # Full Metal Furies
        'http://steamcommunity.com/app/416600/reviews/?browsefilter=mostrecent&p=1',
    ]

    def __init__(self, url_file=None, steam_id=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.url_file = url_file
        self.steam_id = steam_id

    def read_urls(self):
        with open(self.url_file, 'r') as f:
            for url in f:
                url = url.strip()
                if url:
                    yield scrapy.Request(url, callback=self.parse)

    def start_requests(self):
        if self.steam_id:
            url = (
                f'http://steamcommunity.com/app/{self.steam_id}/reviews/'
                '?browsefilter=mostrecent&p=1'
            )
            yield Request(url, callback=self.parse)
        elif self.url_file:
            yield from self.read_urls()
        else:
            for url in self.test_urls:
                yield Request(url, callback=self.parse)

    def parse(self, response):
        page = get_page(response)
        product_id = get_product_id(response)
        # Load all reviews on current page.
        reviews = response.css('div .apphub_Card')
        for i, review in enumerate(reviews):
            yield load_review(review, product_id, page, i)
        # Navigate to next page.
        form = response.xpath('//form[contains(@id, "MoreContentForm")]')
        if form:
            yield self.process_pagination_form(form, page, product_id)

    def process_pagination_form(self, form, page=None, product_id=None):
        action = form.xpath('@action').extract_first()
        names = form.xpath('input/@name').extract()
        values = form.xpath('input/@value').extract()
        formdata = dict(zip(names, values))
        meta = dict(prev_page=page, product_id=product_id)
        return FormRequest(
            url=action,
            method='GET',
            formdata=formdata,
            callback=self.parse,
            meta=meta
        )
What I tried was to add the following to the parse function, just to get the number of reviews for a given user:
    def parse(self, response):
        page = get_page(response)
        product_id = get_product_id(response)
        # Load all reviews on current page.
        reviews = response.css('div .apphub_Card')
        for i, review in enumerate(reviews):
            yield load_review(review, product_id, page, i)
        # Get the path for each reviewer
        Reviewers = response.xpath("/html/body/div[1]/div[5]/div[5]/div/div[1]/div/div/a[1]")
        for IndividualReview in Reviewers:
            num_reviews = IndividualReview.xpath(".//@href").get()
            yield {
                'num_reviews': num_reviews
            }
        # Navigate to next page.
        form = response.xpath('//form[contains(@id, "MoreContentForm")]')
        if form:
            yield self.process_pagination_form(form, page, product_id)
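But it did not work. The main problem is that I am generally unfamiliar with XPath, and I don't really understand how Scrapy is supposed to go to another page, grab the required information, and then come back, repeating this for every review of a given game. How can I solve this?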