Question

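Although I've seen a few similar questions here, none of them seems to lay out exactly how to accomplish this task. I borrowed most of the Scrapy script located here, but since it is over a year old I have had to adjust the XPath references.

My current code is as follows: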

import scrapy
from tripadvisor.items import TripadvisorItem

class TrSpider(scrapy.Spider):
    name = 'trspider'
    start_urls = [
        'https://www.tripadvisor.com/Hotels-g29217-Island_of_Hawaii_Hawaii-Hotels.html'
        ]

    def parse(self, response):
        for href in response.xpath('//div[@class="listing_title"]/a/@href'):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_hotel)

        next_page = response.xpath('//div[@class="unified pagination standard_pagination"]/child::*[2][self::a]/@href')
        if next_page:
            url = response.urljoin(next_page[0].extract())
            yield scrapy.Request(url, self.parse)

    def parse_hotel(self, response):
        for href in response.xpath('//div[starts-with(@class,"quote")]/a/@href'):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_review)

        next_page = response.xpath('//div[@class="unified pagination "]/child::*[2][self::a]/@href')
        if next_page:
            url = response.urljoin(next_page[0].extract())
            yield scrapy.Request(url, self.parse_hotel)

    def parse_review(self, response):
        item = TripadvisorItem()
        item['headline'] = response.xpath('translate(//div[@class="quote"]/text(),"!"," ")').extract()[0][1:-1]
        item['review'] = response.xpath('translate(//div[@class="entry"]/p,"\n"," ")').extract()[0]
        item['bubbles'] = response.xpath('//span[contains(@class,"ui_bubble_rating")]/@alt').extract()[0]
        item['date'] = response.xpath('normalize-space(//span[contains(@class,"ratingDate")]/@content)').extract()[0]
        item['hotel'] = response.xpath('normalize-space(//span[@class="altHeadInline"]/a/text())').extract()[0]
        return item

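When I run the spider in its current form, it scrapes the first page of reviews for each hotel listed on the start_urls page, but the pagination never moves on to the next page of reviews. I suspect this is because of this line: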

next_page = response.xpath('//div[@class="unified pagination "]/child::*[2][self::a]/@href')

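Since these pages load dynamically, the href for the next page does not exist on the current page. Investigating further, I have read that these requests are sent as XHR requests using POST. By exploring the "Network" tab of Firefox's "Inspect" tool, I can see both a Request URL and Form Data which, according to other posts on SO on the same topic, might be what is needed to turn the page.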

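However, the other posts seem to refer to a static URL starting point when passing a FormRequest with Scrapy. With TripAdvisor, the URL always changes based on the name of the hotel we are looking at, so I'm not sure how to choose the URL when using FormRequest to submit the Form Data: reqNum=1&changeSet=REVIEW_LIST (this Form Data also never seems to change from page to page).

Alternatively, there doesn't appear to be a way to extract the URL shown under "Request URL" in the "Network" tab. These pages do have URLs that differ from page to page, but the way this is set up, I can't seem to extract them from the source code. The review pages change by incrementing the -orXX- part of the URL, where "XX" is a number. For example:

https://www.tripadvisor.com/Hotel_Review-g2312116-d113123-Reviews-Fairmont_Orchid_Hawaii-Puako_Kohala_Coast_Island_of_Hawaii_Hawaii.html
https://www.tripadvisor.com/Hotel_Review-g2312116-d113123-Reviews-or5-Fairmont_Orchid_Hawaii-Puako_Kohala_Coast_Island_of_Hawaii_Hawaii.html
https://www.tripadvisor.com/Hotel_Review-g2312116-d113123-Reviews-or10-Fairmont_Orchid_Hawaii-Puako_Kohala_Coast_Island_of_Hawaii_Hawaii.html
https://www.tripadvisor.com/Hotel_Review-g2312116-d113123-Reviews-or15-Fairmont_Orchid_Hawaii-Puako_Kohala_Coast_Island_of_Hawaii_Hawaii.html

So, my question is whether it is possible to paginate using the XHR request/Form Data, or whether I need to manually build a list of URLs for each hotel that adds the -orXX-?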
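If building the URLs by hand does turn out to be necessary, the -orXX- pattern in the example URLs can be generated rather than typed out. The following is only a minimal sketch: it assumes the offset always sits right after "-Reviews-" and grows by 5 per page, as in the examples. The helper name review_page_urls and the pages argument are hypothetical, and the real number of pages per hotel would still have to be read from the site.

```python
def review_page_urls(first_page_url, pages, step=5):
    """Return URLs for the first `pages` review pages of one hotel.

    Assumes the first-page URL contains "-Reviews-" exactly once and that
    later pages insert "or<offset>-" right after it. This is inferred from
    the example URLs, not from any documented TripAdvisor rule.
    """
    marker = "-Reviews-"
    if first_page_url.count(marker) != 1:
        raise ValueError("expected exactly one '-Reviews-' in the URL")
    head, tail = first_page_url.split(marker)
    urls = [first_page_url]
    for page in range(1, pages):
        # Page 2 gets offset 5, page 3 gets offset 10, and so on.
        urls.append(f"{head}{marker}or{page * step}-{tail}")
    return urls


first = ("https://www.tripadvisor.com/Hotel_Review-g2312116-d113123-Reviews-"
         "Fairmont_Orchid_Hawaii-Puako_Kohala_Coast_Island_of_Hawaii_Hawaii.html")
for url in review_page_urls(first, pages=4):
    print(url)
```

In a spider, each generated URL could be handed to scrapy.Request with parse_hotel as the callback, though this approach needs the page count up front, which is its main drawback.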

Answer 1

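Well, I eventually stumbled on an XPath that apparently allows the reviews to be paginated. It's odd, though, because every time I checked the underlying HTML, the href link never changed from referencing /Hotel_Review-g2312116-d113123-Reviews-or5-Fairmont_Orchid_Hawaii-Puako_Kohala_Coast_Island_of_Hawaii_Hawaii.html, even when I was on page 10, for example. It seems the "-orXX-" part of the link always increments XX by 5, so I'm not sure why this works.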

All I did was change the line: next_page = response.xpath('//div[@class="unified pagination "]/child::*[2][self::a]/@href')

to: next_page = response.xpath('//link[@rel="next"]/@href')

and I now have more than 41K extracted reviews. I would love to hear how others have handled this problem in other situations.
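The replacement XPath depends on the page advertising its successor in a link element with rel="next". To illustrate the same lookup outside Scrapy, here is a rough sketch that uses only the Python standard library. The names NextLinkFinder and next_page_url and the sample markup are made up for the example, and it assumes the element is present in the HTML the server returns.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class NextLinkFinder(HTMLParser):
    """Collect the href of the first <link rel="next"> element."""

    def __init__(self):
        super().__init__()
        self.next_href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # rel may hold several space-separated tokens, or be missing.
        rel = (attrs.get("rel") or "").split()
        if tag == "link" and "next" in rel and self.next_href is None:
            self.next_href = attrs.get("href")


def next_page_url(page_url, html):
    """Return the absolute URL of the next page, or None if there is none."""
    finder = NextLinkFinder()
    finder.feed(html)
    if finder.next_href is None:
        return None
    return urljoin(page_url, finder.next_href)


# Made-up sample markup, not copied from TripAdvisor.
sample = ('<html><head><link rel="next" '
          'href="/Hotel_Review-Reviews-or5-Example.html"></head></html>')
print(next_page_url("https://www.example.com/Hotel_Review-Reviews-Example.html", sample))
```

Returning None when no such element exists mirrors the `if next_page:` check in the spider, which is what stops the crawl on the last page of reviews.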

Scrapy XHR pagination on TripAdvisor
