Scrapy spider finds one "Next" button but not the other

Time: 2019-03-23 11:06:03

Tags: python-3.x scrapy

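I am writing a spider to scrape a popular review site :-) This is my first attempt at writing a Scrapy spider.

The top level is a list of restaurants (I call this the "top level"), shown 30 at a time. My spider visits each link and then "clicks Next" to get the next 30, and so on. This part works: my output contains thousands of restaurants, not just the first 30.

I then want it to "click" through to each restaurant's page (the "restaurant level"). That page only contains truncated versions of the reviews, so I want it to "click" down one more level (to the "review level") and scrape the reviews from there. Reviews are shown 5 at a time, with another "Next" button. This is the only "level" I extract content from; the other levels only hold the links needed to reach the reviews and the other information I want.

Most of this works, since I am getting all the information I want, but only for the first 5 reviews of each restaurant. The spider is not "finding" the "Next" button at the bottom "review level".

I have tried changing the order of the commands in the parse method, but beyond that I am out of ideas! My XPaths are fine, so it must be something to do with the structure of the spider.

My spider looks like this: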

import scrapy
from scrapy.http import Request

class TripSpider(scrapy.Spider):

    name = 'tripadvisor'
    allowed_domains = ['tripadvisor.co.uk']
    start_urls = ['https://www.tripadvisor.co.uk/Restaurants-g187069-Manchester_Greater_Manchester_England.html']
    custom_settings = {
       'DOWNLOAD_DELAY': 1,
       # 'DEPTH_LIMIT': 3,
       'AUTOTHROTTLE_TARGET_CONCURRENCY': 0.5,
       'USER_AGENT': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36",
       # 'DEPTH_PRIORITY': 1,
       # 'SCHEDULER_DISK_QUEUE': 'scrapy.squeues.PickleFifoDiskQueue',
       # 'SCHEDULER_MEMORY_QUEUE': 'scrapy.squeues.FifoMemoryQueue'
    }

    def scrape_review(self, response):
        restaurant_name_review = response.xpath('//div[@class="wrap"]//span[@class="taLnk "]//text()').extract()
        reviewer_name = response.xpath('//div[@class="username mo"]//text()').extract()
        review_rating = response.xpath('//div[@class="wrap"]/div[@class="rating reviewItemInline"]/span[starts-with(@class,"ui_bubble_rating")]').extract()
        review_title = response.xpath('//div[@class="wrap"]//span[@class="noQuotes"]//text()').extract()
        full_reviews = response.xpath('//div[@class="wrap"]/div[@class="prw_rup prw_reviews_text_summary_hsx"]/div[@class="entry"]/p').extract()
        review_date = response.xpath('//div[@class="prw_rup prw_reviews_stay_date_hsx"]/text()[not(parent::script)]').extract()
        restaurant_name = response.xpath('//div[@id="listing_main_sur"]//a[@class="HEADING"]//text()').extract() * len(full_reviews)
        restaurant_rating = response.xpath('//div[@class="userRating"]//@alt').extract() * len(full_reviews)
        restaurant_review_count = response.xpath('//div[@class="userRating"]//a//text()').extract() * len(full_reviews)

        for rvn, rvr, rvt, fr, rd, rn, rr, rvc in zip(reviewer_name, review_rating, review_title, full_reviews, review_date, restaurant_name, restaurant_rating, restaurant_review_count):
            reviews_dict = dict(zip(['reviewer_name', 'review_rating', 'review_title', 'full_reviews', 'review_date', 'restaurant_name', 'restaurant_rating', 'restaurant_review_count'], (rvn, rvr, rvt, fr, rd, rn, rr, rvc)))
            yield reviews_dict
            # print(reviews_dict)

    def parse(self, response):
        ### The parse method is what is actually being repeated / iterated
        for review in self.scrape_review(response):
            yield review
            # print(review)

        # access next page of restaurants
        next_page_restaurants = response.xpath('//a[@class="nav next rndBtn ui_button primary taLnk"]/@href').extract_first()
        next_page_restaurants_url = response.urljoin(next_page_restaurants)
        yield Request(next_page_restaurants_url)
        print(next_page_restaurants_url)

        # access next page of reviews
        next_page_reviews = response.xpath('//a[@class="nav next taLnk "]/@href').extract_first()
        next_page_reviews_url = response.urljoin(next_page_reviews)
        yield Request(next_page_reviews_url)
        print(next_page_reviews_url)

        # access each restaurant page:
        url = response.xpath('//div[@id="EATERY_SEARCH_RESULTS"]/div/div/div/div/a[@target="_blank"]/@href').extract()
        for url_next in url:
            url_full = response.urljoin(url_next)
            yield Request(url_full)

        # "accesses the first review to get to the full reviews (not the truncated versions)"
        first_review = response.xpath('//a[@class="title "]/@href').extract_first() # extract first used as I only want to access one of the links on this page to get down to "review level"
        first_review_full = response.urljoin(first_review)
        yield Request(first_review_full)
        # print(first_review_full)

1 Answer:

Answer 0 (score: 0)

You are missing a space at the end of the class value.

Try this:

next_page_reviews = response.xpath('//a[@class="nav next taLnk "]/@href').extract_first()
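
A selector that matches nothing can be hard to spot in this spider, because the printed URL still looks valid. The sketch below illustrates why, using only the standard library. It assumes that `response.urljoin` delegates to `urllib.parse.urljoin` (which is my understanding of Scrapy, but worth verifying for your version), and the page URL is a made-up placeholder.

```python
from urllib.parse import urljoin

# Hypothetical review-page URL, used only for illustration.
page_url = "https://www.tripadvisor.co.uk/ShowUserReviews-example.html"

# This is what extract_first() returns when the XPath matches nothing.
missing_href = None

# With no href to join, urljoin falls back to the base URL itself.
next_url = urljoin(page_url, missing_href)

print(next_url == page_url)
```

If that assumption holds, the spider yields a request for the page it is already on, and Scrapy's default duplicate filter drops it without any error. Checking the result of `extract_first()` for `None` before building the request makes this failure visible.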

Here are some tips on matching classes partially: https://docs.scrapy.org/en/latest/topics/selectors.html#when-querying-by-class-consider-using-css
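
As a rough sketch of the idea behind those tips, an XPath can test for individual class tokens instead of comparing the whole attribute string, so trailing spaces and token order stop mattering. The helper names `class_token` and `has_classes` below are hypothetical, not part of Scrapy; the class names are taken from the question.

```python
def class_token(name):
    """Return an XPath predicate that is true when @class contains the token `name`."""
    return 'contains(concat(" ", normalize-space(@class), " "), " {} ")'.format(name)


def has_classes(*names):
    """Combine one predicate per class token with `and`."""
    return " and ".join(class_token(name) for name in names)


# The restaurant-list button also carries "nav", "next" and "taLnk",
# so it is excluded here through its extra "rndBtn" token.
next_reviews_xpath = "//a[{} and not({})]/@href".format(
    has_classes("nav", "next", "taLnk"),
    class_token("rndBtn"),
)

print(next_reviews_xpath)
```

Inside a callback this string could be passed to `response.xpath(next_reviews_xpath).extract_first()`. The CSS form `response.css('a.nav.next.taLnk::attr(href)')` expresses the same token matching more briefly, though without the exclusion it would also match the restaurant-list button.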

By the way, you could define separate parse callbacks to make it clearer what each one is responsible for: https://docs.scrapy.org/en/latest/intro/tutorial.html?highlight=callback#more-examples-and-patterns