I'm writing a spider to crawl a popular review site :-) This is my first attempt at writing a Scrapy spider.

The top level is a list of restaurants (I'll call this the "top level"), shown 30 at a time. My spider visits each link and then "clicks next" to get the next 30, and so on. This part is working, since my output does contain thousands of restaurants, not just the first 30.

I then want it to "click through" the link to each restaurant's page (the "restaurant level"), but that only contains truncated versions of the reviews, so I want it to then "click" down another level (to the "review level") and scrape the reviews from there, where a further "next" button shows 5 reviews at a time. This is the only level I extract content from; the other levels just hold the links needed to get to the reviews and the other information I want.

Most of this works, since I am getting all the information I want, but only for the first 5 reviews of each restaurant. It is not "finding" the "next" button at the bottom "review level".

I have tried changing the order of the commands in the parse method, but beyond that I'm out of ideas! My XPaths are fine, so it must be something to do with the structure of the spider.

My spider looks like this:
import scrapy
from scrapy.http import Request


class TripSpider(scrapy.Spider):
    name = 'tripadvisor'
    allowed_domains = ['tripadvisor.co.uk']
    start_urls = ['https://www.tripadvisor.co.uk/Restaurants-g187069-Manchester_Greater_Manchester_England.html']

    custom_settings = {
        'DOWNLOAD_DELAY': 1,
        # 'DEPTH_LIMIT': 3,
        'AUTOTHROTTLE_TARGET_CONCURRENCY': 0.5,
        'USER_AGENT': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36",
        # 'DEPTH_PRIORITY': 1,
        # 'SCHEDULER_DISK_QUEUE': 'scrapy.squeues.PickleFifoDiskQueue',
        # 'SCHEDULER_MEMORY_QUEUE': 'scrapy.squeues.FifoMemoryQueue'
    }

    def scrape_review(self, response):
        restaurant_name_review = response.xpath('//div[@class="wrap"]//span[@class="taLnk "]//text()').extract()
        reviewer_name = response.xpath('//div[@class="username mo"]//text()').extract()
        review_rating = response.xpath('//div[@class="wrap"]/div[@class="rating reviewItemInline"]/span[starts-with(@class,"ui_bubble_rating")]').extract()
        review_title = response.xpath('//div[@class="wrap"]//span[@class="noQuotes"]//text()').extract()
        full_reviews = response.xpath('//div[@class="wrap"]/div[@class="prw_rup prw_reviews_text_summary_hsx"]/div[@class="entry"]/p').extract()
        review_date = response.xpath('//div[@class="prw_rup prw_reviews_stay_date_hsx"]/text()[not(parent::script)]').extract()
        restaurant_name = response.xpath('//div[@id="listing_main_sur"]//a[@class="HEADING"]//text()').extract() * len(full_reviews)
        restaurant_rating = response.xpath('//div[@class="userRating"]//@alt').extract() * len(full_reviews)
        restaurant_review_count = response.xpath('//div[@class="userRating"]//a//text()').extract() * len(full_reviews)

        for rvn, rvr, rvt, fr, rd, rn, rr, rvc in zip(reviewer_name, review_rating, review_title, full_reviews, review_date, restaurant_name, restaurant_rating, restaurant_review_count):
            reviews_dict = dict(zip(['reviewer_name', 'review_rating', 'review_title', 'full_reviews', 'review_date', 'restaurant_name', 'restaurant_rating', 'restaurant_review_count'], (rvn, rvr, rvt, fr, rd, rn, rr, rvc)))
            yield reviews_dict
            # print(reviews_dict)

    def parse(self, response):
        ### The parse method is what is actually being repeated / iterated
        for review in self.scrape_review(response):
            yield review
            # print(review)

        # access next page of restaurants
        next_page_restaurants = response.xpath('//a[@class="nav next rndBtn ui_button primary taLnk"]/@href').extract_first()
        next_page_restaurants_url = response.urljoin(next_page_restaurants)
        yield Request(next_page_restaurants_url)
        print(next_page_restaurants_url)

        # access next page of reviews
        next_page_reviews = response.xpath('//a[@class="nav next taLnk "]/@href').extract_first()
        next_page_reviews_url = response.urljoin(next_page_reviews)
        yield Request(next_page_reviews_url)
        print(next_page_reviews_url)

        # access each restaurant page:
        url = response.xpath('//div[@id="EATERY_SEARCH_RESULTS"]/div/div/div/div/a[@target="_blank"]/@href').extract()
        for url_next in url:
            url_full = response.urljoin(url_next)
            yield Request(url_full)

        # accesses the first review to get to the full reviews (not the truncated versions)
        first_review = response.xpath('//a[@class="title "]/@href').extract_first()  # extract_first used as I only want to access one of the links on this page to get down to "review level"
        first_review_full = response.urljoin(first_review)
        yield Request(first_review_full)
        # print(first_review_full)
Answer 0 (score: 0)
Try this (note the exact class value, including the trailing space):
next_page_reviews = response.xpath('//a[@class="nav next taLnk "]/@href').extract_first()
Here are some tips on partial matching of classes: https://docs.scrapy.org/en/latest/topics/selectors.html#when-querying-by-class-consider-using-css
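For example, a minimal sketch of both styles, reusing the class names from the spider above (whether they still match the live page is an assumption):

# XPath: match individual class tokens instead of the exact attribute value.
# Note that contains() does substring matching, so "nav" would also match "navbar".
next_page_reviews = response.xpath(
    '//a[contains(@class, "next") and contains(@class, "taLnk")]/@href'
).extract_first()

# CSS: class selectors match whole tokens, so extra classes and stray
# trailing spaces in the class attribute don't matter.
next_page_reviews = response.css('a.nav.next.taLnk::attr(href)').extract_first()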
By the way, you can define separate parse functions to make it clearer what each one is responsible for: https://docs.scrapy.org/en/latest/intro/tutorial.html?highlight=callback#more-examples-and-patterns
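A rough sketch of that structure (illustrative only, reusing the question's selectors and assuming they are correct): each level gets its own callback, so the review-level "next" button is followed from the review pages themselves rather than from the listing's parse:

# inside TripSpider, replacing the single parse method:

def parse(self, response):
    # top level: queue every restaurant page, then the next page of the listing
    for href in response.xpath('//div[@id="EATERY_SEARCH_RESULTS"]/div/div/div/div/a[@target="_blank"]/@href').extract():
        yield Request(response.urljoin(href), callback=self.parse_restaurant)
    next_page = response.xpath('//a[@class="nav next rndBtn ui_button primary taLnk"]/@href').extract_first()
    if next_page:
        yield Request(response.urljoin(next_page), callback=self.parse)

def parse_restaurant(self, response):
    # restaurant level: follow the first truncated review down to the full reviews
    first_review = response.xpath('//a[@class="title "]/@href').extract_first()
    if first_review:
        yield Request(response.urljoin(first_review), callback=self.parse_review)

def parse_review(self, response):
    # review level: extract the reviews, then follow this level's own "next" button
    for review in self.scrape_review(response):
        yield review
    next_page = response.xpath('//a[contains(@class, "next") and contains(@class, "taLnk")]/@href').extract_first()
    if next_page:
        yield Request(response.urljoin(next_page), callback=self.parse_review)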