I'm trying to paginate through a review website whose pagination buttons have no hyperlinks. I've written the pagination logic and hard-coded the number of pages for each link. However, I'd like to know whether I can use the information I scrape as the page count for each of the links in start_requests.
Here is the spider code (with 2 links to paginate):
class TareviewsSpider(scrapy.Spider):
    name = 'tareviews'
    allowed_domains = ['tripadvisor.com']
    # start_urls = []

    def start_requests(self):
        for page in range(0, 395, 5):
            yield self.make_requests_from_url('https://www.tripadvisor.com/Hotel_Review-g60795-d102542-Reviews-or{}-Courtyard_Philadelphia_Airport-Philadelphia_Pennsylvania.html'.format(page))
        for page in range(0, 1645, 5):
            yield self.make_requests_from_url('https://www.tripadvisor.com/Hotel_Review-g60795-d122332-Reviews-or{}-The_Ritz_Carlton_Philadelphia-Philadelphia_Pennsylvania.html'.format(page))

    def parse(self, response):
        for idx, review in enumerate(response.css('div.review-container')):
            item = {
                'num_reviews': response.css('span.reviews_header_count::text')[0].re(r'\d{0,3}\,?\d{1,3}'),
                'hotel_name': response.css('h1.heading_title::text').extract_first(),
                'review_title': review.css('span.noQuotes::text').extract_first(),
                'review_body': review.css('p.partial_entry::text').extract_first(),
                'review_date': review.xpath('//*[@class="ratingDate relativeDate"]/@title')[idx].extract(),
                'num_reviews_reviewer': review.css('span.badgetext::text').extract_first(),
                'reviewer_name': review.css('span.scrname::text').extract(),
                'bubble_rating': review.xpath("//div[contains(@class, 'reviewItemInline')]//span[contains(@class, 'ui_bubble_rating')]/@class")[idx].re(r'(?<=ui_bubble_rating bubble_).+?(?=0)')
            }
            yield item
'num_reviews' is the numeric value of the last page for each link. I'd like to replace the hard-coded 395 and 1645 in the for loops in start_requests with that scraped value.

Is this possible? If so, I'd like to avoid using a headless browser. Thanks!
Answer 0 (score: 1)
I made this code:

I use the plain url - without -or{} - to get the page and find the number of reviews.
Next I add -or{} to the url - it can be in any place - to generate the urls for the pages with reviews.
Then I use a for loop and Request() to get the pages with reviews.
The reviews are parsed by a different method - parse_reviews().

In the code I use scrapy.crawler.CrawlerProcess() to run it without a full project, so everyone can easily run and test it. It saves the data in output.csv.
import scrapy


class TareviewsSpider(scrapy.Spider):
    name = 'tareviews'
    allowed_domains = ['tripadvisor.com']
    start_urls = [  # without `-or{}`
        'https://www.tripadvisor.com/Hotel_Review-g60795-d102542-Reviews-Courtyard_Philadelphia_Airport-Philadelphia_Pennsylvania.html',
        'https://www.tripadvisor.com/Hotel_Review-g60795-d122332-Reviews-The_Ritz_Carlton_Philadelphia-Philadelphia_Pennsylvania.html',
    ]

    def parse(self, response):
        # get number of reviews
        num_reviews = response.css('span.reviews_header_count::text').extract_first()
        num_reviews = num_reviews[1:-1]  # remove `( )`
        num_reviews = num_reviews.replace(',', '')  # remove `,`
        num_reviews = int(num_reviews)  # convert to integer
        print('num_reviews:', num_reviews, type(num_reviews))

        # create template to generate urls to pages with reviews
        url = response.url.replace('.html', '-or{}.html')
        print('template:', url)

        # yield a request for every page of reviews (5 reviews per page)
        for offset in range(0, num_reviews, 5):
            print('url:', url.format(offset))
            yield scrapy.Request(url=url.format(offset), callback=self.parse_reviews)

    def parse_reviews(self, response):
        print('reviews')
        for idx, review in enumerate(response.css('div.review-container')):
            item = {
                'num_reviews': response.css('span.reviews_header_count::text')[0].re(r'\d{0,3}\,?\d{1,3}'),
                'hotel_name': response.css('h1.heading_title::text').extract_first(),
                'review_title': review.css('span.noQuotes::text').extract_first(),
                'review_body': review.css('p.partial_entry::text').extract_first(),
                'review_date': review.xpath('//*[@class="ratingDate relativeDate"]/@title')[idx].extract(),
                'num_reviews_reviewer': review.css('span.badgetext::text').extract_first(),
                'reviewer_name': review.css('span.scrname::text').extract(),
                'bubble_rating': review.xpath("//div[contains(@class, 'reviewItemInline')]//span[contains(@class, 'ui_bubble_rating')]/@class")[idx].re(r'(?<=ui_bubble_rating bubble_).+?(?=0)')
            }
            yield item


# --- run without project ---
import scrapy.crawler

c = scrapy.crawler.CrawlerProcess({
    "FEED_FORMAT": 'csv',
    "FEED_URI": 'output.csv',
})
c.crawl(TareviewsSpider)
c.start()
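The slicing in parse() assumes the count text is always wrapped in parentheses, e.g. '(1,234)'. A slightly more defensive variant (a sketch, not part of the original answer; the helper name is made up) extracts just the digits with a regex, so it also survives a missing comma or extra whitespace:

```python
import re


def parse_review_count(text):
    """Extract an integer review count from strings like '(1,234)'.

    Returns 0 when the text is missing or contains no digits,
    instead of raising like `int(text[1:-1])` would.
    """
    digits = re.sub(r'\D', '', text or '')  # drop everything that is not a digit
    return int(digits) if digits else 0


print(parse_review_count('(1,234)'))  # -> 1234
print(parse_review_count(None))       # -> 0
```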
BTW: to get the pages you only need

https://www.tripadvisor.com/g60795-d102542
https://www.tripadvisor.com/g60795-d102542-or0
https://www.tripadvisor.com/g60795-d102542-or5

The other words in the url are only for SEO - to rank better in Google search results.
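To illustrate, generating the paginated urls from that short form is plain string formatting (a sketch; the short url scheme is the assumption from the note above, and the helper name is made up):

```python
def page_urls(base_url, num_reviews, per_page=5):
    """Build one url per page of reviews, `per_page` reviews per page."""
    return ['{}-or{}'.format(base_url, offset)
            for offset in range(0, num_reviews, per_page)]


base = 'https://www.tripadvisor.com/g60795-d102542'
for url in page_urls(base, 12):
    print(url)
# https://www.tripadvisor.com/g60795-d102542-or0
# https://www.tripadvisor.com/g60795-d102542-or5
# https://www.tripadvisor.com/g60795-d102542-or10
```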