Hello, this is my first attempt at scraping https://socialblade.com/ to get the channel IDs of the most viewed and most subscribed YouTubers for a given country.
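My approach is to follow each YouTuber's link from the main list page (for example https://socialblade.com/youtube/top/country/pk/mostsubscribed). That opens a new page, and the last part of the new page's URL contains the channel ID (for example https://socialblade.com/youtube/channel/UC4JCksJF76g_MdzPVBJoC3Q).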
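To make "the last part of the URL" concrete, here is a minimal sketch of the extraction step on its own. The helper name `channel_id_from_url` is only for illustration and is not part of my spider; it assumes the channel ID is always the final non-empty path segment.

```python
from urllib.parse import urlparse


def channel_id_from_url(url):
    """Return the last non-empty path segment of a URL, or None if there is none.

    Hypothetical helper: assumes the channel ID is the final path segment.
    """
    path = urlparse(url).path
    segments = [segment for segment in path.split("/") if segment]
    return segments[-1] if segments else None


if __name__ == "__main__":
    print(channel_id_from_url(
        "https://socialblade.com/youtube/channel/UC4JCksJF76g_MdzPVBJoC3Q"
    ))
```

Unlike a plain `url.split('/')[-1]`, this version ignores a trailing slash and any query string.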
Here is my code:
import scrapy


class SocialBladeSpider(scrapy.Spider):
    name = "socialblade"

    def start_requests(self):
        urls = [
            'https://socialblade.com/youtube/top/country/pk/mostviewed',
            'https://socialblade.com/youtube/top/country/pk/mostsubscribed'
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse_url(self, response):
        data = {
            'url': response.url.split('/')[-1],
            'displayName': response.css('div#YouTubeUserTopInfoBlockTop div h1::text').extract_first()
        }
        yield {
            response.meta['country']: {
                response.meta['key']: data
            }
        }

    def parse(self, response):
        key = response.url.split("/")[-1]
        country = response.url.split("/")[-2]
        for a in response.css('a[href^="/youtube/user/"]'):
            request = scrapy.Request(url='https://socialblade.com' + a.css('::attr(href)').extract_first(), callback=self.parse_url)
            request.meta['key'] = key
            request.meta['country'] = country
            yield request
The problem: after scraping these two URLs I should get 500 records in total, but I only get 348. I have researched this but could not find a solution.
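One check I could run to see where the missing records go is sketched below. It rests on two guesses that I have not confirmed: that some list entries may link through a prefix other than `/youtube/user/` (the example channel URL above uses `/youtube/channel/`), and that the same channel may appear in both lists, in which case Scrapy's default duplicate filter would drop the second request unless `dont_filter=True` is passed. The function name `summarize_links` and the sample hrefs are made up for illustration.

```python
from collections import Counter


def summarize_links(hrefs):
    """Count hrefs by their leading two path segments and count unique hrefs.

    Hypothetical diagnostic helper, not part of the spider.
    """
    by_prefix = Counter()
    for href in hrefs:
        parts = [part for part in href.split("/") if part]
        # e.g. "/youtube/user/name" -> "/youtube/user/"
        prefix = "/" + "/".join(parts[:2]) + "/" if len(parts) >= 2 else href
        by_prefix[prefix] += 1
    return {
        "total": len(hrefs),
        "unique": len(set(hrefs)),
        "by_prefix": dict(by_prefix),
    }


if __name__ == "__main__":
    # Made-up sample hrefs; real ones would come from the list pages.
    sample = [
        "/youtube/user/example1",
        "/youtube/channel/UC4JCksJF76g_MdzPVBJoC3Q",
        "/youtube/user/example1",
        "/youtube/c/example2",
    ]
    print(summarize_links(sample))
```

Inside `parse`, the hrefs could be collected with something like `response.css('a[href^="/youtube/"]::attr(href)').getall()`. If `by_prefix` shows entries outside `/youtube/user/`, or `unique` is smaller than `total`, that would account for at least part of the gap between 500 and 348.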
(Please guide me in solving this problem.)