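I am trying to extract information that is wrapped in a set of tags like the ones shown below. The data I want is the text "Comedy Nights Live - Full Episodes". I used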
response.xpath("//h3/span/text()").extract()
response.xpath('//*[@id="meta"]/h3/span/text()').extract()
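Both queries run, but each time I get an empty list. There may be a mistake in how I am accessing the data, but as a beginner I need guidance on how to reach the desired result.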
<a id="meta" class="yt-simple-endpoint style-scope ytd-grid-playlist-renderer" href="/watch?v=q1XwumKHSg8&list=PLX18mvVSh-bz3qlgf-uomp8zktOG5Rdj3">
<h3 class="style-scope ytd-grid-playlist-renderer">
<span id="video-title" class="style-scope ytd-grid-playlist-renderer">
Comedy Nights Live - Full Episodes
</span>
</h3>
</a>
This is the spider file.
# -*- coding: utf-8 -*-
import scrapy


class YtubeSpider(scrapy.Spider):
    name = 'ytube'
    allowed_domains = ['www.youtube.com/user/KapilComedyNights/playlists']
    start_urls = ['http://www.youtube.com/user/KapilComedyNights/playlists/']

    def parse(self, response):
        pass
scrapy, python 2.7
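Answer 0 (score: 0)
Use your browser's developer tools to look at how the page is put together. You will see that YouTube loads its content with AJAX, so the elements you see in the inspector are not in the HTML that Scrapy downloads. Download the AJAX data directly and parse it. Also keep in mind that Scrapy accesses the site anonymously.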
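One way to confirm this diagnosis is to check whether the element you are targeting exists in the raw HTML at all, before any script has run. The following is a minimal sketch using only the Python 3 standard library (in Python 2.7 the module is named HTMLParser). The helper has_element_with_id and both sample strings are made up for illustration; the second string is only a stand-in for the kind of bare shell page a plain HTTP client might receive.

```python
from html.parser import HTMLParser


class IdFinder(HTMLParser):
    """Records whether any start tag carries the wanted id attribute."""

    def __init__(self, wanted_id):
        super().__init__()
        self.wanted_id = wanted_id
        self.found = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag.
        if dict(attrs).get("id") == self.wanted_id:
            self.found = True


def has_element_with_id(html, wanted_id):
    """Return True if the raw HTML string contains an element with that id."""
    finder = IdFinder(wanted_id)
    finder.feed(html)
    return finder.found


# Hypothetical samples: markup as seen in the browser inspector versus a
# bare shell page whose content would only be filled in later by scripts.
rendered = (
    '<a id="meta"><h3><span id="video-title">'
    "Comedy Nights Live - Full Episodes</span></h3></a>"
)
raw_shell = (
    '<html><body><div id="content"></div>'
    '<script src="app.js"></script></body></html>'
)

print(has_element_with_id(rendered, "video-title"))   # True
print(has_element_with_id(raw_shell, "video-title"))  # False
```

If the same check returns False on the body that Scrapy actually downloaded, no XPath against that response can match the element, which would explain the empty list.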
Try turning AJAX off with ajax=0:
https://www.youtube.com/user/KapilComedyNights/playlists/?ajax=0&app=desktop
You get a different response:
response.xpath('//div[@class="yt-lockup-content"]/h3/a/@title').extract()
[u'Comedy - Full Episodes',
u'Comedy - Audio',
u'Comedy Nights Live',
u'Comedy Nights with Kapil - Shorts',
u'Comedy Nights Live - Full Episodes',
u'COMEDY NIGHTS LIVE - FULL EPISODES',
u'Comedy Nights Bachao',
u'COMEDY NIGHTS BACHAO - FULL EPISODES',
u'Comedy Nights Bachao - Full Episodes']