Stuck trying to scrape information from a bunch of tags

Time: 2018-04-15 18:18:35

Tags: python-2.7 web-scraping scrapy

I am trying to extract information enclosed by a bunch of tags like the ones attached below. The data I want is the text "Comedy Nights Live - Full Episodes". I used

response.xpath("//h3/span/text()").extract()
response.xpath('//*[@id="meta"]/h3/span/text()').extract()

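The queries run, but every time I get an empty list. There may be a mistake in how I am accessing the data, but as a beginner I need guidance on how to reach the desired result.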

<a id="meta" class="yt-simple-endpoint style-scope ytd-grid-playlist-renderer" href="/watch?v=q1XwumKHSg8&amp;list=PLX18mvVSh-bz3qlgf-uomp8zktOG5Rdj3">
  <h3 class="style-scope ytd-grid-playlist-renderer">
      <span id="video-title" class="style-scope ytd-grid-playlist-renderer">
        Comedy Nights Live - Full Episodes
    </span>
  </h3>
</a>

This is the spider file.

# -*- coding: utf-8 -*-
import scrapy


class YtubeSpider(scrapy.Spider):
    name = 'ytube'
    allowed_domains = ['www.youtube.com/user/KapilComedyNights/playlists']
    start_urls = ['http://www.youtube.com/user/KapilComedyNights/playlists/']

    def parse(self, response):
        pass

Scrapy, Python 2.7.

1 Answer:

Answer 0 (score: 0)

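Use your browser's developer tools to look at how the page is put together. You will see that YouTube uses AJAX: the playlist markup is built in the browser, so it is not in the HTML that Scrapy downloads. Download the AJAX data directly and parse that instead. Also keep in mind that Scrapy accesses the site anonymously, so the response may differ from what you see in a logged-in browser.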
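As a rough illustration of why the queries come back empty, the sketch below runs the same `h3/span` path against two strings: the rendered markup from the question, and a hypothetical stand-in for the raw download in which the playlist has not been built yet. It uses the standard library's `xml.etree.ElementTree` rather than Scrapy's selectors, so it only approximates what `response.xpath` does. `RAW_DOWNLOAD` and `titles` are made-up names for this example.

```python
import xml.etree.ElementTree as ET

# Markup as seen in the browser's developer tools (copied from the question).
RENDERED = """<a id="meta" class="yt-simple-endpoint style-scope ytd-grid-playlist-renderer" href="/watch?v=q1XwumKHSg8&amp;list=PLX18mvVSh-bz3qlgf-uomp8zktOG5Rdj3">
  <h3 class="style-scope ytd-grid-playlist-renderer">
    <span id="video-title" class="style-scope ytd-grid-playlist-renderer">
      Comedy Nights Live - Full Episodes
    </span>
  </h3>
</a>"""

# Hypothetical stand-in for the raw download: no playlist markup yet,
# only a script that would build it in the browser.
RAW_DOWNLOAD = "<html><body><script>/* builds the playlist */</script></body></html>"


def titles(markup):
    """Return the stripped text of every h3/span element in the markup."""
    root = ET.fromstring(markup)
    return [(span.text or "").strip() for span in root.findall(".//h3/span")]


print(titles(RENDERED))
print(titles(RAW_DOWNLOAD))
```

The path matches the rendered markup and finds nothing in the stand-in download, which suggests the XPath itself is fine and the problem is what Scrapy receives.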

Try turning AJAX off with ajax=0:

https://www.youtube.com/user/KapilComedyNights/playlists/?ajax=0&app=desktop

You get a different response:

response.xpath('//div[@class="yt-lockup-content"]/h3/a/@title').extract()
[u'Comedy - Full Episodes', 
u'Comedy - Audio', 
u'Comedy Nights Live', 
u'Comedy Nights with Kapil - Shorts',  
u'Comedy Nights Live - Full Episodes',  
u'COMEDY NIGHTS LIVE - FULL EPISODES',  
u'Comedy Nights Bachao',  
u'COMEDY NIGHTS BACHAO - FULL EPISODES',  
u'Comedy Nights Bachao - Full Episodes']