My Scrapy crawler does not find nested href tags

Time: 2017-02-14 13:13:27

Tags: python xpath web-scraping scrapy

Hi, I have a Scrapy crawler as follows:

import sys, getopt
import scrapy
from scrapy.spiders import Spider
from scrapy.http    import Request
import re

class TutsplusItem(scrapy.Item):
  title = scrapy.Field()



class MySpider(Spider):
  name = "tutsplus"
  allowed_domains   = ["bbc.com"]
  start_urls = ["http://www.bbc.com/"]
  crawling_level=None

  def __init__(self,crawling_level, *args):
      MySpider.crawling_level=crawling_level
      super(MySpider, self).__init__(self)



  def parse(self, response):
    links = response.xpath('//a/@href').extract()
    print("Links are %s" %links)
    print ("Crawling level is %s " %MySpider.crawling_level )




    # We stored already crawled links in this list
    level=MySpider.crawling_level
    crawledLinks = []

    # Pattern to check proper link
    # I only want to get the tutorial posts
   # linkPattern = re.compile("^\/tutorials\?page=\d+")




    for link in links:
      # If it is a proper link and is not checked yet, yield it to the Spider
      #if linkPattern.match(link) and not link in crawledLinks:
      if not link in crawledLinks and level>0:
        link = "http://www.bbc.com" + link
        crawledLinks.append(link)
        yield Request(link, self.parse)



    titles = response.xpath('//a[contains(@class, "media__link")]/@*').extract()
    #titles = response.xpath('//a/@href').extract()
    print ("Titles are %s" %titles )

    count=0
    for title in titles:
      item = TutsplusItem()
      item["title"] = title
      print("Title is : %s" %title)
      yield item

However, my code has a problem.

For the line

titles = response.xpath('//a[contains(@class, "media__link")]').extract()

it does not return any links. The HTML is as follows:

<h3 class="media__title">
                        <a class="media__link" href="/news/world-us-canada-38965557"
                                  rev="hero1|headline" >
                                                            Trump adviser quits over Russia contacts                                                    </a>
                    </h3>

My output titles are always empty. Is there something wrong with my XPath? Thanks for your help.

1 answer:

Answer 0 (score: 0)

The XPath is incorrect! Use Chrome DevTools to debug the XPath:

"//a[@class='media__link']/@href"

titles = response.xpath('//a[@class="media__link"]/@href').extract()
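
To check the class-based match without running the whole spider, the HTML fragment from the question can be parsed offline. The following is only a minimal sketch: it assumes the standard library's xml.etree.ElementTree as a stand-in for Scrapy's selector. ElementTree supports only a subset of XPath, so the attribute is read with .get("href") rather than /@href. The names SNIPPET, BASE_URL and extract_links are hypothetical, and resolving the relative href with urljoin is an addition that the original answer does not mention.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# HTML fragment copied from the question (whitespace trimmed).
SNIPPET = """
<h3 class="media__title">
  <a class="media__link" href="/news/world-us-canada-38965557"
     rev="hero1|headline" >
    Trump adviser quits over Russia contacts
  </a>
</h3>
"""

# Assumption: the same site as start_urls in the spider.
BASE_URL = "http://www.bbc.com/"


def extract_links(fragment, base_url=BASE_URL):
    """Return (title, absolute_url) pairs for each <a class="media__link">."""
    root = ET.fromstring(fragment)
    results = []
    # ElementTree matches the attribute value exactly; it has no contains().
    for anchor in root.findall(".//a[@class='media__link']"):
        href = anchor.get("href")
        if href is None:
            continue
        title = (anchor.text or "").strip()
        # urljoin resolves relative hrefs and leaves absolute ones unchanged.
        results.append((title, urljoin(base_url, href)))
    return results


links = extract_links(SNIPPET)
print(links)
```

If this finds the link in the fragment but the spider still returns nothing, the page that Scrapy downloads probably differs from the one shown in the browser. Running scrapy shell "http://www.bbc.com/" and trying the XPath against response there is one way to confirm that.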