Question

Hi, I have a Scrapy crawler as follows:

import sys, getopt
import scrapy
from scrapy.spiders import Spider
from scrapy.http    import Request
import re

class TutsplusItem(scrapy.Item):
  title = scrapy.Field()



class MySpider(Spider):
  name = "tutsplus"
  allowed_domains   = ["bbc.com"]
  start_urls = ["http://www.bbc.com/"]
  crawling_level=None

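  def __init__(self, crawling_level, *args, **kwargs):
      # Spider arguments passed with -a arrive as strings, so convert before comparing with 0.
      MySpider.crawling_level = int(crawling_level)
      # Forward the remaining arguments; the original call passed `self` as the spider name.
      super(MySpider, self).__init__(*args, **kwargs)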



  def parse(self, response):
    links = response.xpath('//a/@href').extract()
    print("Links are %s" %links)
    print ("Crawling level is %s " %MySpider.crawling_level )




    # We stored already crawled links in this list
    level=MySpider.crawling_level
    crawledLinks = []

    # Pattern to check proper link
    # I only want to get the tutorial posts
   # linkPattern = re.compile("^\/tutorials\?page=\d+")




    for link in links:
      # Resolve relative links against the page URL so the duplicate check compares full URLs
      link = response.urljoin(link)
      # If it is a proper link and is not checked yet, yield it to the Spider
      #if linkPattern.match(link) and not link in crawledLinks:
      if link not in crawledLinks and level > 0:
        crawledLinks.append(link)
        yield Request(link, self.parse)



    titles = response.xpath('//a[contains(@class, "media__link")]/@*').extract()
    #titles = response.xpath('//a/@href').extract()
    print ("Titles are %s" %titles )

    count=0
    for title in titles:
      item = TutsplusItem()
      item["title"] = title
      print("Title is : %s" %title)
      yield item

However, there is a problem with my code.

For the line

titles = response.xpath('//a[contains(@class, "media__link")]').extract()

it does not return any links. The HTML is as follows:

<h3 class="media__title">
                        <a class="media__link" href="/news/world-us-canada-38965557"
                                  rev="hero1|headline" >
                                                            Trump adviser quits over Russia contacts                                                    </a>
                    </h3>

My output titles are always empty. Is there something wrong with my XPath? Thanks for your help.

Answer 1

The XPath is incorrect! Use Chrome DevTools to debug the XPath:

"//a[@class='media__link']/@href"

titles = response.xpath("//a[@class='media__link']/@href").extract()
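
To check the selector outside the spider, here is a minimal sketch that runs an equivalent query against the HTML fragment from the question using only the Python standard library. It assumes the fragment is well-formed XML (which this one happens to be); the helper name extract_links is hypothetical. ElementTree supports only a subset of XPath, so the attribute is read with .get("href") rather than a trailing /@href step as in the Scrapy expression above.

```python
import xml.etree.ElementTree as ET

# HTML fragment copied from the question (whitespace trimmed).
HTML = """
<h3 class="media__title">
    <a class="media__link" href="/news/world-us-canada-38965557"
       rev="hero1|headline">
        Trump adviser quits over Russia contacts
    </a>
</h3>
"""


def extract_links(fragment):
    """Return (href, title) pairs for every <a class="media__link"> element."""
    root = ET.fromstring(fragment)
    results = []
    # Exact match on the class attribute, like //a[@class='media__link'].
    for anchor in root.findall(".//a[@class='media__link']"):
        href = anchor.get("href")
        title = (anchor.text or "").strip()
        results.append((href, title))
    return results


links = extract_links(HTML)
print(links)
```

If the query matches the fragment here but the spider still yields nothing, one possible cause is that the HTML Scrapy downloads differs from what the browser shows. Opening the page with scrapy shell "http://www.bbc.com/" and trying the expression on the response there is one way to check.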
