Question

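In my web-scraping project I have to scrape football match data from https://www.national-football-teams.com/country/67/2018/France.html. To navigate from that URL to the match data, I have to follow a hyperlink whose href contains a hash: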

<a href="#matches" data-toggle="tab">Matches</a>

The standard Scrapy mechanism for following links:

  href = response.xpath("//a[contains(@href,'matches')]/@href").extract_first()
  href = response.urljoin(href)

yields a link that does not lead to the match data: https://www.national-football-teams.com/matches.html

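I would appreciate any help. Since I am not an expert in web scraping or anything related to web development, more specific advice and/or a minimal working example would be highly appreciated. For completeness, here is the full code of my Scrapy spider: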

import scrapy

class NationalFootballTeams(scrapy.Spider):
    name = "nft"

    start_urls = ['https://www.national-football-teams.com/continent/1/Europe.html']

    def parse(self, response):

        for country in response.xpath("//div[@class='row country-teams']/div[1]/ul/li/a"):
            cntry = country.xpath("text()").extract_first().strip()

            if cntry == 'France':
                href = country.xpath("@href").extract_first()

                yield response.follow(href, self.parse_country)

    def parse_country(self, response):
        # Takes the first link whose href contains 'matches' and requests it
        href = response.xpath("//a[contains(@href,'matches')]/@href").extract_first()
        href = response.urljoin(href)
        print(href)
        yield scrapy.Request(url=href, callback=self.parse_matches)

    def parse_matches(self, response):
        print(response.xpath("//tr[@class='win']").extract())

Answer 1

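Clicking that link does not load a new page or even new data: the content is already in the HTML, it is just hidden. Clicking the link runs some JavaScript that hides the current tab and shows the new one. So to get the data you do not need to follow any link; you only need a different XPath query. The match data is under the XPath //div[@id='matches'].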

How to follow a hyperlink in Scrapy if the href attribute contains a hash sign
