Question

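I'm new to Scrapy and trying to scrape some simple HTML tables. I found a site where two different tables on the same page share the same schema, but the same scraping approach works for one of them and not for the other. Here is the link: https://fbref.com/en/comps/12/stats/La-Liga-Stats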

The code that works (first table, the one at the top):

import scrapy


class PostSpider(scrapy.Spider):

    name = 'stats'

    start_urls = [
        'https://fbref.com/en/comps/12/stats/La-Liga-Stats',
    ]

    def parse(self, response):
        # One dict per row of the squads table
        for row in response.xpath('//*[@id="stats_standard_squads"]//tbody/tr'):
            yield {
                'players': row.xpath('td[2]//text()').extract_first(),
                'possession': row.xpath('td[3]//text()').extract_first(),
                'played': row.xpath('td[4]//text()').extract_first(),
                'starts': row.xpath('td[5]//text()').extract_first(),
                'minutes': row.xpath('td[6]//text()').extract_first(),
                'goals': row.xpath('td[7]//text()').extract_first(),
                'assists': row.xpath('td[8]//text()').extract_first(),
                'penalties': row.xpath('td[9]//text()').extract_first(),
            }

Now, for some reason, when I try to scrape the table below it (using the corresponding XPath selector), it returns nothing:

import scrapy


class PostSpider(scrapy.Spider):

    name = 'stats'

    start_urls = [
        'https://fbref.com/en/comps/12/stats/La-Liga-Stats',
    ]

    def parse(self, response):
        # Same approach, but pointed at the players table
        for row in response.xpath('//*[@id="stats_standard"]//tbody/tr'):
            yield {
                'player': row.xpath('td[2]//text()').extract_first(),
                'nation': row.xpath('td[3]//text()').extract_first(),
                'pos': row.xpath('td[4]//text()').extract_first(),
                'squad': row.xpath('td[5]//text()').extract_first(),
                'age': row.xpath('td[6]//text()').extract_first(),
                'born': row.xpath('td[7]//text()').extract_first(),
                '90s': row.xpath('td[8]//text()').extract_first(),
                'att': row.xpath('td[9]//text()').extract_first(),
            }

I run the spider from the terminal with:

scrapy crawl stats

What is the reason for this? As far as I can tell, the two tables have the same structure.

Answer 1

The problem is that id="stats_standard" is not available as live markup in the page source that Scrapy downloads; you only see it in the live HTML in the browser. In the source, that table exists only as commented-out code.

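Try view-source:https://fbref.com/en/comps/12/stats/La-Liga-Stats to see this for yourself. You need to pull the table out of the comment with a regular expression, or you can use a library to parse it, for example Scrapy's Selector: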

from scrapy import Selector
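
To make the regular-expression route concrete, here is a minimal sketch that uses only the standard library. It assumes the table sits inside a regular `<!-- ... -->` comment, that its id attribute is written with double quotes, and that it contains no nested tables. The helper name `extract_commented_table` and the sample markup are made up for illustration; they are not part of Scrapy or of the real page.

```python
import re

# Matches one HTML comment and captures its body (non-greedy, spans newlines).
COMMENT_RE = re.compile(r"<!--(.*?)-->", re.DOTALL)


def extract_commented_table(html, table_id):
    """Return the <table> markup with the given id if it sits inside an
    HTML comment, or None when no comment contains such a table.

    Hypothetical helper for illustration; assumes id="..." uses double
    quotes and that the table has no nested <table> elements.
    """
    table_re = re.compile(
        r'<table\b[^>]*\bid="%s"[^>]*>.*?</table>' % re.escape(table_id),
        re.DOTALL,
    )
    # Only look inside comment bodies, never at the live markup.
    for comment_body in COMMENT_RE.findall(html):
        match = table_re.search(comment_body)
        if match:
            return match.group(0)
    return None


# Made-up sample standing in for the downloaded page source.
sample = """
<div id="all_stats_standard">
<!--
<table id="stats_standard">
<tbody><tr><th>1</th><td>Player A</td><td>es ESP</td></tr></tbody>
</table>
-->
</div>
"""

fragment = extract_commented_table(sample, "stats_standard")
print(fragment)
```

Inside the spider's parse method, the same call could be made on response.text, and the returned fragment wrapped as Selector(text=fragment) so that the row XPath from the question can be applied to it unchanged.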
