Question

我正在尝试从该网址中抓取第一个表格：

https://www.whoscored.com/Matches/318578/LiveStatistics/England-Premier-League-2009-2010-Blackburn-Arsenal

使用以下代码：

url <- "https://www.whoscored.com/Matches/318578/LiveStatistics/England-Premier-League-2009-2010-Blackburn-Arsenal"
data <- url %>%
  read_html() %>%
  html_nodes(xpath='//*[@id="top-player-stats-summary-grid"]')

该数据的值为{xml_nodeset (0)}

url <- "https://www.whoscored.com/Matches/318578/LiveStatistics/England-Premier-League-2009-2010-Blackburn-Arsenal"
data <- url %>%
  read_html() %>%
  html_nodes(css='.grid')

遇到相同的问题。

显然这可能是JavaScript问题-是否有快速的方法来提取相关数据？检查表条目似乎表明数据不是从其他地方导入的，而是被编码到页面中的，因此看来我应该能够从源代码中提取数据（对不起，我完全不了解HTML和JS的工作方式，因此我的问题可能没有道理。

Answer 1

使用浏览器时，页面通过页面上运行的javascript动态更新内容。 rvest不会发生这种情况。但是，您可以在开发工具网络标签中观察xhr调用，该调用将以json返回此内容

require(httr)
require(jsonlite)

headers = c('user-agent' = 'Mozilla/5.0',
            'accept' = 'application/json, text/javascript, */*; q=0.01',
           'referer' = 'https://www.whoscored.com/Matches/318578/LiveStatistics/England-Premier-League-2009-2010-Blackburn-Arsenal',
            'authority' = 'www.whoscored.com',
            'x-requested-with' = 'XMLHttpRequest')

params = list(
  'category' = 'summary',
  'subcategory' = 'all',
  'statsAccumulationType' = '0',
  'isCurrent' = 'true',
  'playerId' = '',
  'teamIds' = '158',
  'matchId' = '318578',
  'stageId' = '',
  'tournamentOptions' = '',
  'sortBy' = '',
  'sortAscending' = '',
  'age' = '',
  'ageComparisonType' = '',
  'appearances' = '',
  'appearancesComparisonType' = '',
  'field' = '',
  'nationality' = '',
  'positionOptions' = '',
  'timeOfTheGameEnd' = '',
  'timeOfTheGameStart' = '',
  'isMinApp' = '',
  'page' = '',
  'includeZeroValues' = '',
  'numberOfPlayersToPick' = ''
)

r <- httr::GET(url = 'https://www.whoscored.com/StatisticsFeed/1/GetMatchCentrePlayerStatistics', httr::add_headers(.headers=headers), query = params)

data <- jsonlite::fromJSON(content(r,as="text") )
print(data$playerTableStats)

通过data$playerTableStats的{{1}}的内容的小样本。您将根据需要解析所需格式的信息。

网络抓取表格时出现{xml_nodeset（0）}问题

1 个答案: