I am trying to scrape the first table from the URL below, using the following code:
library(rvest)  # provides read_html(), html_nodes() and the %>% pipe

url <- "https://www.whoscored.com/Matches/318578/LiveStatistics/England-Premier-League-2009-2010-Blackburn-Arsenal"
data <- url %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="top-player-stats-summary-grid"]')
This leaves data with the value {xml_nodeset (0)}. I also tried a CSS selector:
url <- "https://www.whoscored.com/Matches/318578/LiveStatistics/England-Premier-League-2009-2010-Blackburn-Arsenal"
data <- url %>%
  read_html() %>%
  html_nodes(css = '.grid')
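This runs into the same problem.
Apparently this may be a JavaScript issue. Is there a quick way to extract the relevant data? Inspecting the table entries seems to suggest that the data is not imported from elsewhere but is encoded in the page itself, so it looks like I should be able to pull it out of the source code. (Sorry, I know nothing about how HTML and JS work, so my question may not make sense.)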
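Answer 0 (score: 2)
In a browser, the page updates its content dynamically through JavaScript running on the page. That does not happen with rvest, which only sees the initial HTML. However, in the Network tab of the browser's developer tools you can observe the XHR call that returns this content as JSON, and you can make that request yourself: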
library(httr)
library(jsonlite)

# Request headers copied from the browser's XHR call
headers <- c(
  'user-agent' = 'Mozilla/5.0',
  'accept' = 'application/json, text/javascript, */*; q=0.01',
  'referer' = 'https://www.whoscored.com/Matches/318578/LiveStatistics/England-Premier-League-2009-2010-Blackburn-Arsenal',
  'authority' = 'www.whoscored.com',
  'x-requested-with' = 'XMLHttpRequest'
)

# Query parameters copied from the same call; empty values are left empty
params <- list(
  'category' = 'summary',
  'subcategory' = 'all',
  'statsAccumulationType' = '0',
  'isCurrent' = 'true',
  'playerId' = '',
  'teamIds' = '158',
  'matchId' = '318578',
  'stageId' = '',
  'tournamentOptions' = '',
  'sortBy' = '',
  'sortAscending' = '',
  'age' = '',
  'ageComparisonType' = '',
  'appearances' = '',
  'appearancesComparisonType' = '',
  'field' = '',
  'nationality' = '',
  'positionOptions' = '',
  'timeOfTheGameEnd' = '',
  'timeOfTheGameStart' = '',
  'isMinApp' = '',
  'page' = '',
  'includeZeroValues' = '',
  'numberOfPlayersToPick' = ''
)

r <- httr::GET(
  url = 'https://www.whoscored.com/StatisticsFeed/1/GetMatchCentrePlayerStatistics',
  httr::add_headers(.headers = headers),
  query = params
)
httr::stop_for_status(r)  # fail early if the request is rejected

# Parse the JSON body; the player table is in the playerTableStats element
data <- jsonlite::fromJSON(httr::content(r, as = "text", encoding = "UTF-8"))
print(data$playerTableStats)
The player statistics are available through data$playerTableStats. You will need to parse that for the information you want, in the format you want.
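As a minimal sketch of that parsing step, the snippet below runs on a small hand-written JSON string rather than the real response. The field names (name, rating, isManOfTheMatch) and the values are hypothetical placeholders, so check names(data$playerTableStats) on the real result and substitute the columns you actually need.

```r
library(jsonlite)

# Hypothetical stand-in for the response body; the real fields may differ.
sample_json <- '{"playerTableStats": [
  {"name": "Player A", "rating": 6.8, "isManOfTheMatch": false},
  {"name": "Player B", "rating": 7.9, "isManOfTheMatch": true}
]}'

parsed <- fromJSON(sample_json)
stats <- parsed$playerTableStats  # fromJSON turns the array of objects into a data.frame

# Keep only the columns of interest and sort by rating, highest first.
wanted <- c("name", "rating")
result <- stats[order(-stats$rating), wanted]
rownames(result) <- NULL
print(result)
```

Because fromJSON simplifies an array of records into a data frame, ordinary data frame subsetting is usually enough here. Nested fields, if the real response has any, would need extra handling.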