In R

Time: 2019-07-18 00:17:16

Tags: r web-scraping

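I'm having trouble scraping a dynamically rendered page. I've looked through similar posts here for an answer, but I know so little JavaScript that I can't follow them.

I want to scrape every table here: https://www.espn.com/golf/leaderboard/_/tournamentId/401056558

I have the leaderboard working, but because the other tables are rendered dynamically, I can't figure out how to get the "Player Stats" and "Course Stats" tables.

I don't know enough JavaScript to know where to start. I've read that V8 is a useful package, but I don't understand why.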

# clears the R workspace
rm(list = ls())

# setwd(getwd()) leaves the working directory unchanged; output files are
# written to the current working directory
setwd(getwd())

# loads in xml2 for the read_html function
library(xml2)
# loads in rvest for the html_text function
library(rvest)
# for handling the dynamically rendered javascript
library(V8)

url <- 'https://www.espn.com/golf/leaderboard/_/tournamentId/401056558'
golf_webpage <- read_html(url)


# this block of code loads in and formats the leaderboard

# loads in the leaderboard data
leaderboard_text_html <- html_nodes(golf_webpage, '.Table2__td')
leaderboard_text <- html_text(leaderboard_text_html)

# creates a matrix with one row per leaderboard column (10 in total) and one column per golfer
leaderboard <- matrix(leaderboard_text, nrow = 10, ncol = length(leaderboard_text) / 10)

# transposes the matrix so each row is a golfer and each column is a leaderboard column
leaderboard <- t(leaderboard)
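
The fill-then-transpose step above can also be written as a single row-wise fill. The sketch below uses a small made-up vector (`cells`, with hypothetical player names) in place of the scraped text, and assumes, as the code above does, that the cells arrive in reading order, one golfer after another.

```r
# Toy stand-in for the scraped cell text: 2 golfers x 3 columns, in reading order.
# The values are invented for illustration only.
cells <- c("1", "Player A", "-10",
           "2", "Player B", "-8")
n_cols <- 3

# approach used above: fill column-wise, then transpose
via_transpose <- t(matrix(cells, nrow = n_cols))

# same result in one step: fill row-wise
via_byrow <- matrix(cells, ncol = n_cols, byrow = TRUE)

# both give one row per golfer and one column per leaderboard field
identical(via_transpose, via_byrow)
```

With the real data, `n_cols` would be 10 and `cells` would be `leaderboard_text`.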

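I want to figure out how to switch to the Player Stats and Course Stats tables so that I can read them.

Edit: I tried reading all of the tables into a list. It reports 3 tables, which is the number I want, but only the last one (the leaderboard) is readable.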

# loads xml2 for read_html
library(xml2)
# loads rvest for html_nodes, html_table and the %>% pipe
library(rvest)

# loads in the espn golf webpage as html
golf_webpage <- read_html('https://www.espn.com/golf/leaderboard/_/tournamentId/401056558')

tables_list <- golf_webpage %>%
  html_nodes("table") %>%
  html_table(fill = TRUE)
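
One way to check which of the tables actually carry data in the downloaded HTML is to count the rows in each one. The sketch below runs on a small inline HTML string rather than the ESPN page. It assumes that a table filled in later by JavaScript shows up as an empty shell in the static HTML; the post does not show what the two unreadable tables contain, so that is a guess.

```r
library(xml2)
library(rvest)

# Stand-in for the downloaded page (invented for illustration): one empty
# table shell and one populated table.
page <- read_html('<html><body>
  <table id="empty"></table>
  <table id="filled">
    <tr><th>POS</th><th>PLAYER</th></tr>
    <tr><td>1</td><td>Player A</td></tr>
  </table>
</body></html>')

tables <- html_nodes(page, "table")

# count the <tr> rows inside each table; a count of zero means the static
# HTML holds nothing for html_table() to parse
row_counts <- vapply(
  tables,
  function(tbl) length(html_nodes(tbl, "tr")),
  integer(1)
)

row_counts
```

Running the same row count on `html_nodes(golf_webpage, "table")` would show whether the other two tables are empty in the HTML that `read_html` receives.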

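1 answer:

Answer 0 (score: 0)

Open the browser's developer tools, then click the Player Stats and Course Stats tabs on the source page. You will see the following API calls, which return JSON.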

library(jsonlite)

stats <- jsonlite::read_json('https://site.web.api.espn.com/apis/site/v2/sports/golf/pga/leaderboard/players?region=uk&lang=en&event=401056558')
course <- jsonlite::read_json('https://site.web.api.espn.com/apis/site/v2/sports/golf/pga/leaderboard/course?region=uk&lang=en&event=401056558')
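
`read_json` returns nested lists, so a reasonable next step is to inspect the structure before pulling anything out. The sketch below is not based on the real ESPN response: the JSON string and its field names (`players`, `name`, `drivingDistance`) are made up purely to show how `jsonlite::fromJSON` turns an array of records into a data frame.

```r
library(jsonlite)

# Hypothetical payload: these field names are invented for illustration and
# are NOT taken from the ESPN API response.
sample_json <- '{"players": [
  {"name": "Player A", "drivingDistance": 301.5},
  {"name": "Player B", "drivingDistance": 295.2}
]}'

# fromJSON() simplifies an array of records into a data frame by default
parsed <- jsonlite::fromJSON(sample_json)

# inspect the top-level structure before extracting anything
str(parsed, max.level = 1)

# the array of records is now a data frame with one row per record
players_df <- parsed$players
```

For the real data, calling `str(stats, max.level = 2)` on the objects above should reveal which elements hold the table rows; `fromJSON` also accepts a URL if a simplified result is preferred over raw lists.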