您好我正试图从premierleague
网站中提取该表格。
我使用的包是rvest
包,我在初始阶段使用的代码如下:
library(rvest)
library(magrittr)
premierleague <- read_html("https://fantasy.premierleague.com/a/entry/767830/history")
premierleague %>% html_nodes("ism-table")
我找不到一个能够为rvest包提取html_nodes
的html标签。
我使用类似的方法从“http://admissions.calpoly.edu/prospective/profile.html”中提取数据,我能够提取数据。我用于calpoly的代码如下:
library(rvest)
library(magrittr)
CPadmissions <- read_html("http://admissions.calpoly.edu/prospective/profile.html")
CPadmissions %>% html_nodes("table") %>%
.[[1]] %>%
html_table()
通过以下链接从youtube获取上述代码:https://www.youtube.com/watch?v=gSbuwYdNYLM&ab_channel=EvanO%27Brien
任何有关从fantasy.premierleague.com获取数据的帮助都非常感谢。我需要使用某种API吗?
答案 0 :(得分:7)
由于数据是使用JavaScript加载的,因此使用rvest获取HTML将无法满足您的需求,但如果您使用PhantomJS作为RSelenium中的无头浏览器,那么它并不是那么复杂(通过RSelenium标准):
library(RSelenium)
library(rvest)
# initialize browser and driver with RSelenium
ptm <- phantom()
rd <- remoteDriver(browserName = 'phantomjs')
rd$open()
# grab source for page
rd$navigate('https://fantasy.premierleague.com/a/entry/767830/history')
html <- rd$getPageSource()[[1]]
# clean up
rd$close()
ptm$stop()
# parse with rvest
df <- html %>% read_html() %>%
html_node('#ismr-event-history table.ism-table') %>%
html_table() %>%
setNames(gsub('\\S+\\s+(\\S+)', '\\1', names(.))) %>% # clean column names
setNames(gsub('\\s', '_', names(.)))
str(df)
## 'data.frame': 20 obs. of 10 variables:
## $ Gameweek : chr "GW1" "GW2" "GW3" "GW4" ...
## $ Gameweek_Points : int 34 47 53 51 66 66 65 63 48 90 ...
## $ Points_Bench : int 1 6 9 7 14 2 9 3 8 2 ...
## $ Gameweek_Rank : chr "2,406,373" "2,659,789" "541,258" "905,524" ...
## $ Transfers_Made : int 0 0 2 0 3 2 2 0 2 0 ...
## $ Transfers_Cost : int 0 0 0 0 4 4 4 0 0 0 ...
## $ Overall_Points : chr "34" "81" "134" "185" ...
## $ Overall_Rank : chr "2,406,373" "2,448,674" "1,914,025" "1,461,665" ...
## $ Value : chr "£100.0" "£100.0" "£99.9" "£100.0" ...
## $ Change_Previous_Gameweek: logi NA NA NA NA NA NA ...
与往常一样,需要进行更多的清洁工作,但总的来说,如果没有太多的工作,它的状态会非常好。 (如果你正在使用tidyverse,df %>% mutate_if(is.character, parse_number)
会做得很好。)箭头是图像,这就是为什么最后一列都是NA
,但你仍然可以计算它们。
答案 1 :(得分:1)
此解决方案使用RSelenium和包XML
。它还假定您的RSelenium
工作安装可以正常使用firefox
。只需确保将firefox
启动脚本路径添加到PATH
。
如果您使用OS X
,则需要将/Applications/Firefox.app/Contents/MacOS/
添加到PATH
。或者,如果您使用的是Ubuntu计算机,则可能是/usr/lib/firefox/
。一旦您确定这是有效的,您可以使用以下内容继续使用R:
# Install RSelenium and XML for R
#install.packages("RSelenium")
#install.packages("XML")
# Import packages
library(RSelenium)
library(XML)
# Check and start servers for Selenium
checkForServer()
startServer()
# Use firefox as a browser and a port that's not used
remote_driver <- remoteDriver(browserName="firefox", port=4444)
remote_driver$open(silent=T)
# Use RSelenium to browse the site
epl_link <- "https://fantasy.premierleague.com/a/entry/767830/history"
remote_driver$navigate(epl_link)
elem <- remote_driver$findElement(using="class", value="ism-table")
# Get the HTML source
elemtxt <- elem$getElementAttribute("outerHTML")
# Use the XML package to work with the HTML source
elem_html <- htmlTreeParse(elemtxt, useInternalNodes = T, asText = TRUE)
# Convert the table into a dataframe
games_table <- readHTMLTable(elem_html, header = T, stringsAsFactors = FALSE)[[1]]
# Change the column names into something legible
names(games_table) <- unlist(lapply(strsplit(names(games_table), split = "\\n\\s+"), function(x) x[2]))
names(games_table) <- gsub("£", "Value", gsub("#", "CPW", gsub("Â","",names(games_table))))
# Convert the fields into numeric values
games_table <- transform(games_table, GR = as.numeric(gsub(",","",GR)),
OP = as.numeric(gsub(",","",OP)),
OR = as.numeric(gsub(",","",OR)),
Value = as.numeric(gsub("£","",Value)))
这应该产生:
GW GP PB GR TM TC OP OR Value CPW
GW1 34 1 2406373 0 0 34 2406373 100.0
GW2 47 6 2659789 0 0 81 2448674 100.0
GW3 53 9 541258 2 0 134 1914025 99.9
GW4 51 7 905524 0 0 185 1461665 100.0
GW5 66 14 379438 3 4 247 958889 100.1
GW6 66 2 303704 2 4 309 510376 99.9
GW7 65 9 138792 2 4 370 232474 99.8
GW8 63 3 108363 0 0 433 87967 100.4
GW9 48 8 1114609 2 0 481 75385 100.9
GW10 90 2 71210 0 0 571 27716 101.1
GW11 71 2 421706 3 4 638 16083 100.9
GW12 35 9 2798661 2 4 669 31820 101.2
GW13 41 8 2738535 1 0 710 53487 101.1
GW14 82 15 308725 0 0 792 29436 100.2
GW15 55 9 1048808 2 4 843 29399 100.6
GW16 49 8 1801549 0 0 892 35142 100.7
GW17 48 4 2116706 2 0 940 40857 100.7
GW18 42 2 3315031 0 0 982 78136 100.8
GW19 41 9 2600618 0 0 1023 99048 100.6
GW20 53 0 1644385 0 0 1076 113148 100.8
请注意,列CPW(从上周开始更改)是空字符串的向量。
我希望这会有所帮助。