Extracting an HTML table from a website in R

Time: 2017-01-05 23:30:54

Tags: r html-table rvest

Hello, I am trying to extract the table from the premierleague website.

The package I am using is rvest, and the code from my first attempt is as follows:

library(rvest)
library(magrittr)
premierleague <- read_html("https://fantasy.premierleague.com/a/entry/767830/history")
premierleague %>% html_nodes("ism-table")

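I can't find an HTML tag that html_nodes from the rvest package can use to extract the table.

I used a similar approach to extract data from "http://admissions.calpoly.edu/prospective/profile.html", and there it worked. The code I used for calpoly is as follows: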

library(rvest)
library(magrittr)
CPadmissions <- read_html("http://admissions.calpoly.edu/prospective/profile.html")

CPadmissions %>% html_nodes("table") %>%
  .[[1]] %>%
  html_table()

I got the code above from YouTube, via this link: https://www.youtube.com/watch?v=gSbuwYdNYLM&ab_channel=EvanO%27Brien

Any help with getting the data from fantasy.premierleague.com would be much appreciated. Do I need to use some kind of API?
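A side note on the selector in the first snippet: html_nodes("ism-table") is read as a CSS selector for elements whose tag name is ism-table, while a class is selected with a leading dot, as in ".ism-table". The sketch below shows the difference on a made-up inline HTML fragment (an assumption for illustration, not the real page markup). Fixing the selector alone may still return nothing on the live page if the table is not present in the HTML that read_html downloads.

```r
library(rvest)

# Made-up fragment standing in for a rendered page (not the real markup)
page <- read_html('<html><body>
  <table class="ism-table">
    <tr><th>GW</th><th>GP</th></tr>
    <tr><td>GW1</td><td>34</td></tr>
  </table>
</body></html>')

by_tag   <- html_nodes(page, "ism-table")   # looks for <ism-table> elements
by_class <- html_nodes(page, ".ism-table")  # looks for class="ism-table"

length(by_tag)    # number of matches for the tag selector
length(by_class)  # number of matches for the class selector

# Convert the matched table to a data frame
html_table(by_class[[1]])
```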

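2 answers:

Answer 0: (score: 7)

Because the data is loaded with JavaScript, fetching the HTML with rvest will not get you what you want. If you use PhantomJS as a headless browser through RSelenium, though, it is not that complicated (by RSelenium standards):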

library(RSelenium)
library(rvest)

# initialize browser and driver with RSelenium
ptm <- phantom()
rd <- remoteDriver(browserName = 'phantomjs')
rd$open()

# grab source for page
rd$navigate('https://fantasy.premierleague.com/a/entry/767830/history')
html <- rd$getPageSource()[[1]]

# clean up
rd$close()
ptm$stop()

# parse with rvest
df <- html %>% read_html() %>% 
    html_node('#ismr-event-history table.ism-table') %>% 
    html_table() %>% 
    setNames(gsub('\\S+\\s+(\\S+)', '\\1', names(.))) %>%    # clean column names
    setNames(gsub('\\s', '_', names(.)))

str(df)
## 'data.frame':    20 obs. of  10 variables:
##  $ Gameweek                : chr  "GW1" "GW2" "GW3" "GW4" ...
##  $ Gameweek_Points         : int  34 47 53 51 66 66 65 63 48 90 ...
##  $ Points_Bench            : int  1 6 9 7 14 2 9 3 8 2 ...
##  $ Gameweek_Rank           : chr  "2,406,373" "2,659,789" "541,258" "905,524" ...
##  $ Transfers_Made          : int  0 0 2 0 3 2 2 0 2 0 ...
##  $ Transfers_Cost          : int  0 0 0 0 4 4 4 0 0 0 ...
##  $ Overall_Points          : chr  "34" "81" "134" "185" ...
##  $ Overall_Rank            : chr  "2,406,373" "2,448,674" "1,914,025" "1,461,665" ...
##  $ Value                   : chr  "£100.0" "£100.0" "£99.9" "£100.0" ...
##  $ Change_Previous_Gameweek: logi  NA NA NA NA NA NA ...

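As always, more cleaning is needed, but overall the result is in good shape for very little work. (If you are using the tidyverse, df %>% mutate_if(is.character, parse_number) will do a decent job.) The arrows are images, which is why the last column is all NA, but you could still compute those values yourself.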
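For readers not using the tidyverse, the remaining cleanup could be sketched in base R roughly as follows. The data frame here is a three-row stand-in typed from the str() output above, and to_number is a hypothetical helper written for this example, not a function from any package.

```r
# Stand-in for a few columns of df, copied from the str() output above
df_demo <- data.frame(
  Gameweek      = c("GW1", "GW2", "GW3"),
  Gameweek_Rank = c("2,406,373", "2,659,789", "541,258"),
  Value         = c("£100.0", "£100.0", "£99.9"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop everything except digits and the decimal point,
# then convert what is left to a number
to_number <- function(x) as.numeric(gsub("[^0-9.]", "", x))

df_demo$Gameweek_Rank <- to_number(df_demo$Gameweek_Rank)
df_demo$Value         <- to_number(df_demo$Value)

str(df_demo)
```

Unlike applying parse_number to every character column, this leaves Gameweek as text, which may or may not be what you want.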

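Answer 1: (score: 1)

This solution uses RSelenium together with the XML package. It also assumes that you have a working RSelenium installation that can drive Firefox. Just make sure the path to the Firefox launcher has been added to PATH.

On OS X, you need to add /Applications/Firefox.app/Contents/MacOS/ to PATH. On an Ubuntu machine, it is probably /usr/lib/firefox/. Once you have confirmed that this works, you can continue in R with the following: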

# Install RSelenium and XML for R
#install.packages("RSelenium")
#install.packages("XML")

# Import packages
library(RSelenium)
library(XML)

# Check and start servers for Selenium
checkForServer()
startServer()

# Use firefox as a browser and a port that's not used
remote_driver <- remoteDriver(browserName="firefox", port=4444)
remote_driver$open(silent = TRUE)

# Use RSelenium to browse the site
epl_link <- "https://fantasy.premierleague.com/a/entry/767830/history"
remote_driver$navigate(epl_link)
elem <- remote_driver$findElement(using="class", value="ism-table")

# Get the HTML source
elemtxt <- elem$getElementAttribute("outerHTML")

# Use the XML package to work with the HTML source
elem_html <- htmlTreeParse(elemtxt, useInternalNodes = TRUE, asText = TRUE)

# Convert the table into a dataframe
games_table <- readHTMLTable(elem_html, header = TRUE, stringsAsFactors = FALSE)[[1]]

# Change the column names into something legible
names(games_table) <- unlist(lapply(strsplit(names(games_table), split = "\\n\\s+"), function(x) x[2]))
names(games_table) <- gsub("£", "Value", gsub("#", "CPW", gsub("Â","",names(games_table))))

# Convert the fields into numeric values
games_table <- transform(games_table, GR = as.numeric(gsub(",","",GR)),
                    OP = as.numeric(gsub(",","",OP)),
                    OR = as.numeric(gsub(",","",OR)),
                    Value = as.numeric(gsub("£","",Value)))

This should produce:

 GW   GP PB GR     TM TC    OP   OR    Value CPW
 GW1  34 1  2406373 0  0    34 2406373 100.0    
 GW2  47 6  2659789 0  0    81 2448674 100.0    
 GW3  53 9   541258 2  0   134 1914025  99.9    
 GW4  51 7   905524 0  0   185 1461665 100.0    
 GW5  66 14  379438 3  4   247  958889 100.1    
 GW6  66 2   303704 2  4   309  510376  99.9    
 GW7  65 9   138792 2  4   370  232474  99.8    
 GW8  63 3   108363 0  0   433   87967 100.4    
 GW9  48 8  1114609 2  0   481   75385 100.9    
 GW10 90 2    71210 0  0   571   27716 101.1    
 GW11 71 2   421706 3  4   638   16083 100.9    
 GW12 35 9  2798661 2  4   669   31820 101.2    
 GW13 41 8  2738535 1  0   710   53487 101.1    
 GW14 82 15  308725 0  0   792   29436 100.2    
 GW15 55 9  1048808 2  4   843   29399 100.6    
 GW16 49 8  1801549 0  0   892   35142 100.7    
 GW17 48 4  2116706 2  0   940   40857 100.7    
 GW18 42 2  3315031 0  0   982   78136 100.8    
 GW19 41 9  2600618 0  0  1023   99048 100.6    
 GW20 53 0  1644385 0  0  1076  113148 100.8

Note that the CPW column (change from previous week) is a vector of empty strings.

I hope this helps.
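If the empty CPW strings get in the way later, one option is to turn them into NA and let R pick column types. This is only a minimal sketch on a two-row stand-in data frame (values copied from the first rows of the output above). It assumes a version of R whose type.convert accepts data frames.

```r
# Stand-in for a few columns of games_table, all read in as character
games_demo <- data.frame(
  GW  = c("GW1", "GW2"),
  GP  = c("34", "47"),
  CPW = c("", ""),
  stringsAsFactors = FALSE
)

# Replace empty strings with NA
games_demo$CPW[games_demo$CPW == ""] <- NA

# Let R choose column types; as.is = TRUE keeps text columns as character
games_demo <- type.convert(games_demo, as.is = TRUE)

str(games_demo)
```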