Reading a table from an https web page with readHTMLTable

Time: 2016-08-20 19:55:30

Tags: r

I have R 3.3.1 installed and am using RStudio 0.99.903. I am trying to read the table at the following URL into R: https://www.fantasypros.com/nfl/rankings/consensus-cheatsheets.php

(I am well aware that there is a download button, but that is not an option for me.)

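Last year I could do this easily with the readHTMLTable function. Since then, however, the site has switched from http to https, which leads to the "XML content does not seem to be XML" error.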

I tried what was suggested here: get url table into a `data.frame` R-XML-RCurl

library(XML)
library(RCurl)
url <- getURL("https://www.fantasypros.com/nfl/rankings/consensus-cheatsheets.php")
df <- readHTMLTable(url, header = TRUE)

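The getURL function returns a large string that is basically meaningless to me, which means readHTMLTable does not work properly (I get a list with several data frames, but these are also meaningless to me. I'm seeing things in Spanish and have no idea where they come from):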

>url
[1] "\r\n<!DOCTYPE html>\n<html lang=\"en\">\n\n<head>\n    <title>2016 QB Fantasy Football Rankings, QB Cheat Sheets, QB Draft / Draft Rankings</title>\n    <meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\">\n    <meta name=\"description\" content=\"Don&#8217;t trust any 1 fantasy football expert? We combine their rankings into 1 Expert Consensus Ranking. Our 2016 Draft QB rankings are updated daily.\">\n<link rel=\"canonical\" href=\"https://www.fantasypros.com/nfl/rankings/qb-cheatsheets.php\" />\n\n    <meta property=\"fb:pages\" content=\"184352014941166\"/>\n

The output goes on much longer than this.

Can anyone suggest how to get this working?

Thanks.

1 Answer:

Answer 0: (score: 2)

Get the HTML file from the URL

library("httr")
library("XML")
URL <- "https://www.fantasypros.com/nfl/rankings/consensus-cheatsheets.php"
temp <- tempfile(fileext = ".html")
GET(url = URL, user_agent("Mozilla/5.0"), write_disk(temp))

Parse the HTML file

doc <- htmlParse(temp)
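
A side note on the parsing step: the XML package parses through libxml2, which cannot fetch https URLs on its own, which is why the page is downloaded first. If the page is already in memory as a string (for example the result of getURL in the question), htmlParse can also parse it directly with asText = TRUE, so a temporary file is not strictly required. The following is a minimal sketch on a made-up two-row table; `page_text`, `doc_mem` and `tbl` are hypothetical names and the markup is only assumed to resemble the real page.

```r
library(XML)

# Made-up HTML standing in for a downloaded page (not the real site markup)
page_text <- paste0(
  "<html><body><table>",
  "<thead><tr><th>Rank</th><th>Player</th></tr></thead>",
  "<tbody>",
  "<tr><td>1</td><td>Antonio Brown</td></tr>",
  "<tr><td>2</td><td>Julio Jones</td></tr>",
  "</tbody></table></body></html>"
)

# asText = TRUE tells htmlParse that the argument is HTML content, not a file name
doc_mem <- htmlParse(page_text, asText = TRUE)

# which = 1 returns the first table as a data frame instead of a list of tables
tbl <- readHTMLTable(doc_mem, which = 1, stringsAsFactors = FALSE)
print(tbl)
```

On the real page, readHTMLTable may still return several unrelated tables, which is why the answer goes on to select rows with an explicit XPath expression.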

Build the XPath query: select the table element whose class contains "player-table", then the tr rows under its tbody whose class contains 'mpb-player-'

xpexpr <- "//table[contains(@class, 'player-table')]/tbody/tr[contains(@class, 'mpb-player-')]"
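
To see what the two contains() predicates do, here is a small self-contained sketch that runs the same expression against a made-up page. The markup, the 'tier-row' class, and the names `html_demo`, `doc_demo`, `rows` and `players` are assumptions for illustration only; the real page's structure may differ.

```r
library(XML)

# Hypothetical miniature page: one matching table with two player rows
# and one non-player row that the query should skip
html_demo <- paste0(
  "<html><body>",
  "<table class='table player-table'>",
  "<thead><tr><th>Rank</th><th>Player</th></tr></thead>",
  "<tbody>",
  "<tr class='mpb-player-1'><td>1</td><td>Antonio Brown</td></tr>",
  "<tr class='tier-row'><td colspan='2'>Tier 1</td></tr>",
  "<tr class='mpb-player-2'><td>2</td><td>Julio Jones</td></tr>",
  "</tbody></table>",
  "</body></html>"
)
doc_demo <- htmlParse(html_demo, asText = TRUE)

xp <- "//table[contains(@class, 'player-table')]/tbody/tr[contains(@class, 'mpb-player-')]"

# contains() does substring matching, so 'table player-table' and
# 'mpb-player-1' both match; the 'tier-row' row does not
rows <- getNodeSet(doc_demo, xp)
length(rows)   # 2

# Pull the text of the second cell of every matching row
players <- xpathSApply(doc_demo, paste0(xp, "/td[2]"), xmlValue)
players        # "Antonio Brown" "Julio Jones"
```

Because contains() is a plain substring test, it would also match a class such as 'not-a-player-table'; that is acceptable here but worth keeping in mind on other pages.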

Get the list of nodes in doc that match the XPath expression

listofTableNodes <- getNodeSet(doc, xpexpr)
listofTableNodes

Create a data frame from the xmlValues of the node list

df <- xmlToDataFrame(listofTableNodes, stringsAsFactors = FALSE)
# alternatively xpathSApply can be used to get the same data frame
# df <- xmlToDataFrame(xpathSApply(doc, xpexpr), stringsAsFactors = FALSE)

Remove the empty columns

df <- df[, seq(1, length(df), by = 2)]
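
The line above keeps every other column by position, which assumes that the empty columns always sit at the even positions (presumably they come from whitespace between the cells, though the answer does not say so). The sketch below shows that selection on a toy data frame with made-up values, plus an alternative that drops columns by content instead of by position; `toy`, `keep_by_position`, `is_empty` and `keep_by_content` are hypothetical names.

```r
# Toy stand-in for the xmlToDataFrame result: data columns alternate
# with empty ones (values are made up)
toy <- data.frame(
  V1 = c("1", "2"),
  V2 = c("", ""),
  V3 = c("Antonio Brown", "Julio Jones"),
  V4 = c("", ""),
  stringsAsFactors = FALSE
)

# Positional selection, as in the answer: columns 1, 3, 5, ...
keep_by_position <- toy[, seq(1, length(toy), by = 2)]

# Content-based alternative: drop any column that is entirely NA or blank
is_empty <- vapply(
  toy,
  function(col) all(is.na(col) | trimws(col) == ""),
  logical(1)
)
keep_by_content <- toy[, !is_empty, drop = FALSE]

names(keep_by_position)  # "V1" "V3"
names(keep_by_content)   # "V1" "V3"
```

The content-based variant is a little more robust if the page layout changes, at the cost of also dropping a real column that happens to be empty in every row.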

Add the column names

xpexpr <- "//table[contains(@class, 'player-table')]/thead/tr/th"
listofTableNodes <- getNodeSet(doc, xpexpr)
listofTableNodes
colnames(df) <- gsub("[\r\n ]*$", '', xmlSApply(listofTableNodes, xmlValue))

head(df)
#   Rank          Player (Team) Pos Bye Best Worst Avg Std Dev ADP vs. ADP
# 1    1     Antonio Brown PIT  WR1   8    1     5 1.3     0.8 1.0     0.0
# 2    2 Odell Beckham Jr. NYG  WR2   8    1     9 3.1     1.6 2.0     0.0
# 3    3       Julio Jones ATL  WR3  11    1     6 3.4     1.1 4.0    +1.0
# 4    4        Todd Gurley LA  RB1   8    1    11 4.5     2.3 3.0    -1.0
# 5    5     David Johnson ARI  RB2   9    1    19 6.1     3.5 6.0    +1.0
# 6    6   Adrian Peterson MIN  RB3   6    1    22 7.6     3.8 5.0    -1.0
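
The column-name step relies on gsub("[\r\n ]*$", '', ...) to strip trailing carriage returns, newlines and spaces from the header cells. A minimal base R sketch of that clean-up on made-up header strings (`raw_headers` and `clean_headers` are hypothetical names):

```r
# Made-up header texts with the kind of trailing whitespace that
# xmlValue can return for <th> cells
raw_headers <- c("Rank", "Player (Team)\r\n    ", "Std Dev \n")

# The character class matches CR, LF and space; $ anchors the match
# to the end of the string, so interior spaces are left alone
clean_headers <- gsub("[\r\n ]*$", "", raw_headers)
clean_headers
# [1] "Rank"          "Player (Team)" "Std Dev"
```

Note that the pattern only trims the end of each string; leading whitespace, if any, would need a separate pattern or trimws().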