Scraping data from HTML

Date: 2015-01-30 16:13:43

Tags: xml r dataframe scrape

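This is the page I'm trying to scrape: http://www.footballlocks.com/nfl_point_spreads_week_1.shtml. I'd like to end up with a simple data.frame with 4 columns so that I can do further analysis. I've tried the XML package but had no luck. Thanks for your help.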

library(XML)

week.1 <- readHTMLTable("http://www.footballlocks.com/nfl_point_spreads_week_1.shtml")
str(week.1)

2 answers:

Answer 0 (score: 3)

rvest can do this. You can use XPath to find all of the 4-column tables:

library(rvest)

url <- "http://www.footballlocks.com/nfl_point_spreads_week_1.shtml"

pg <- read_html(url)

tabs <- pg %>% html_nodes(xpath="//table[@cols='4']")

html_table(tabs[[1]], header=TRUE)

##    Date & Time        Favorite Spread     Underdog
## 1  9/4 8:35 ET      At Seattle   -5.0    Green Bay
## 2  9/7 1:00 ET     New Orleans   -3.0   At Atlanta
## 3  9/7 1:00 ET    At St. Louis   -3.0    Minnesota
## 4  9/7 1:00 ET   At Pittsburgh   -6.0    Cleveland
## 5  9/7 1:00 ET At Philadelphia  -10.0 Jacksonville
## 6  9/7 1:00 ET      At NY Jets   -6.5      Oakland
## 7  9/7 1:00 ET    At Baltimore   -1.0   Cincinnati
## 8  9/7 1:00 ET      At Chicago   -7.0      Buffalo
## 9  9/7 1:00 ET      At Houston   -3.0   Washington
## 10 9/7 1:00 ET  At Kansas City   -3.0    Tennessee
## 11 9/7 1:00 ET     New England   -4.0     At Miami
## 12 9/7 4:25 ET    At Tampa Bay   -4.5     Carolina
## 13 9/7 4:25 ET   San Francisco   -3.5    At Dallas
## 14 9/7 8:30 ET       At Denver   -8.5 Indianapolis
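
To see what the XPath filter does without fetching the live site, here is a minimal sketch that runs the same selection against an inline HTML string. The markup in `page_src` is a hypothetical stand-in made up for illustration (one 2-column table that should be skipped, one 4-column table that should match); the real page's structure may differ.

```r
library(rvest)

# Hypothetical stand-in for the real page: a 2-column table to skip and a
# 4-column data table to keep. This markup is invented for illustration.
page_src <- '
<html><body>
  <table cols="2">
    <tr><td>Menu</td><td>Ads</td></tr>
  </table>
  <table cols="4">
    <tr><td>Date &amp; Time</td><td>Favorite</td><td>Spread</td><td>Underdog</td></tr>
    <tr><td>9/4 8:35 ET</td><td>At Seattle</td><td>-5.0</td><td>Green Bay</td></tr>
    <tr><td>9/7 1:00 ET</td><td>New Orleans</td><td>-3.0</td><td>At Atlanta</td></tr>
  </table>
</body></html>'

# read_html() accepts literal HTML as well as a URL
pg <- read_html(page_src)

# Keep only <table> elements whose cols attribute equals "4"
tabs <- html_nodes(pg, xpath = "//table[@cols='4']")

# header = TRUE uses the first row as column names
spreads <- as.data.frame(html_table(tabs[[1]], header = TRUE))

print(spreads)
```

Because the predicate `[@cols='4']` tests an attribute rather than counting cells, it only matches tables that declare `cols="4"` in their markup.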

If you'd rather kick it old school:

library(XML)

url <- "http://www.footballlocks.com/nfl_point_spreads_week_1.shtml"

doc <- htmlParse(url)

readHTMLTable(doc["//table[@cols='4']"][[1]])

(same output)
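
For the further analysis the question mentions, the scraped table probably needs a little cleanup first. The following is a rough sketch in base R, not part of the original answer. It assumes the "At " prefix marks the home team and that the spread may arrive as text; the column names and the `favorite_home` flag are made up for illustration, and the two rows are typed in by hand from the output shown in this answer.

```r
# Two rows copied by hand from the scraped output, stored as character data
spreads <- data.frame(
  date_time = c("9/4 8:35 ET", "9/7 1:00 ET"),
  favorite  = c("At Seattle", "New Orleans"),
  spread    = c("-5.0", "-3.0"),
  underdog  = c("Green Bay", "At Atlanta"),
  stringsAsFactors = FALSE
)

# Assumption: an "At " prefix marks the home team. Record whether the
# favorite is at home before stripping the prefix from both team columns.
spreads$favorite_home <- grepl("^At ", spreads$favorite)
spreads$favorite <- sub("^At ", "", spreads$favorite)
spreads$underdog <- sub("^At ", "", spreads$underdog)

# Convert the spread from text to numeric so it can be used in calculations
spreads$spread <- as.numeric(spreads$spread)

str(spreads)
```

Keeping the home/away information in its own logical column means the team names can be joined or grouped on directly.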

Answer 1 (score: 0)

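If you want live NFL odds, Pinnacle Sports has an API you can use. That may suit your purpose better than scraping a single week's odds off that web page; it is a commonly used source for football line analysis.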