Trouble web scraping in R with the XML package

Asked: 2014-07-31 14:59:51

Tags: xml r web-scraping

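I have successfully used the XML package to scrape several websites, but I am having trouble creating a data frame from this particular page: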

library(XML)

url <- "http://www.foxsports.com/nfl/injuries?season=2013&seasonType=1&week=1"
df1 <- readHTMLTable(url)

print(df1)

> print(df1)
$`NULL`
NULL

$`NULL`
NULL

$`NULL`
             Player Pos         Injury           Game Status
1       Dickson, Ed  TE          thigh              Probable
2      Jensen, Ryan   C           foot              Doubtful
3     Jones, Arthur  DE        illness                   Out
4   McPhee, Pernell  LB           knee              Probable
5     Pitta, Dennis  TE dislocated hip Injured Reserve (DFR)
6  Thompson, Deonte  WR           foot              Doubtful
7 Williams, Brandon  DT            toe              Doubtful

$`NULL`
           Player Pos        Injury Game Status
1  Anderson, C.J.  RB          knee         Out
2   Ayers, Robert  DE      Achilles    Probable
3   Bailey, Champ  CB          foot         Out
4     Clady, Ryan   T      shoulder    Probable
5  Dreessen, Joel  TE          knee         Out
6    Kuper, Chris   G         ankle    Doubtful
7 Osweiler, Brock  QB left shoulder    Probable
8     Welker, Wes  WR         ankle    Probable

$`NULL`

etc

If I try to force the result into a data frame, I get this error:

> df1 <- data.frame(readHTMLTable(url))
Error in data.frame(`NULL` = NULL, `NULL` = NULL, `NULL` = list(Player = 1:7,  : 
  arguments imply differing number of rows: 0, 7, 8, 6, 9, 1, 11, 4, 12, 5, 21, 3, 2, 15

I would like all of the injury data for all teams (Player, Pos, Injury, Game Status) in a single data frame.

Thanks in advance.

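2 answers:

Answer 0 (score: 2)

You just need to drop the empty (NULL) list elements and the one-column tables that say "No injuries reported", then use do.call with rbind to stack the rest: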

n <- sapply(df1, function(x) !is.null(x) && ncol(x) == 4)
x <- do.call("rbind", df1[n])
rownames(x) <- NULL
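
The filter-and-rbind step can be tried without hitting the website. The following is a minimal sketch that assumes a mock list shaped like the readHTMLTable() output in the question; the names tables, keep and injuries, and the column name V1 for the one-column table, are made up for illustration, and only a few rows from the question's output are included.

```r
# Mock stand-in for the readHTMLTable() result (an assumption, not live data):
# a NULL element, a one-column "No injuries reported" table, and two
# four-column injury tables with a few rows copied from the question.
tables <- list(
  NULL,
  data.frame(Player = c("Dickson, Ed", "Jensen, Ryan"),
             Pos = c("TE", "C"),
             Injury = c("thigh", "foot"),
             `Game Status` = c("Probable", "Doubtful"),
             check.names = FALSE, stringsAsFactors = FALSE),
  data.frame(V1 = "No injuries reported", stringsAsFactors = FALSE),
  data.frame(Player = c("Anderson, C.J.", "Ayers, Robert", "Bailey, Champ"),
             Pos = c("RB", "DE", "CB"),
             Injury = c("knee", "Achilles", "foot"),
             `Game Status` = c("Out", "Probable", "Out"),
             check.names = FALSE, stringsAsFactors = FALSE)
)

# Keep only non-NULL elements that have exactly four columns
keep <- vapply(tables, function(x) !is.null(x) && ncol(x) == 4, logical(1))

# Stack the surviving tables row-wise and reset the row names
injuries <- do.call(rbind, tables[keep])
rownames(injuries) <- NULL

print(injuries)
```

vapply is used here instead of sapply only so that the result is guaranteed to be a logical vector; the selection logic is the same as above.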

Answer 1 (score: 1)

# Packages
require(XML)
require(RCurl)

# URL of interest
url <- "http://www.foxsports.com/nfl/injuries?season=2013&seasonType=1&week=1"

# Parse HTML
doc <- htmlParse(url)

# Tables which are not nulls
df1 <- readHTMLTable(doc)
df.list <- df1[!as.vector(sapply(df1, is.null))]

# Get table names
table.names <- xpathSApply(doc, "//div[@class='wisfb_injuryHeader']", function(x) gsub("^\\s+|\\s+$", "", xmlValue(x)))

# Assign names
names(df.list) <- table.names


# $`San Diego Chargers`
# Player Pos                         Injury Game Status
# 1    Floyd, Malcom  WR                           knee    Probable
# 2   Ingram, Melvin  LB                  Torn left ACL  Day-to-Day
# 3    Liuget, Corey  DE                       shoulder    Probable
# 4  Patrick, Johnny  CB concussion, not injury related    Probable
# 5     Royal, Eddie  WR              chest, concussion    Probable
# 6  Taylor, Brandon   S                           knee    Probable
# 7      Te'o, Manti  LB                           foot         Out
# 8 Wright, Shareece  CB                          chest    Probable
# #[etc.]
Edit: I just saw that the answer @Spacedman gave under @Chris S's answer is essentially the same.
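
Since the question asks for a single data frame, the named list can also be collapsed while keeping the team as a column. This is only a sketch under assumptions: df.list is mocked here with a few rows from the outputs above, "Example Team" is a hypothetical name (the question's output does not show team names), and the Team column and the names with.team and all.injuries are introduced for illustration.

```r
# Mock stand-in for the named df.list built above (an assumption, not live data)
df.list <- list(
  `San Diego Chargers` = data.frame(
    Player = c("Floyd, Malcom", "Te'o, Manti"),
    Pos = c("WR", "LB"),
    Injury = c("knee", "foot"),
    `Game Status` = c("Probable", "Out"),
    check.names = FALSE, stringsAsFactors = FALSE),
  `Example Team` = data.frame(   # hypothetical team name
    Player = c("Welker, Wes", "Bailey, Champ"),
    Pos = c("WR", "CB"),
    Injury = c("ankle", "foot"),
    `Game Status` = c("Probable", "Out"),
    check.names = FALSE, stringsAsFactors = FALSE)
)

# Prepend the team name (taken from the list names) to every table
with.team <- Map(
  function(tbl, team) {
    data.frame(Team = team, tbl, check.names = FALSE, stringsAsFactors = FALSE)
  },
  df.list, names(df.list)
)

# Stack all teams into one data frame; unname() avoids team-prefixed row names
all.injuries <- do.call(rbind, unname(with.team))
rownames(all.injuries) <- NULL

print(all.injuries)
```

On the real page, any one-column "No injuries reported" tables would need to be filtered out first (for example with the ncol(x) == 4 check from the other answer), otherwise rbind will fail on mismatched columns.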