I have a function that parses a row of an HTML table:
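xmlToCsv <- function(xml) {
  # Turn the blank lines between cells into tab separators
  a <- gsub('\n\n','\t', xmlValue(xml))
  # Put a space between doubled tabs so an empty cell keeps its place
  b <- gsub('\t\t','\t \t', a)
  # Collapse any doubled tabs that remain
  d <- gsub('\t\t','\t', b)
  # Strip a leading space or a trailing tab
  e <- gsub('^ |\t$','', d)
  # Drop the space that follows each tab
  f <- gsub('\t ','\t', e)
  # Intended column classes (defined but not used below)
  cc <- c("numeric", "numeric", "character", "character", "character", "character", "character", "character", "character")
  cn <- c("EngNumber", "JpNumber", "Icon", "EngSet", "JpSet", "EngCardCount", "JpCardCount", "EngDate", "JpDate")
  g <- read.table(text=f, sep="\t", header=FALSE)
  colnames(g) <- cn
  # Keep only the English columns of interest
  keeps <- c("EngNumber", "EngSet", "EngCardCount")
  return(g[keeps])
}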
I use the function like this:
library(RCurl)
library(XML)
theurl <- "http://bulbapedia.bulbagarden.net/wiki/List_of_Pok%C3%A9mon_Trading_Card_Game_expansions"
webpage <- getURL(theurl)
webpage <- readLines(tc <- textConnection(webpage)); close(tc)
pagetree <- htmlTreeParse(webpage, error=function(...){}, useInternalNodes = TRUE)
tr <- getNodeSet(pagetree, "//*/tr")
sets <- tr[4:length(tr)-1]
dsets <- sapply(sets, xmlToCsv)
The sets variable holds each row of the HTML table I'm interested in. For example, sets[1] looks like:
sets[1]
[[1]]
<tr><th> 1
</th>
<td> 1
</td>
<td>
</td>
<td> <a href="/wiki/Base_Set_(TCG)" title="Base Set (TCG)">Base Set</a>
</td>
<td> Expansion Pack
</td>
<td> 102
</td>
<td> 102
</td>
<td> January 9, 1999
</td>
<td> October 20, 1996
</td></tr>
The contents of dsets are more complicated than I know what to do with. I would like a data frame that looks like this:
EngNumber Icon EngSet EngCardCount
1 58 NA Legendary Treasures 138
2 61 NA Furious Fists 113
Am I close? Do I have the right approach? I'm very new to R and would appreciate any help.
Answer 0 (score: 1)
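I think your problem comes from using sapply instead of lapply. sapply tries to simplify its result, whereas you want to rbind the value returned by each call to xmlToCsv. You can get a data.frame with lapply and do.call: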
dsets <- lapply(sets, xmlToCsv)
do.call(rbind,dsets)
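To see the difference without fetching the page, here is a minimal sketch using only base R. The function make_row is a hypothetical stand-in for xmlToCsv (it is not part of the question's code) and simply returns a one-row data frame with made-up values, so the exact shapes you see on the real data may differ.

```r
# Hypothetical stand-in for xmlToCsv: one input -> a one-row data frame.
# The set names and card counts here are invented for illustration only.
make_row <- function(i) {
  data.frame(EngNumber = i,
             EngSet = paste("Set", i),
             EngCardCount = 100 + i,
             stringsAsFactors = FALSE)
}

ids <- 1:3

# sapply tries to simplify: the one-row data frames are flattened
# into a matrix whose cells are list elements, not a data frame.
simplified <- sapply(ids, make_row)
is.data.frame(simplified)
is.matrix(simplified)

# lapply keeps a plain list of data frames, which rbind can stack row by row.
rows <- lapply(ids, make_row)
combined <- do.call(rbind, rows)
str(combined)
```

In this sketch, combined is an ordinary data frame with one row per input, which is the shape asked for in the question.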