R: Parsing an HTML table results in factors instead of characters

Time: 2014-10-29 01:18:19

Tags: r

I have a function that parses one row of an HTML table:

xmlToCsv <- function(xml) {
    a <- gsub('\n\n','\t', xmlValue(xml))
    b <- gsub('\t\t','\t \t', a)
    d <- gsub('\t\t','\t', b)
    e <- gsub('^ |\t$','', d)
    f <- gsub('\t ','\t', e)
    cc <- c("numeric", "numeric", "character", "character", "character", "character", "character", "character", "character")
    cn <- c("EngNumber", "JpNumber", "Icon", "EngSet", "JpSet", "EngCardCount", "JpCardCount", "EngDate", "JpDate")
    g <- read.table(text=f, sep="\t", header=FALSE)
    colnames(g) <- cn
    keeps <- c("EngNumber", "EngSet", "EngCardCount")
    return(g[keeps])
}

I use the function like this:

library(RCurl)
library(XML)

theurl <- "http://bulbapedia.bulbagarden.net/wiki/List_of_Pok%C3%A9mon_Trading_Card_Game_expansions"
webpage <- getURL(theurl)
webpage <- readLines(tc <- textConnection(webpage)); close(tc)
pagetree <- htmlTreeParse(webpage, error=function(...){}, useInternalNodes = TRUE)
tr <- getNodeSet(pagetree, "//*/tr")
sets <- tr[4:length(tr)-1]
dsets <- sapply(sets, xmlToCsv)

The sets variable holds each row of the HTML table I am interested in. For example, sets[1] looks like this:

sets[1]
[[1]]
<tr><th> 1
</th>
<td> 1
</td>
<td>
</td>
<td> <a href="/wiki/Base_Set_(TCG)" title="Base Set (TCG)">Base Set</a>
</td>
<td> Expansion Pack
</td>
<td> 102
</td>
<td> 102
</td>
<td> January 9, 1999
</td>
<td> October 20, 1996
</td></tr> 

The contents of dsets are more complicated than I know how to handle. I want a data frame that looks like this:

  EngNumber Icon              EngSet EngCardCount
1        58   NA Legendary Treasures          138
2        61   NA       Furious Fists          113

Am I close? Do I have the right approach? I am very new to R and would appreciate any help.

1 answer:

Answer 0 (score: 1)

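I think your problem comes from using sapply instead of lapply. sapply tries to simplify your result, whereas you want to rbind every value returned by xmlToCsv. You can get a data.frame using lapply and do.call:

dsets <- lapply(sets, xmlToCsv)
do.call(rbind, dsets)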
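
To make the difference concrete, here is a minimal sketch that needs no network access. `parse_row` is a hypothetical stand-in for xmlToCsv, and the three tab-separated strings are sample rows built from values shown in the question; neither is part of the original code. The sketch also passes `stringsAsFactors = FALSE` to read.table, which is one way to keep text columns as character rather than factor (this is already the default from R 4.0.0 onward, as far as I know).

```r
# Hypothetical stand-in for xmlToCsv(): turns one tab-separated string
# into a one-row data frame. Names and sample data are illustrative only.
parse_row <- function(line) {
  read.table(text = line, sep = "\t", header = FALSE,
             col.names = c("EngNumber", "EngSet", "EngCardCount"),
             stringsAsFactors = FALSE)  # keep text columns as character
}

rows <- c("1\tBase Set\t102",
          "58\tLegendary Treasures\t138",
          "61\tFurious Fists\t113")

# sapply() tries to simplify the list of one-row data frames. The result
# is a matrix whose cells are lists, not a data frame.
simplified <- sapply(rows, parse_row)
is.data.frame(simplified)

# lapply() keeps one data frame per input row, and do.call(rbind, ...)
# stacks them into a single data frame.
parts <- lapply(rows, parse_row)
result <- do.call(rbind, parts)
str(result)
```

Calling `str(result)` should show EngSet as a character column, which is what the title of the question is asking for. Whether the same settings suit the real Bulbapedia table depends on its contents, so treat this as a starting point rather than a drop-in replacement.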