Question

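I am trying to scrape/extract data from a single HTML table at http://www.theplantlist.org/tpl/record/kew-419248 and from many very similar pages. I initially tried the following function to read the table, but it is not ideal because I want to split each species name into its component parts (genus/species/infraspecies/author, etc.).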

library(XML)
readHTMLTable("http://www.theplantlist.org/tpl/record/kew-419248")

I used SelectorGadget to identify a unique XPath (not necessarily the shortest) for each table element I want to extract:

For genus names: //*[contains(concat( " ", @class, " " ), concat( " ", "synonym", " " ))]//*[contains(concat( " ", @class, " " ), concat( " ", "genus", " " ))]

For species names: //*[contains(concat( " ", @class, " " ), concat( " ", "synonym", " " ))]//*[contains(concat( " ", @class, " " ), concat( " ", "species", " " ))]

For infraspecies rank: //*[contains(concat( " ", @class, " " ), concat( " ", "infraspr", " " ))]

For infraspecies name: //*[contains(concat( " ", @class, " " ), concat( " ", "infraspe", " " ))]

For confidence level (image): //*[contains(concat( " ", @class, " " ), concat( " ", "synonyms", " " ))]//img

For source: //*[contains(concat( " ", @class, " " ), concat( " ", "source", " " ))]//a

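I now want to extract the information into a data frame/table.

I tried using the xpathSApply function from the XML package to extract some of the data,

e.g. for the infraspecies rank: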

library(XML)
library(RCurl)
infraspeciesrank = htmlParse(getURL("http://www.theplantlist.org/tpl/record/kew-419248"))
path=' //*[contains(concat( " ", @class, " " ), concat( " ", "infraspr", " " ))]'
xpathSApply(infraspeciesrank, path)

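However, this approach is problematic because of gaps in the data: only some rows of the table have an infraspecies rank, so all I get back is a list of the three ranks that occur in the table, with no gaps to show which rows they belong to. The output is also of a class that I could not append to a data frame.

Does anyone know a better way to extract the information from this table into a data frame?

Any help would be much appreciated!

Tom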

Answer 1

Here is another solution, which splits each species name into its component parts.

library(XML)
library(plyr)

# read url into html tree
url = "http://www.theplantlist.org/tpl/record/kew-419248"
doc = htmlTreeParse(url, useInternalNodes = T)

# extract nodes containing desired information
xp_expr = "//table[@class= 'names synonyms']/tbody/tr"
nodes = getNodeSet(doc, xp_expr)

# function to extract desired fields from a given node    
fields = list('genus', 'species', 'infraspe', 'authorship')
read_node = function(node){

    dl = lapply(fields, function(x) xpathSApply(node, 
       paste(".//*[@class = ", "'", x, "'", "]", sep = ""), xmlValue))
    tmp = rep(' ', length(dl))
    tmp[sapply(dl, length) == 1] = unlist(dl)
    confidence = xpathSApply(node, './/img', xmlGetAttr, 'alt')
    return(c(tmp, confidence))
}

# apply function to all nodes and return data frame
df = ldply(nodes, read_node)
names(df) = c(fields, 'confidence')

It produces the following output:

 genus      species     infraspe                      authorship confidence
1 Critesion     chilense              (Roem. & Schult.) Ã\u0081.LÃ¶ve          H
2   Hordeum     chilense     chilense                                          L
3   Hordeum  cylindricum                                       Steud.          H
4   Hordeum depauperatum                                       Steud.          H
5   Hordeum     pratense brongniartii                       Macloskie          L
6   Hordeum    secalinum     chilense                   Ã\u0089.Desv.          L

Answer 2

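The following code parses your table into a matrix.

Caveats:

The confidence level column is blank because it is an image rather than text. If this matters, you should be able to retrieve the image location and parse it.
There are some encoding issues (UTF-8 characters are converted to ASCII on my machine). I don't yet know how to fix this.

Code: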

library(XML)
library(RCurl)

baseURL <- "http://www.theplantlist.org/tpl/record/kew-419248"
txt <- getURL(url=baseURL)

xmltext <- htmlParse(txt, asText=TRUE)
xmltable <- xpathApply(xmltext, "//table//tbody//tr")
t(sapply(xmltable, function(x)unname(xmlSApply(x, xmlValue))[c(1, 3, 5, 7)]))

Result:

     [,1]                                                [,2]      [,3] [,4]  
[1,] "Critesion chilense (Roem. & Schult.) Ã.LÃ¶ve" "Synonym" ""   "WCSP"
[2,] "Hordeum chilense var. chilense "                   "Synonym" ""   "TRO" 
[3,] "Hordeum cylindricum Steud. [Illegitimate]"         "Synonym" ""   "WCSP"
[4,] "Hordeum depauperatum Steud."                       "Synonym" ""   "WCSP"
[5,] "Hordeum pratense var. brongniartii Macloskie"      "Synonym" ""   "WCSP"
[6,] "Hordeum secalinum var. chilense Ã.Desv."        "Synonym" ""   "WCSP"
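The two caveats might be addressed along the following lines. This is only a sketch on a hypothetical inline fragment: the image paths and alt texts are invented for illustration, and passing encoding = "UTF-8" to htmlParse assumes the page really is served as UTF-8, which I have not verified for the real site.

```r
library(XML)

# Hypothetical fragment; the real page's markup and image paths may differ
txt <- '<table><tbody>
<tr><td>Hordeum cylindricum Steud.</td><td>Synonym</td><td><img src="/img/H.png" alt="H"/></td><td>WCSP</td></tr>
<tr><td>Hordeum chilense var. chilense</td><td>Synonym</td><td><img src="/img/L.png" alt="L"/></td><td>TRO</td></tr>
</tbody></table>'

# Declaring the encoding explicitly (assumption: the source is UTF-8)
doc  <- htmlParse(txt, asText = TRUE, encoding = "UTF-8")
rows <- getNodeSet(doc, "//table//tr")

# One matrix row per table row, one column per <td>
cells <- t(sapply(rows, function(r) xpathSApply(r, "./td", xmlValue)))

# Fill the blank image column from the image's alt attribute
cells[, 3] <- sapply(rows, function(r) xpathSApply(r, ".//img", xmlGetAttr, "alt"))

# The image location itself is available from the src attribute
img_src <- sapply(rows, function(r) xpathSApply(r, ".//img", xmlGetAttr, "src"))
print(cells)
```

This assumes every row contains exactly one image; rows without one would need the same kind of length check as any other optional field.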

Parsing an HTML table using the XML / RCurl R packages, without using the readHTMLTable function