Scraping a complex HTML table into a data.frame in R

Time: 2015-01-08 15:28:44

Tags: r rvest

I'm trying to load the Wikipedia data on the Justices of the Supreme Court of the United States into R:

library(rvest)

html = html("http://en.wikipedia.org/wiki/List_of_Justices_of_the_Supreme_Court_of_the_United_States")
judges = html_table(html_nodes(html, "table")[[2]])
head(judges[,2])

[1] "Wilson, JamesJames Wilson"       "Jay, JohnJohn Jay†"             
[3] "Cushing, WilliamWilliam Cushing" "Blair, JohnJohn Blair, Jr."     
[5] "Rutledge, JohnJohn Rutledge"     "Iredell, JamesJames Iredell"  

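The problem is that the data is malformed. Instead of the name as it appears in the rendered HTML table ("James Wilson"), each name actually appears twice: once as "Lastname, Firstname" and then again as "Firstname Lastname".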

The reason is that each <td> actually contains an invisible <span>:

<td style="text-align:left;" class="">
    <span style="display:none" class="">Wilson, James</span>
    <a href="/wiki/James_Wilson" title="James Wilson">James Wilson</a>
</td>

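The same is true for the columns with numeric data. I guess this extra markup is needed to make the HTML table sortable. However, it is unclear to me how to remove these spans when creating a data.frame from the table in R.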

2 answers:

Answer 0: (score: 9)

Maybe something like this:

library(XML)
library(rvest)
html = html("http://en.wikipedia.org/wiki/List_of_Justices_of_the_Supreme_Court_of_the_United_States")
judges = html_table(html_nodes(html, "table")[[2]])
head(judges[,2])
# [1] "Wilson, JamesJames Wilson"       "Jay, JohnJohn Jay†"              "Cushing, WilliamWilliam Cushing" "Blair, JohnJohn Blair, Jr."     
# [5] "Rutledge, JohnJohn Rutledge"     "Iredell, JamesJames Iredell"

removeNodes(getNodeSet(html, "//table/tr/td[2]/span"))
judges = html_table(html_nodes(html, "table")[[2]])
head(judges[,2])
# [1] "James Wilson"    "John Jay†"       "William Cushing" "John Blair, Jr." "John Rutledge"   "James Iredell" 
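
The code above relies on early rvest releases, where html() returned a document from the XML package. Since rvest 0.3.0 the package is built on xml2 and html() has been replaced by read_html(), so XML::removeNodes() no longer applies to the parsed page. A rough equivalent for current versions might look like the sketch below. It assumes rvest 1.0 or later, and it uses a small inline table (copied from the snippet in the question, not the live Wikipedia page) so that it runs without network access; the variable names page and hidden are only illustrative.

```r
library(rvest)
library(xml2)

# Inline stand-in for one row of the Wikipedia table. The structure is an
# assumption based on the <td> snippet shown in the question.
page <- read_html('
<table>
  <tr><th>#</th><th>Name</th></tr>
  <tr>
    <td>1</td>
    <td><span style="display:none">Wilson, James</span><a href="/wiki/James_Wilson">James Wilson</a></td>
  </tr>
</table>')

# Select every hidden sort-key span that sits directly inside a table cell.
# "//table//td" is used instead of "//table/tr/td" so the query also works
# when the parser or the page wraps the rows in a <tbody>.
hidden <- xml_find_all(page, "//table//td/span[contains(@style, 'display:none')]")

# Detach those nodes from the document; the change is made in place.
xml_remove(hidden)

# With the spans gone, html_table() only sees the visible text.
judges <- html_table(html_element(page, "table"))
judges$Name
```

Because xml_remove() modifies the document in place, the spans have to be removed before html_table() is called, which mirrors the order of the steps in the answer.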

Answer 1: (score: 4)

You can use rvest:

library(rvest)

html("http://en.wikipedia.org/wiki/List_of_Justices_of_the_Supreme_Court_of_the_United_States")%>%   
  html_nodes("span+ a") %>% 
  html_text()

It's not perfect, so you may want to refine the CSS selector, but it gets you fairly close.
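
One possible refinement, sketched here on an inline snippet rather than the live page, is to restrict the match to links inside table cells. The selector "td > span + a" is only a suggestion and not part of the original answer. The sketch assumes rvest 1.0 or later (for html_elements()) and made-up sample markup.

```r
library(rvest)

# Made-up sample markup: one unrelated span/link pair outside the table,
# plus two cells shaped like the one shown in the question.
page <- read_html('
<p><span>note</span><a href="#">unrelated link</a></p>
<table>
  <tr><td><span style="display:none">Wilson, James</span><a href="/wiki/James_Wilson">James Wilson</a></td></tr>
  <tr><td><span style="display:none">Jay, John</span><a href="/wiki/John_Jay">John Jay</a></td></tr>
</table>')

# "span + a" matches any link that directly follows a span, so it also
# picks up the link in the paragraph above the table.
loose <- page %>% html_elements("span + a") %>% html_text()

# Requiring the span to be a direct child of a <td> drops that extra match.
strict <- page %>% html_elements("td > span + a") %>% html_text()
```

Note that this approach returns a character vector for a single column of names, not a data.frame of the whole table.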