Question

I am parsing an HTML document with R. It looks like this:

<table>
<b>title1</b>
<tr>row1</tr>
</table>
<table>
<b>title2</b>
<tr>row2</tr>
<tr>row3</tr>
</table>

I want to parse this HTML document with R so that I end up with a table like this:

title    |    value
title1   |    row1
title2   |    row2
title2   |    row3

I have tried this code:

library(XML)
doc <- htmlParse(html_document)
# All titles and all row values, each collected across the whole document
titles <- sapply(getNodeSet(doc, "//table//b"), xmlValue)
values <- sapply(getNodeSet(doc, "//table//tr"), xmlValue)

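But it doesn't work: titles holds 2 values (title1 and title2) while values holds 3 (row1, row2 and row3), so I cannot associate row1 with title1, and row2 and row3 with title2.

I'm sure there is a solution, but I can't find it. Can you help me? Thanks.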

Answer 1

Here is a very ugly answer. It does the pairing in a loop rather than in XPath, but... it works:

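tabledir <- getNodeSet(doc, "//table/*")
# returns a list of all the element nodes inside the successive <table>
parsed <- matrix(nrow = 0, ncol = 2, dimnames = list(NULL, c("title", "value")))
indice <- NULL
for (i in tabledir) {
  if (xmlName(i) == "b") {  # a "title" node: remember it for the rows that follow
    indice <- xmlValue(i)
  } else {                  # a row node: pair it with the most recent title
    valeur <- c(indice, xmlValue(i))
    parsed <- rbind(parsed, valeur)  # append at the end to keep document order
  }
}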

Ugly, isn't it? I'm still sure there is a way to do this with XPath.
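
One possible XPath-oriented variant, offered only as a sketch: query each <table> separately with relative XPath expressions, so that titles and rows stay grouped by the table they came from. It assumes the XML package, the sample HTML from the question, and that every table contains exactly one <b> and at least one <tr>. The names tables, pieces and result are illustrative.

```r
library(XML)

# Sample input copied from the question (assumed to be representative).
html_document <- "<table><b>title1</b><tr>row1</tr></table>
<table><b>title2</b><tr>row2</tr><tr>row3</tr></table>"
doc <- htmlParse(html_document, asText = TRUE)

# One node per <table>; the relative paths below are evaluated inside each one.
tables <- getNodeSet(doc, "//table")

pieces <- lapply(tables, function(tbl) {
  title <- xpathSApply(tbl, ".//b", xmlValue)   # the title of this table
  rows  <- xpathSApply(tbl, ".//tr", xmlValue)  # the rows of this table
  # Repeat the title once per row so both columns have the same length.
  data.frame(title = rep(title, length(rows)), value = rows,
             stringsAsFactors = FALSE)
})

# Stack the per-table data frames into a single table.
result <- do.call(rbind, pieces)
print(result)
```

The leading dot in ".//b" and ".//tr" is what keeps each query inside the current table. Without it, "//b" would search the whole document again.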

Answer 2

Maybe this will help you:

//table//*[name()='b' or name()='tr']

It returns:

Element='<b>title1</b>'
Element='<tr>row1</tr>'
Element='<b>title2</b>'
Element='<tr>row2</tr>'
Element='<tr>row3</tr>'
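
To show how that expression might be used from R, here is a minimal sketch. Because the nodes come back in document order, each row can be paired with the most recent title seen before it. The sketch assumes the XML package, the sample HTML from the question, and that the first matched node is a <b> title. The names nodes, tags, texts and result are illustrative.

```r
library(XML)

# Sample input copied from the question (assumed to be representative).
html_document <- "<table><b>title1</b><tr>row1</tr></table>
<table><b>title2</b><tr>row2</tr><tr>row3</tr></table>"
doc <- htmlParse(html_document, asText = TRUE)

# Titles and rows together, in document order.
nodes <- getNodeSet(doc, "//table//*[name()='b' or name()='tr']")
tags  <- sapply(nodes, xmlName)   # "b" or "tr" for each node
texts <- sapply(nodes, xmlValue)  # text content of each node

is_title  <- tags == "b"
# Running count of titles: for every node, the index of the latest title so far.
title_idx <- cumsum(is_title)

result <- data.frame(
  title = texts[is_title][title_idx[!is_title]],  # latest title for each row
  value = texts[!is_title],
  stringsAsFactors = FALSE
)
print(result)
```

The cumsum step replaces an explicit loop: a row's title is simply the last <b> that appeared before it in the node list.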

How to associate column titles with column values in an HTML document with R and XPath?
