How to extract a parent-child hierarchy from HTML without node IDs in R

Asked: 2017-02-01 20:03:32

Tags: html r parsing hierarchy

I have an HTML file exported from a system that holds a hierarchy of entities, and HTML is the only format in which they can export the hierarchy to us.

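I am trying to extract the entity names and use either the "margin-left:" value or the tree structure to tell me when the indentation changes. Either way, I need to know when the parent level changes.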
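The step from indentation to parent can be sketched on its own, before any HTML parsing. This is a minimal sketch, assuming the rows are listed in document order and that a row's parent is the nearest earlier row with a smaller indent. The `name` and `indent` vectors below are hypothetical stand-ins, not values read from the file.

```r
# Hypothetical entity names and indent levels (in cm), in document order.
name   <- c("Top", "First", "Second", "Third", "Fourth")
indent <- c(0, 1, 1, 2, 3)

# For each row, find the nearest earlier row with a strictly smaller indent.
# Rows with no such predecessor (the root) get NA.
parent_idx <- vapply(seq_along(indent), function(i) {
  earlier <- which(indent[seq_len(i - 1)] < indent[i])
  if (length(earlier) == 0) NA_integer_ else max(earlier)
}, integer(1))

# Combine names, indents and parent names into one table.
hier <- data.frame(
  name      = name,
  indent_cm = indent,
  parent    = name[parent_idx],
  stringsAsFactors = FALSE
)
print(hier)
```

Comparing each row against all earlier rows is quadratic in the number of rows, which should be fine for an export of modest size. A stack of open ancestors would be the usual alternative for very long files.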

I have tried using XML, xml2, RCurl and selectr to break it down into nodes, but with no luck.

When displayed in a browser, the file they can give us looks like this: [image]

The HTML code they can give us is as follows:



<HTML><TITLE>My Hierarchy Title</TITLE><HEAD></HEAD><BODY style="background-color:white"><style type="text/css">
	thead {display:table-row-group;}
	tfoot {display:table-row-group;}
	</style>
<div style="color:000000;font-family:Arial;font-size:10;text-align:left;" border="1pt">
<table cellspacing="0" style="" border="0">
<thead style="color:000000;background-color:FFFF00;font-family:Arial;font-size:10;font-weight:bold;text-align:center;" border="1pt"></thead>
<tbody style="color:000000;background-color:FFFFFF;font-family:Arial;font-size:10;text-align:left;" border="1pt">
<tr>
<td style="padding:20px">
<div style="color:000000;background-color:FFFFFF;font-family:Arial;font-size:10;text-align:center;" border="1pt"><img src="C:\Local\Temp\Icon1000.bmp"></div>
</td>
<td style="padding:20px">
<div>Top Level Value - Not Indented - I want to Read into R</div>
</td>
</tr>
</tbody>
<tfoot style="color:000000;font-family:Arial;font-size:10;text-align:left;" border="1pt"></tfoot>
</table>
</div>
<div style="margin-left:1cm;">
<table cellspacing="0" style="" border="0">
<thead style="color:000000;background-color:FFFF00;font-family:Arial;font-size:10;font-weight:bold;text-align:center;" border="1pt"></thead>
<tbody style="color:000000;background-color:FFFFFF;font-family:Arial;font-size:10;text-align:left;" border="1pt">
<tr>
<td style="padding:20px">
<div style="color:000000;background-color:FFFFFF;font-family:Arial;font-size:10;text-align:center;" border="1pt"><img src="C:\Local\Temp\Icon1001.bmp"></div>
</td>
<td style="padding:20px">
<div>First Indented Value - 1 cm Indent - Parented by Top Level Value I want to read into R</div>
</td>
</tr>
</tbody>
<tfoot style="color:000000;font-family:Arial;font-size:10;text-align:left;" border="1pt"></tfoot>
</table>
</div>
<div style="margin-left:1cm;">
<table cellspacing="0" style="" border="0">
<thead style="color:000000;background-color:FFFF00;font-family:Arial;font-size:10;font-weight:bold;text-align:center;" border="1pt"></thead>
<tbody style="color:000000;background-color:FFFFFF;font-family:Arial;font-size:10;text-align:left;" border="1pt">
<tr>
<td style="padding:20px">
<div style="color:000000;background-color:FFFFFF;font-family:Arial;font-size:10;text-align:center;" border="1pt"><img src="C:\Local\Temp\Icon1002.bmp"></div>
</td>
<td style="padding:20px">
<div>Second Indented value also - 1 cm Indent - Parented by Top Level Value also to read into R</div>
</td>
</tr>
</tbody>
<tfoot style="color:000000;font-family:Arial;font-size:10;text-align:left;" border="1pt"></tfoot>
</table>
</div>
<div style="margin-left:2cm;">
<table cellspacing="0" style="" border="0">
<thead style="color:000000;background-color:FFFF00;font-family:Arial;font-size:10;font-weight:bold;text-align:center;" border="1pt"></thead>
<tbody style="color:000000;background-color:FFFFFF;font-family:Arial;font-size:10;text-align:left;" border="1pt">
<tr>
<td style="padding:20px">
<div style="color:000000;background-color:FFFFFF;font-family:Arial;font-size:10;text-align:center;" border="1pt"><img src="C:\Local\Temp\Icon1003.bmp"></div>
</td>
<td style="padding:20px">
<div>Third Value - 2 cm Indent - Parented by Second Indented Value seen above</div>
</td>
</tr>
</tbody>
<tfoot style="color:000000;font-family:Arial;font-size:10;text-align:left;" border="1pt"></tfoot>
</table>
</div>
<div style="margin-left:3cm;">
<table cellspacing="0" style="" border="0">
<thead style="color:000000;background-color:FFFF00;font-family:Arial;font-size:10;font-weight:bold;text-align:center;" border="1pt"></thead>
<tbody style="color:000000;background-color:FFFFFF;font-family:Arial;font-size:10;text-align:left;" border="1pt">
<tr>
<td style="padding:20px">
<div style="color:000000;background-color:FFFFFF;font-family:Arial;font-size:10;text-align:center;" border="1pt"><img src="C:\Local\Temp\Icon1004.bmp"></div>
</td>
<td style="padding:20px">
<div>Fourth Value - 3 cm Indent - Parented by Third Value</div>
</td>
</tr>
</tbody>
<tfoot style="color:000000;font-family:Arial;font-size:10;text-align:left;" border="1pt"></tfoot>
</table>
</div>
<div style="margin-left:7cm;">
</BODY></HTML>

My sample code so far looks like this:

library(xml2)
library(XML)
library(selectr)

ml <- read_html("~/R/Management Lines/rusample.htm")
ml.ls <- as_list(ml)
# list of length 2 returned - discard 1st list - keep second
ml.ls <- ml.ls[[2]]

# extract management line names from list object
divs <- seq(from=2, to=11, by=2)
for(i in 1:length(divs)){
  print(ml.ls[[divs[i]]][[2]][[2]]$tr[[3]]$div[[1]])
}

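It returns the names fine, but I cannot figure out how to extract the indentation and cbind it with these results, so that each name is paired with its indent level: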

> divs <- seq(from=2, to=11, by=2)
> for(i in 1:length(divs)){
+   print(ml.ls[[divs[i]]][[2]][[2]]$tr[[3]]$div[[1]])
+ }
[1] "Top Level Value - Not Indented - I want to Read into R"
[1] "First Indented Value - 1 cm Indent - Parented by Top Level Value I want to read into R"
[1] "Second Indented value also - 1 cm Indent - Parented by Top Level Value also to read into R"
[1] "Third Value - 2 cm Indent - Parented by Second Indented Value seen above"
[1] "Fourth Value - 3 cm Indent - Parented by Third Value"
> 
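One possible way to get the names and the indents side by side is to select the outer wrapper `<div>` elements with XPath and read their `style` attribute, instead of indexing into the `as_list()` result. The following is a minimal sketch under stated assumptions: the inlined `html` string is a trimmed-down, hypothetical stand-in for the exported file, every entity sits in a `<div>` that directly contains a `<table>`, the name is in the `<div>` of the second `<td>`, and a wrapper without `margin-left` counts as indent 0.

```r
library(xml2)

# Assumption: a shortened stand-in for the export, inlined so the sketch is
# self-contained. The real file would be read with read_html(<path>).
html <- '<html><body>
<div style="color:000000;"><table><tbody><tr>
  <td><div><img src="a.bmp"></div></td>
  <td><div>Top</div></td></tr></tbody></table></div>
<div style="margin-left:1cm;"><table><tbody><tr>
  <td><div><img src="b.bmp"></div></td>
  <td><div>Child A</div></td></tr></tbody></table></div>
<div style="margin-left:2cm;"><table><tbody><tr>
  <td><div><img src="c.bmp"></div></td>
  <td><div>Grandchild</div></td></tr></tbody></table></div>
</body></html>'

doc <- read_html(html)

# Outer wrappers: every <div> that has a <table> as a direct child.
outer <- xml_find_all(doc, "//div[table]")

# Entity name: text of the <div> inside the second <td> of each wrapper.
name <- xml_text(xml_find_first(outer, ".//td[2]/div"), trim = TRUE)

# Indent: the number in "margin-left:<n>cm" from the wrapper's style.
style <- xml_attr(outer, "style")
style[is.na(style)] <- ""
m <- regmatches(style, regexec("margin-left:[[:space:]]*([0-9.]+)cm", style))
indent <- vapply(m, function(x) if (length(x) == 2) as.numeric(x[2]) else 0,
                 numeric(1))

hier <- data.frame(name = name, indent_cm = indent, stringsAsFactors = FALSE)
print(hier)
```

Because the name and the indent are both taken from the same `outer` node set, the two columns stay aligned row by row, and a trailing wrapper with no table inside it is simply not selected.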

Help! This is driving me crazy, because I know it can't be that hard.

0 answers:

No answers yet