This is my table:
<table class="DataRows" frame="myFrames" rules="Standard" width="100%">
<colgroup><col width="70" align="CENTER">
<col width="200" align="LEFT">
<col width="80" align="LEFT">
<col align="LEFT">
<col align="RIGHT">
</colgroup><thead>
<col width="70" align="CENTER">
<col width="200" align="LEFT">
<col width="80" align="LEFT">
<col align="LEFT">
<col align="RIGHT">
<thead>
<tr>
<td valign="TOP"><span class="classicBold"> 20 </span> Kg.
<td class="BOLD" valign="TOP" nowrap="">
PA Passion Foods Inc.
<td class="BOLD">Fax:
<td>
222-555666
<td class="BOLD">
Processed foods and juices
<tr>
<td><a target="_blank" href="">See on Map </a>
<td>
120 NW 157TH AVE
<td class="BOLD">Warehouse Hours:
<td colspan="2">
<tr>
<td>
<td><span class="BOLD">
Jacksonville,
</span>
FL 300000
<td class="BOLD">Url:
<td colspan="2">
<a target="_blank" href="">PA Passion</a>
  
<span class="BOLD">E-mail:</span>
zoro@xyz.com
<tr>
<td>
<td class="REDBOLD" colspan="4">
<tr>
<td>
<td colspan="4" align="LEFT">Franchisee for:<span class="BOLD">
Nutrella
</span>
<tr>
<td>
<td colspan="4" align="LEFT">Franchisee for:<span class="BOLD">
APPLE Foods, Constants
</span>
<tr>
<td>
<td colspan="4" align="LEFT"><span class="BOLD">
</span>
<tr>
<td>
<td colspan="4" align="LEFT">We service:<span class="BOLD">
All occasions and hospitality services
</span>
<tr>
<td>
<td colspan="4" align="LEFT">We sell :<span class="BOLD">
----
</span>
</td></td></tr></td></td></tr></td></td></tr></td></td></tr></td></td></tr></td></td></tr></td></td></td></td></tr></td></td></td></td></tr></td></td></td></td></td></tr>
</thead>
</table>
I use the following code to loop over every matching table node in the HTML document:
foreach (HtmlNode node in htmlAgilityPackDoc.DocumentNode.SelectNodes("//table[contains(@class,'DataRows')]"))
{
}
When I use the following:
node.SelectSingleNode(".//tr[1]/td[1]").InnerHtml
I get the following HTML:
<span class="classicBold"> 20 </span> Kg.
<td class="BOLD" valign="TOP" nowrap="">
PA Passion Foods Inc.
<td class="BOLD">Fax:
<td>
222-555666
<td class="BOLD">
Processed foods and juices
<tr>
<td><a target="_blank" href="">See on Map </a>
<td>
120 NW 157TH AVE
<td class="BOLD">Warehouse Hours:
<td colspan="2">
<tr>
<td>
<td><span class="BOLD">
Jacksonville,
</span>
FL 300000
<td class="BOLD">Url:
<td colspan="2">
<a target="_blank" href="">PA Passion</a>
  
<span class="BOLD">E-mail:</span>
zoro@xyz.com
<tr>
<td>
<td class="REDBOLD" colspan="4">
<tr>
<td>
<td colspan="4" align="LEFT">Franchisee for:<span class="BOLD">
Nutrella
</span>
<tr>
<td>
<td colspan="4" align="LEFT">Franchisee for:<span class="BOLD">
APPLE Foods, Constants
</span>
<tr>
<td>
<td colspan="4" align="LEFT"><span class="BOLD">
</span>
<tr>
<td>
<td colspan="4" align="LEFT">We service:<span class="BOLD">
All occasions and hospitality services
</span>
<tr>
<td>
<td colspan="4" align="LEFT">We sell :<span class="BOLD">
----
</span>
</td></td></tr></td></td></tr></td></td></tr></td></td></tr></td></td></tr></td></td></tr></td></td></td></td></tr></td></td></td></td></tr></td></td></td></td></td>
How can I extract the address 120 NW 157TH AVE from this?
When I try
node.SelectSingleNode(".//td[@class='BOLD'][4]/preceding-sibling::td").InnerText;
I get the error:
Object reference not set to an instance of an object
Answer 0 (score: 1)
Your HTML is a mess of overlapping tags. I suggest using text nodes as your identifiers instead of indexes, for example:
.//td[./a[contains(text(),'See on Map')]]/td/text()
to get
120 NW 157TH AVE
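To illustrate why anchoring on text works, here is a minimal sketch that runs the same XPath against a hand-written, well-formed stand-in for the nested tree. The XML snippet, the class name AddressDemo, and the method ExtractAddress are assumptions made for illustration, and XPathSelectElement from System.Xml.XPath stands in for HtmlAgilityPack's SelectSingleNode so the example needs no external package.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

class AddressDemo
{
    // Hypothetical helper for this sketch only.
    public static string ExtractAddress()
    {
        // Well-formed stand-in (assumption) for the tree produced by the
        // unclosed <td> tags: each cell is nested inside the previous one.
        var row = XElement.Parse(
            "<tr><td><a>See on Map </a>" +
              "<td>120 NW 157TH AVE" +
                "<td class='BOLD'>Warehouse Hours:</td>" +
              "</td>" +
            "</td></tr>");

        // Anchor on the link text, then step into the nested <td>.
        XElement cell = row.XPathSelectElement(
            ".//td[./a[contains(text(),'See on Map')]]/td");

        // The first text node of that <td> holds the address;
        // the null-conditional operators avoid a NullReferenceException.
        return cell?.Nodes().OfType<XText>().FirstOrDefault()?.Value.Trim();
    }

    static void Main()
    {
        Console.WriteLine(ExtractAddress() ?? "(not found)");
    }
}
```

The lookup does not depend on how many cells or rows precede the address, only on the "See on Map" link sitting in the enclosing cell.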
Here is a complete example that extracts everything:
var table = doc.DocumentNode.SelectSingleNode("//table[contains(@class,'DataRows')]");
var name = table.SelectSingleNode(".//td[@class='BOLD']/text()").InnerText.Trim();
var fax = table.SelectSingleNode(".//td[contains(text(),'Fax')]/td/text()").InnerText.Trim();
var email = table.SelectSingleNode(".//span[contains(text(),'E-mail')]/following-sibling::text()").InnerText.Trim();
var address = table.SelectSingleNode(".//td[./a[contains(text(),'See on Map')]]/td/text()").InnerText.Trim();
var city = table.SelectSingleNode(".//tr[./td/a[contains(text(),'See on Map')]]//tr/td/td/span").InnerText.Trim().TrimEnd(',');
var zip = table.SelectSingleNode(".//tr[./td/a[contains(text(),'See on Map')]]//tr/td/td/span/following-sibling::text()").InnerText.Trim();
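Note that because your HTML is so messy, the XPath has to be messy too. Trying to access the tr elements by index will not work, because every tr element is nested inside the previous tr (within its unclosed td cells). What would be .//tr[4] in a normal table is reached here through a chain of nested rows, something like .//tr//tr//tr//tr.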
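The effect of the nesting on index-based XPath can be sketched as follows. The XML below is a simplified, well-formed model of the parsed tree (an assumption, with made-up cell text), and NestingDemo and RowLabel are hypothetical names. As above, System.Xml.XPath is used in place of HtmlAgilityPack to keep the sketch self-contained.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

class NestingDemo
{
    // Simplified model (assumption): each <tr> sits inside the last
    // unclosed <td> of the previous row, so no two rows are siblings.
    const string Nested =
        "<thead>" +
          "<tr><td>row 1<td>cell" +
            "<tr><td>row 2<td>cell" +
              "<tr><td>row 3<td>cell" +
                "<tr><td>row 4<td>cell</td></td></tr>" +
              "</td></td></tr>" +
            "</td></td></tr>" +
          "</td></td></tr>" +
        "</thead>";

    // Returns the leading text of the matched row's first cell,
    // or null when the XPath matches nothing.
    public static string RowLabel(string xpath)
    {
        XElement row = XElement.Parse(Nested).XPathSelectElement(xpath);
        return row?.Element("td")?.Nodes().OfType<XText>().FirstOrDefault()?.Value;
    }

    static void Main()
    {
        // tr[4] means "fourth tr child of the same parent"; none exists here.
        Console.WriteLine(RowLabel(".//tr[4]") ?? "(no match)");          // (no match)
        // Following the nesting reaches the fourth row.
        Console.WriteLine(RowLabel(".//tr//tr//tr//tr") ?? "(no match)"); // row 4
        // Parentheses index the whole result set in document order instead.
        Console.WriteLine(RowLabel("(.//tr)[4]") ?? "(no match)");        // row 4
    }
}
```

The same reasoning explains the error in the question: nested cells are not siblings, so the preceding-sibling::td lookup matches nothing, SelectSingleNode returns null, and reading .InnerText from it throws. Checking the result for null (or using ?.) before reading it avoids the exception.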