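Here is some sample HTML. We need to extract data from it and save it in a database. Other than going through the parent element or using an ID, name, or class, what is the simplest and fastest way to extract the data? I am using Selenium with C# for this, but I don't understand how to pull the data out of the tags. As you can see, the tags have no ID or name to locate them by.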
<tr>
<td height="87" valign="top">
<table width="730" border="0" cellpadding="0" cellspacing="0">
<tbody><tr>
<td width="78" height="87" style="border-top-width: 1px; border-left-width: 1px; border-top-style: solid; border-left-style: solid; border-top-color: #CC0000; border-left-color: #CC0000;">
<img src="LogoWebBill.gif" width="78" height="86">
</td>
<td valign="top">
<table width="651" border="0" cellpadding="0" cellspacing="0">
<tbody><tr>
<td height="22" style="border-top-width: 1px;border-left-width: 1px; border-top-style: solid; border-left-style: solid; border-top-color: #CC0000; border-left-color: #CC0000;border-right-width: 1px;border-right-style: solid;border-right-color: #CC0000;">
<p align="center" class="FieldCaption">
<strong><font size="2">LAHORE ELECTRIC SUPPLY COMPANY - ELECTRICITY CONSUMER BILL(MDI)</font></strong></p>
</td>
</tr>
<tr>
<td height="18" style="border-left-width: 1px; border-left-style: solid; border-left-color: #CC0000; border-right-width: 1px;border-right-style: solid;border-right-color: #CC0000;">
<div align="center">
<p class="FieldCaption">
http://www.lesco.gov.pk</p>
</div>
</td>
</tr>
<tr>
<td valign="top">
<table width="651" border="0" cellpadding="0" cellspacing="0">
<tbody><tr class="FieldCaption">
<td width="248" height="19" style="border-top-width: 1px; border-left-width: 1px; border-top-style: solid; border-left-style: solid; border-top-color: #CC0000; border-left-color: #CC0000;">
<div align="left">
CUSTOMER I.D.
</div>
</td>
<td width="51" style="border-top-width: 1px; border-left-width: 1px; border-top-style: solid; border-left-style: solid; border-top-color: #CC0000; border-left-color: #CC0000;">
<div align="center">
ED@</div>
</td>
<td width="86" style="border-top-width: 1px; border-left-width: 1px; border-top-style: solid; border-left-style: solid; border-top-color: #CC0000; border-left-color: #CC0000;">
<div align="center">
BILL MONTH</div>
</td>
<td width="89" style="border-top-width: 1px; border-left-width: 1px; border-top-style: solid; border-left-style: solid; border-top-color: #CC0000; border-left-color: #CC0000;">
<div align="center">
READING DATE</div>
</td>
<td width="89" class="FieldCaption" style="border-top-width: 1px; border-left-width: 1px; border-top-style: solid; border-left-style: solid; border-top-color: #CC0000; border-left-color: #CC0000;">
<div align="center">
ISSUE DATE</div>
</td>
<td width="89" style="border-top-width: 1px;border-left-width: 1px; border-top-style: solid; border-left-style: solid; border-top-color: #CC0000; border-left-color: #CC0000;border-right-width: 1px;border-right-style: solid;border-right-color: #CC0000;">
<div align="center">
<font color="#0066ff"> DUE DATE</font></div>
</td>
</tr>
<tr>
<td height="28" class="GeneralText" style="border-top-width: 1px; border-left-width: 1px; border-top-style: solid; border-left-style: solid; border-top-color: #CC0000; border-left-color: #CC0000;">
<div align="left">
2000125</div>
</td>
<td class="GeneralText" style="border-top-width: 1px; border-left-width: 1px; border-top-style: solid; border-left-style: solid; border-top-color: #CC0000; border-left-color: #CC0000;">
<div align="center">
1.0%</div>
</td>
<td class="GeneralText" style="border-top-width: 1px; border-left-width: 1px; border-top-style: solid; border-left-style: solid; border-top-color: #CC0000; border-left-color: #CC0000;">
<div align="center">
Oct 18</div>
</td>
<td class="GeneralText" style="border-top-width: 1px; border-left-width: 1px; border-top-style: solid; border-left-style: solid; border-top-color: #CC0000; border-left-color: #CC0000;">
<div align="center">
02 NOV 18</div>
</td>
<td class="GeneralText" style="border-top-width: 1px; border-left-width: 1px; border-top-style: solid; border-left-style: solid; border-top-color: #CC0000; border-left-color: #CC0000;">
<div align="center">
08 NOV 18</div>
</td>
<td class="GeneralText" style="border-top-width: 1px;border-left-width: 1px; border-top-style: solid; border-left-style: solid; border-top-color: #CC0000; border-left-color: #CC0000;border-right-width: 1px;border-right-style: solid;border-right-color: #CC0000;">
<div align="center">
23 11 2018</div>
</td>
</tr>
</tbody></table>
</td></tr>
</tbody></table>
</td>
</tr>
</tbody></table>
</td>
</tr>
Answer 0 (score: 0)
Get the table's innerHTML / outerHTML and extract the data with an HTML parser
Locate an element you can identify reliably, then read that element's inner HTML.
string html = driver.FindElement(By.XPath("//img[@src='LogoWebBill.gif']/parent::td/following-sibling::td")).GetAttribute("innerHTML");
Then parse it offline with an HTML parser (Html Agility Pack)
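// Requires: using System.Linq; using HtmlAgilityPack;
var doc = new HtmlDocument();
doc.LoadHtml(html);
var title = doc.DocumentNode
    .SelectNodes("//tbody/tr")   // every row under a <tbody>, in document order
    .First()                     // the first row holds the bill title
    .InnerText
    .Trim();                     // drop the surrounding whitespace and newlines
// This returns LAHORE ELECTRIC SUPPLY COMPANY - ELECTRICITY CONSUMER BILL(MDI)
// Similarly, find the header and row data with loops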
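The loop over the header and data rows mentioned above could look roughly like the sketch below. It assumes the caption row is the only <tr> with class="FieldCaption" (true for the sample HTML, but not guaranteed for the live page) and that the values sit in the row immediately after it. The names BillParser, ParseBillHeader, and Clean are made up for this illustration and are not part of Selenium or Html Agility Pack.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class BillParser
{
    // Hypothetical helper: decodes HTML entities and collapses whitespace runs.
    private static string Clean(HtmlNode cell) =>
        string.Join(" ", HtmlEntity.DeEntitize(cell.InnerText)
            .Split((char[])null, StringSplitOptions.RemoveEmptyEntries));

    // Pairs each caption cell with the value cell in the same column.
    public static Dictionary<string, string> ParseBillHeader(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Assumption: the caption row is the only <tr class="FieldCaption">.
        var headerRow = doc.DocumentNode.SelectSingleNode("//tr[@class='FieldCaption']");
        if (headerRow == null)
            throw new InvalidOperationException("Caption row not found.");

        // Assumption: the values are in the next <tr> sibling.
        var valueRow = headerRow.SelectSingleNode("following-sibling::tr[1]");
        if (valueRow == null)
            throw new InvalidOperationException("Value row not found.");

        // SelectNodes returns null (not an empty list) when nothing matches.
        var headerCells = headerRow.SelectNodes("./td");
        var valueCells = valueRow.SelectNodes("./td");
        if (headerCells == null || valueCells == null)
            throw new InvalidOperationException("Rows contain no cells.");

        // Zip stops at the shorter row; ToDictionary throws on duplicate captions.
        return headerCells.Select(Clean)
            .Zip(valueCells.Select(Clean), (caption, value) => new { caption, value })
            .ToDictionary(pair => pair.caption, pair => pair.value);
    }
}
```

Called with the html string obtained from Selenium, ParseBillHeader(html)["CUSTOMER I.D."] should give 2000125 for the sample above, and the resulting dictionary can then be written to the database column by column.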