Extracting data with headers from an HTML table

Time: 2018-11-18 05:30:17

Tags: c# html selenium-webdriver

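Here is some sample HTML. We want to extract data from it and save it in a database. What is the simplest and fastest way to extract the data, other than walking up through parent elements or using an ID, name, or class? I am using Selenium with C# for this, but I don't understand how to extract the data from the tags. As you can see, the tags have no IDs or names to locate them by.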

<tr>
        <td height="87" valign="top">
            <table width="730" border="0" cellpadding="0" cellspacing="0">
                <tbody><tr>
                    <td width="78" height="87" style="border-top-width: 1px;    border-left-width: 1px; border-top-style: solid; border-left-style: solid; border-top-color: #CC0000; border-left-color: #CC0000;">
                        <img src="LogoWebBill.gif" width="78" height="86">
                    </td>
                    <td valign="top">
                        <table width="651" border="0" cellpadding="0" cellspacing="0">
                            <tbody><tr>
                                <td height="22" style="border-top-width: 1px;border-left-width: 1px;    border-top-style: solid; border-left-style: solid; border-top-color: #CC0000; border-left-color: #CC0000;border-right-width: 1px;border-right-style: solid;border-right-color: #CC0000;">
                                    <p align="center" class="FieldCaption">
                                        <strong><font size="2">LAHORE ELECTRIC SUPPLY COMPANY - ELECTRICITY CONSUMER BILL(MDI)</font></strong></p>
                                </td>
                            </tr>
                            <tr>
                                <td height="18" style="border-left-width: 1px; border-left-style: solid; border-left-color: #CC0000; border-right-width: 1px;border-right-style: solid;border-right-color: #CC0000;">
                                    <div align="center">
                                        <p class="FieldCaption">
                                            http://www.lesco.gov.pk</p>
                                    </div>
                                </td>
                            </tr>
                            <tr>
                                <td valign="top">
                                    <table width="651" border="0" cellpadding="0" cellspacing="0">
                                        <tbody><tr class="FieldCaption">
                                            <td width="248" height="19" style="border-top-width: 1px;   border-left-width: 1px; border-top-style: solid; border-left-style: solid; border-top-color: #CC0000; border-left-color: #CC0000;">
                                                <div align="left">
                                                    &nbsp;CUSTOMER I.D.
                                                </div>
                                            </td>
                                            <td width="51" style="border-top-width: 1px;    border-left-width: 1px; border-top-style: solid; border-left-style: solid; border-top-color: #CC0000; border-left-color: #CC0000;">
                                                <div align="center">
                                                    &nbsp;ED@</div>
                                            </td>
                                            <td width="86" style="border-top-width: 1px;    border-left-width: 1px; border-top-style: solid; border-left-style: solid; border-top-color: #CC0000; border-left-color: #CC0000;">
                                                <div align="center">
                                                    &nbsp;BILL MONTH</div>
                                            </td>
                                            <td width="89" style="border-top-width: 1px;    border-left-width: 1px; border-top-style: solid; border-left-style: solid; border-top-color: #CC0000; border-left-color: #CC0000;">
                                                <div align="center">
                                                    &nbsp;READING DATE</div>
                                            </td>
                                            <td width="89" class="FieldCaption" style="border-top-width: 1px;   border-left-width: 1px; border-top-style: solid; border-left-style: solid; border-top-color: #CC0000; border-left-color: #CC0000;">
                                                <div align="center">
                                                    &nbsp;ISSUE DATE</div>
                                            </td>
                                            <td width="89" style="border-top-width: 1px;border-left-width: 1px; border-top-style: solid; border-left-style: solid; border-top-color: #CC0000; border-left-color: #CC0000;border-right-width: 1px;border-right-style: solid;border-right-color: #CC0000;">
                                                <div align="center">
                                                    <font color="#0066ff">&nbsp;DUE DATE</font></div>
                                            </td>
                                        </tr>
                                        <tr>
                                            <td height="28" class="GeneralText" style="border-top-width: 1px;   border-left-width: 1px; border-top-style: solid; border-left-style: solid; border-top-color: #CC0000; border-left-color: #CC0000;">
                                                <div align="left">
                                                    &nbsp;2000125</div>
                                            </td>
                                            <td class="GeneralText" style="border-top-width: 1px;   border-left-width: 1px; border-top-style: solid; border-left-style: solid; border-top-color: #CC0000; border-left-color: #CC0000;">
                                                <div align="center">
                                                    &nbsp;1.0%</div>
                                            </td>
                                            <td class="GeneralText" style="border-top-width: 1px;   border-left-width: 1px; border-top-style: solid; border-left-style: solid; border-top-color: #CC0000; border-left-color: #CC0000;">
                                                <div align="center">
                                                    &nbsp;Oct 18</div>
                                            </td>
                                            <td class="GeneralText" style="border-top-width: 1px;   border-left-width: 1px; border-top-style: solid; border-left-style: solid; border-top-color: #CC0000; border-left-color: #CC0000;">
                                                <div align="center">
                                                    &nbsp;02 NOV 18</div>
                                            </td>
                                            <td class="GeneralText" style="border-top-width: 1px;   border-left-width: 1px; border-top-style: solid; border-left-style: solid; border-top-color: #CC0000; border-left-color: #CC0000;">
                                                <div align="center">
                                                    &nbsp;08 NOV 18</div>
                                            </td>
                                            <td class="GeneralText" style="border-top-width: 1px;border-left-width: 1px;    border-top-style: solid; border-left-style: solid; border-top-color: #CC0000; border-left-color: #CC0000;border-right-width: 1px;border-right-style: solid;border-right-color: #CC0000;">
                                                <div align="center">
                                                    &nbsp;23 11 2018</div>
                                            </td>
                                        </tr>
                                    </tbody></table>
                            </td></tr>
                        </tbody></table>
                    </td>
                </tr>
            </tbody></table>
        </td>
    </tr>

1 answer:

Answer 0 (score: 0)

Get the table's innerHTML / outerHTML and extract the data with an HTML parser

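Find an element that can be identified reliably, then read the innerHTML attribute of the element you need to get its markup as a string.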

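 // The logo image is the one uniquely identifiable element; the cell next to its parent cell holds the bill tables
 string html = driver.FindElement(By.XPath("//img[@src='LogoWebBill.gif']/parent::td/following-sibling::td")).GetAttribute("innerHTML");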

Then parse it offline with an HTML parser (Html Agility Pack)

// Requires: using System.Linq; using HtmlAgilityPack;
var doc = new HtmlDocument();
doc.LoadHtml(html);

// Text of the first table row, trimmed of surrounding whitespace
var title = doc.DocumentNode
    .SelectNodes("//tbody/tr")
    .First()
    .InnerText
    .Trim();
// This returns LAHORE ELECTRIC SUPPLY COMPANY - ELECTRICITY CONSUMER BILL(MDI)
// Similarly find the header and rows data with loops
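
The loop over headers and rows mentioned above could look roughly like the following. This is only a sketch under assumptions that are not stated in the question: that the caption row is the only `<tr>` with class `FieldCaption`, that the values sit in the `<tr>` immediately after it, and that cells line up by column position. The class name `BillTableParser` and the method `ParseHeaderAndValues` are hypothetical names chosen for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using HtmlAgilityPack;

public static class BillTableParser
{
    // Decodes entities such as &nbsp; and trims surrounding whitespace.
    private static string Clean(string text) =>
        WebUtility.HtmlDecode(text).Replace('\u00A0', ' ').Trim();

    // Pairs each caption cell with the value cell at the same column index.
    public static Dictionary<string, string> ParseHeaderAndValues(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Assumption: the caption row is the only <tr> with class "FieldCaption".
        var headerRow = doc.DocumentNode.SelectSingleNode("//tr[@class='FieldCaption']");
        if (headerRow == null)
            throw new InvalidOperationException("Caption row not found.");

        // Assumption: the values are in the <tr> directly after the caption row.
        var valueRow = headerRow.SelectSingleNode("following-sibling::tr[1]");
        if (valueRow == null)
            throw new InvalidOperationException("Value row not found.");

        var headers = headerRow.Elements("td").Select(td => Clean(td.InnerText)).ToList();
        var values = valueRow.Elements("td").Select(td => Clean(td.InnerText)).ToList();

        // Stop at the shorter row so a missing cell does not throw.
        var result = new Dictionary<string, string>();
        for (int i = 0; i < Math.Min(headers.Count, values.Count); i++)
        {
            result[headers[i]] = values[i];
        }
        return result;
    }
}
```

Called as `BillTableParser.ParseHeaderAndValues(html)` with the `html` string obtained from Selenium, this should yield pairs such as BILL MONTH and Oct 18, which can then be written to the database. Matching by column index keeps the code independent of IDs and names, but it will break if the site ever adds a cell with `colspan`.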