高级HTML Agility Pack用法

时间:2011-01-05 18:28:55

标签: c# screen-scraping html-agility-pack

我是HTML Agility Pack的新手,所以我需要一些帮助来解决下一步的问题。我可以做一些简单的事情,比如从一个href中提取一个值(知道我正在寻找的url字符串),并且我可以根据正在使用的特定类来调整范围中的值。但是我不明白如何在有大量或标签的情况下使用HTML Agility Pack,而这并不是一个真正的固定锚点?

这是我正在搜索的实际代码块。我在单元格中放置了虚拟数据来演示我在寻找什么。

提取以下内容的最佳方法是什么:

1。)公司名称?

2。)电话号码?

3.。)电子邮件地址?

... HTML

<td>
  <!-- Company Info -->
  <table cellpadding="0" cellspacing="0" border="0">
    <tr>
      <td class="black">
        <table cellspacing="1" cellpadding="0" border="0" width="370">
          <tr>
            <th>COMPANY NAME</th>
          </tr>
          <tr>
            <td class="search">
              <table cellpadding="5" cellspacing="0" border="0" width="100%">
                <tr>
                  <td>
                    <table cellpadding="1" cellspacing="0" border="0" width="100%">
                      <tr>
                        <td colspan="2" align="center">Un-needed Links...</td>
                      </tr>
                      <tr>
                        <td align="center" colspan="2"><hr></td>
                      </tr>
                      <tr>
                        <td align="right" nowrap>
                          <b>
                            <font color="FF0000">
                              Contact Person&nbsp;
                              <img src="/images/icon_contact.gif" align="absmiddle">&nbsp;:
                            </font>
                          </b>
                        </td>
                        <td align="left" width="100%">&nbsp;Judy Smith</td>
                      </tr>
                      <tr>
                        <td align="right" nowrap>
                        <b><font color="FF0000">Phone Number&nbsp;<img src="/images/icon_phone.gif" align="absmiddle">&nbsp;:</font></b></td>
                        <td align="left" width="100%">&nbsp;555-555-5555</td>
                      </tr>
                      <tr>
                        <td align="right" nowrap><b><font color="FF0000">E-mail Address&nbsp;<img src="/images/icon_email.gif" align="absmiddle">&nbsp;:</font></b></td>
                        <td align="left" width="100%">&nbsp;<a HREF="mailto:judy.smith@companyname.com">judy.smith@companyname.com</a></td>
                      </tr>
                      <tr>
                        <td align="center" colspan="2"><hr></td>
                      </tr>
                      <tr>
                        <td align="right" nowrap><b><font color="FF0000">Home Office Location&nbsp;<img src="/images/icon_home.gif" align="absmiddle">&nbsp;:</font></b></td>
                        <td align="left" width="100%">&nbsp;ATLANTA, GA</td>
                      </tr>
                      <tr>
                        <td align="right" nowrap><b><font color="FF0000">Home Office Phone&nbsp;<img src="/images/icon_home.gif" align="absmiddle">&nbsp;:</font></b></td>
                        <td align="left" width="100%">&nbsp;555-555-5555</td>
                      </tr>
                      <tr>
                        <td align="right" nowrap><b><font color="FF0000">Home Office Fax&nbsp;<img src="/images/icon_home.gif" align="absmiddle">&nbsp;:</font></b></td>
                        <td align="left" width="100%">&nbsp;666-666-6666</td>
                      </tr>
                      <tr>
                        <td align="center" colspan="2"><hr></td>
                      </tr>
                      <tr>
                        <td align="right" nowrap><b><font color="FF0000">Broker MC Number&nbsp;<img src="/images/icon_number.gif" align="absmiddle">&nbsp;:</font></b></td>
                        <td align="left" width="100%">&nbsp;123456</td>
                      </tr>
                      <tr>
                        <td align="right" nowrap><b><font color="FF0000">Carrier MC Number&nbsp;<img src="/images/icon_number.gif" align="absmiddle">&nbsp;:</font></b></td>
                        <td align="left" width="100%">&nbsp;654321</td>
                      </tr>
                    </table>
                  </td>
                </tr>
              </table>
            </td>
          </tr>
        </table>
      </td>
    </tr>
  </table>
  <br>

  <!-- Starting Point -->
  <table cellpadding="0" cellspacing="0" border="0">
    <tr>
      <td class="black">
        <table cellspacing="1" cellpadding="0" border="0" width="370">
          <tr>
            <th>Starting Point</th>
            <th>Available</th>
          </tr>
          <tr>
            <td class="search" width="270">&nbsp;<b>ABBEVILLE, GA&nbsp;</b></td>
            <td class="search" align="center" width="100"><span style="color: forestgreen">&nbsp;1/5/11&nbsp;</span></td>
          </tr>
        </table>
      </td>
    </tr>

  </table>
  <br>
  <!-- Destination Point -->
  <table cellpadding="0" cellspacing="0" border="0">
    <tr>
      <td class="black">
        <table cellspacing="1" cellpadding="0" border="0" width="370">
          <tr>
            <th>Destination Point</th>
            <th>Direction</th>
          </tr>
          <tr>
            <td class="search" width="270">&nbsp;<b>ATLANTA, GA&nbsp;</b></td>
            <td class="search" align="center" width="100"><span style="color: FF0000">&nbsp;&nbsp;</span></td>
          </tr>
        </table>
      </td>

    </tr>
  </table>
  <br>
  <!-- Truck Details -->
  <table cellpadding="0" cellspacing="0" border="0">
    <tr>
      <td class="black">
        <table cellspacing="1" cellpadding="0" border="0" width="370">
          <tr>
            <th>Truck Details</th>
          </tr>
          <tr>
            <td class="search">
              <table cellpadding="5" cellspacing="0" border="0">
                <tr>
                  <td>
                    <table cellpadding="0" cellspacing="0" border="0">
                      <tr>
                        <td align="right"><b>Date Posted&nbsp;:</b></td>
                        <td align="left">&nbsp;&nbsp;1/5/2011 10:34:48 AM</td>
                      </tr>
                      <tr>
                        <td align="right"><b>Quantity&nbsp;:</b></td>
                        <td align="left">&nbsp;&nbsp;1</td>
                      </tr>
                      <tr>
                        <td align="right"><b>Equipment Type&nbsp;:</b></td>
                        <td align="left">&nbsp;&nbsp;FT</td>
                      </tr>
                      <tr>
                        <td align="right"><b>Load Size&nbsp;:</b></td>
                        <td align="left">&nbsp;&nbsp;Full</td>
                      </tr>
                      <tr>
                        <td align="right" valign="top"><b>Special Information&nbsp;:</b></td>
                        <td align="left">&nbsp;&nbsp;</td>
                      </tr>
                    </table>
                  </td>
                </tr>
              </table>
            </td>
          </tr>
        </table>
      </td>
    </tr>
  </table>
  <br>
</td>

....更多HTML

1 个答案:

答案 0 :(得分:5)

嗯,您必须了解XPATH才能真正了解HTML敏捷包抓取功能:-)您可以在 XPATH examples 上开始使用Google。

关注屏幕抓取问题,棘手的部分是选择您认为最想要获得的信息的判别式xpath表达式。大多数情况下,不仅有一个解决方案,而且您必须准备好更新代码以坚持目标站点HTML演变。

因此,这是非常简单的表达方式之间的折衷,它们存在与不需要的文本相匹配的风险,以及过于歧视的表达方​​式,不能容忍被删除的HTML中的演变,并且存在与之无关的风险。

至于你的具体文字,这是一个很好的现实世界的例子,这是一个代码:

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(yourText);

string companyName = doc.DocumentNode.SelectSingleNode("/td/table/tr/td/table/tr/th").InnerText;
Console.WriteLine("company name=" + companyName);

// another way
companyName = doc.DocumentNode.SelectSingleNode("//td[@class='black']/table/tr/th").InnerText;
Console.WriteLine("company name=" + companyName);

// a more advanced XPATH expression, means
// "Select a TD tag anywhere in the doc that has a preceding sibling of TD type with a B chid, with a FONT child with inner text starting with 'Phone Number'"
string phoneNumber = doc.DocumentNode.SelectSingleNode("//td[starts-with(preceding-sibling::td/b/font/text(), 'Phone Number')]").InnerText;
Console.WriteLine("phone Number=" + phoneNumber);

// same kind of story but go down the next A tag
string email = doc.DocumentNode.SelectSingleNode("//td[starts-with(preceding-sibling::td/b/font/text(), 'E-mail')]/a").InnerText;
Console.WriteLine("email=" + email);

PS :请注意HTML Agility Pack始终希望XPATH表达式中使用的标记为小写,即使它们不在原始HTML文本中也是如此。

如您所见,此处使用两个不同的表达式检索公司名称。它们都可以处理样本,但如果在中间的任何地方添加新标记,第一个就无法抗拒。第二个更具有前瞻性,但基于CSS类标记,也可能会发生变化。这总是一种权衡。

电话号码&amp;电子邮件类似,但显示XPATH的力量。