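Scraping a timetable with HtmlAgilityPack

Time: 2012-09-11 08:59:19

Tags: c# html html-agility-pack

I need to fetch a timetable from a website. I want to store this timetable in a DataTable in my C# application.

The DataTable should be structured like this: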

| Day | Time  | Status |
|  1  |  7:00 | IN     |
|  1  |  9:45 | OUT    |
|  1  | 10:15 | IN     |
|  1  | 15:45 | OUT    |
|  1  |  8:45 | TOTAL  |
|  2  |  ..   | ..     |

My C# code for the DataTable:

DataTable table = new DataTable("Worksheet");
table.Columns.Add("Day");
table.Columns.Add("Time");
table.Columns.Add("Status");

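I have tried several variations, and I always end up scrambling the data.

For testing purposes I created a new WinForms form containing a text box (for the file path) and a button (to start the process).

Then I want HtmlAgilityPack to fetch all the data. An example: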

public string[] GREYsource;

public Form1()
{
    InitializeComponent();
}

private void btnSubmit_Click(object sender, EventArgs e)
{
    var doc = new HtmlAgilityPack.HtmlDocument();
    var fileName = txtPath.Text;                    // I downloaded the HTML-File
    doc.Load(fileName);

    string strGREYInner;

    foreach (HtmlNode td in doc.DocumentNode.SelectNodes("//tr[@class=\"tblDataGreyNH\"]"))
    {
        strGREYInner = td.InnerText.Trim();
        string shorted = strGREYInner.Replace("\t", "");
        string shorted2 = shorted.Replace("\n\n\n\n", "\n\n\n");
        string shorted3 = shorted2.Replace("\n\n\n", "\n\n");
        string shorted4 = shorted3.Replace("\n\n", "\n");
        GREYsource = shorted4.Split(new Char[] { '\n', });
    }

    foreach (string str in GREYsource)
    {
        ...
    }
}
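  1. Problem: the result contains a lot of tabs (\t) and newlines (\n) that I have to trim.
  2. Problem: this is not a good approach, IMO, and it only grabs the total times.
  3. It could be done better.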

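This is just one example of what I tried (the rest of my code is just a pile of junk).

I have attached the HTML structure below:

Overview (image): http://www.abload.de/img/overviewzoj18.png

In more detail: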
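The chained `Replace` calls above only collapse a fixed number of consecutive newlines. A minimal sketch of a more general cleanup, assuming all that is needed is the non-blank lines of an `InnerText` string; the class and method names (`WhitespaceSketch`, `NonBlankLines`) are made up for illustration:

```csharp
using System;
using System.Linq;

static class WhitespaceSketch
{
    // Splits raw InnerText on line breaks, trims tabs and spaces from each
    // piece, and drops the pieces that end up empty.
    public static string[] NonBlankLines(string innerText)
    {
        return innerText
            .Split(new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(line => line.Trim())          // Trim() also removes '\t'
            .Where(line => line.Length > 0)       // skip whitespace-only lines
            .ToArray();
    }
}
```

This works for any run length of tabs and newlines, but it still flattens the row into a list of strings, so the position of empty cells is lost.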

    <html>
      <head>
      </head>
      <style type="text/css">
      </style>
      <body id="body" onload="handleMenuOverlapLogo();onload_column_expand();;firstElementFocus();">
    
        <.. some (java)scripts>             /* can be ignored, not needed */
        <.. some other divs>              /* can be ignored, not needed */
        <div id="rowContent">             /* This <div> contains the content i need */
          <div id="titleTab">             /* Title is not necessary */
          </div>                    
          <div id="rowContentInner">          /* Here the content starts */
            <table class="tblList">
              <tbody>
                <tr>              /* not necessary */
                <tr class="tblHeader">      /* not necessary */
                <tr class="tblHeader">      /* not necessary */
                <tr class="tblDataWhiteNH">   /*  IN : */
                  <td class="tblHeader" style="font-weight: bold; text-align: right"> In </td>
                  <td nowrap="">        /* "tblDataWhiteNH" always contains 7 "td nowrap" */
                  <td nowrap="">
                  <td nowrap="">        /* Example: if it contains a value */
                    <table width="100%" border="0" align="center">
                    <tbody>
                        <tr>
                          <td width="25%" align="left"> </td>
                          <td nowrap="" width="50%" align="center"> 7:53 </td>  /* value = 7:53 (THIS!) */
                          <td width="25%" align="right"> </td>
                        </tr>
                      </tbody>
                    </table>
                  </td>
                  <td nowrap="">
                  <td nowrap="">        /* Example: if it contains no value */
                    <table width="100%" border="0" align="center">
                      <tbody>
                        <tr>
                          <td width="25%" align="left"> </td>
                          <td nowrap="" width="50%" align="center">       /* no value = 0:00 (THIS!) */
                          <td width="25%" align="right"> </td>
                        </tr>
                      </tbody>
                    </table>
                  </td>
                  <td nowrap="">
                  <td nowrap="">
                <tr class="tblDataWhiteNH">   /* OUT : */
                  <td class="tblHeader" style="font-weight: bold; text-align: right"> Out </td>
                  <td nowrap="">        /* "tblDataWhiteNH" always contains 7 "td nowrap". */
                  <td nowrap="">
                  <td nowrap="">        /* Example: if it contains a value */
                    <table width="100%" border="0" align="center">
                    <tbody>
                        <tr>
                          <td width="25%" align="left"> </td>
                          <td nowrap="" width="50%" align="center"> 7:53 </td>  /* value = 7:53 (THIS!) */
                          <td width="25%" align="right"> </td>
                        </tr>
                      </tbody>
                    </table>
                  </td>
                  <td nowrap="">
                  <td nowrap="">        /* Example: if it contains no value */
                    <table width="100%" border="0" align="center">
                      <tbody>
                        <tr>
                          <td width="25%" align="left"> </td>
                          <td nowrap="" width="50%" align="center">       /* no value = 0:00 (THIS!) */
                          <td width="25%" align="right"> </td>
                        </tr>
                      </tbody>
                    </table>
                  </td>
                  <td nowrap="">
                  <td nowrap="">
                <tr class="tblDataGreyNH">    /*  IN : */
                <tr class="tblDataGreyNH">    /* OUT : */
                ...               /* "tblDataGreyNH" is built the same way as "tblDataWhiteNH". */
                ...               /* Sometimes there are more "tblDataWhiteNH" and "tblDataGreyNH" rows. */
                ...               /* Usually there are just the "tblDataWhiteNH" (IN/OUT) rows. */
                <tr class="tblHeader">      /* not necessary */
                                /* It continues, e.g., with "tblDataWhite" if the last row above the header was "tblDataGrey", */
                                /* and vice versa ("grey" if there was a "white" before). */
                <tr class="tblDataWhiteNH">   /* Worked : */
                  <td class="tblHeader" style="font-weight: bold; text-align: right"> Total Time </td>
                  <td> 07:47 </td>      /* value = 7:47 (THIS!) */
                  <td> 04:48 </td>      
                  <td> 00:00 </td>      /* no value = 0:00 (THIS!) */
                  <td> 00:00 </td>      
                  <td> 07:42 </td>      
                  <td> 00:00 </td>      
                  <td> 00:00 </td>      
                </tr>
                <tr class="tblDataGreyNH">    /* Total : */
                  <td class="tblHeader" style="font-weight: bold; text-align: right"> Regular Time </td>
                  <td> 07:47 </td>      /* value = 7:47 (THIS!) */
                  <td> 04:48 </td>      
                  <td> </td>          /* no value = 0:00 (THIS!) */
                  <td> </td>          
                  <td> 07:42 </td>      
                  <td> </td>
                  <td> </td>
                </tr>
                <tr class="tblHeader">      /* not necessary */
                <tr valign="top">       /* not necessary */
              </tbody>
            </table>
          </div>
        </div>
      </body>
    </html>
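The comments in the structure above treat an empty cell as 0:00. A minimal sketch of that convention, assuming every non-empty cell holds a value of the form `h:mm` or `hh:mm`; the names `CellTimeSketch` and `ParseCell` are hypothetical:

```csharp
using System;
using System.Globalization;

static class CellTimeSketch
{
    // Converts the text of one table cell into a TimeSpan.
    // Empty or whitespace-only cells are treated as 0:00.
    public static TimeSpan ParseCell(string cellText)
    {
        string text = (cellText ?? string.Empty).Trim();
        if (text.Length == 0)
            return TimeSpan.Zero;

        string[] parts = text.Split(':');
        if (parts.Length != 2)
            throw new FormatException("Expected h:mm but got '" + text + "'");

        int hours = int.Parse(parts[0], CultureInfo.InvariantCulture);
        int minutes = int.Parse(parts[1], CultureInfo.InvariantCulture);
        return new TimeSpan(hours, minutes, 0);
    }
}
```

Splitting on the colon by hand, rather than relying on a `TimeSpan` format string, keeps the sketch usable if a total ever exceeds 23 hours.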
    

Copy of the original HTML: http://time.wnb.dk/123/

I hope someone can help me get this working.


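OK, let me explain with a picture: https://www.abload.de/img/eeeqnuwu.png
In the picture you can see the website and, below it, the table showing what the result should look like.

Declaring the DataTable is not the problem. The main problem is that I cannot get HtmlAgilityPack to return the correct results; when it returns anything at all, it is mostly wrong. Some of the node selections I tried produce output that becomes scrambled after a while. So far I have not been able to get all of the data out of the table on the website, only some values, and those are often wrong. So I am really looking for someone who can take a look and perhaps help me find the right node selection.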

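1 answer:

Answer 0: (score: 1)

I'm not sure I fully understand what you are trying to do, but here is some sample code to get you started. I strongly recommend reading up on XPath to understand it.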

        HtmlDocument doc = new HtmlDocument();
        doc.Load(yourFile);

        // select every TR with one of the two class names, anywhere in the document (//);
        // note that SelectNodes returns null, not an empty collection, when nothing matches
        foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//tr[@class='tblDataGreyNH' or @class='tblDataWhiteNH']"))
        {
            // get the first TD directly below the current row that has the tblHeader class (the row label)
            HtmlNode inOrOut = node.SelectSingleNode("td[@class='tblHeader']");
            if (inOrOut != null)
            {
                string io = inOrOut.InnerText.Trim();
                Console.WriteLine(io.ToUpper());
                if (io.Contains("Time"))
                {
                    // normalize-space trims and collapses whitespace (\r, \n, etc.)
                    // text() selects the node's text children
                    HtmlNodeCollection totals = node.SelectNodes("td[normalize-space(@class)='' and normalize-space(text())!='' and normalize-space(text())!='00:00']");
                    if (totals != null) // null when the row has no non-empty, non-zero cells
                    {
                        foreach (HtmlNode td in totals)
                        {
                            Console.WriteLine("value:" + td.InnerText.Trim());
                        }
                    }
                }
            }

            // gets all TD below the current node that define the NOWRAP attribute
            HtmlNodeCollection tdNoWraps = node.SelectNodes("td[@nowrap]"); 
            if (tdNoWraps != null)
            {
                foreach (HtmlNode tdNoWrap in tdNoWraps)
                {
                    string value = tdNoWrap.InnerText.Trim();
                    if (value == string.Empty)
                        continue;

                    Console.WriteLine("value:" + value);
                }
            }
        }

It will output the following for your sample page:

IN
value:7:47
value:7:46
value:7:45
value:7:51
OUT
value:15:35
value:15:33
value:12:38
value:8:59
IN
value:12:38
value:8:59
OUT
value:15:35
TOTAL TIME
value:07:48
value:07:47
value:07:50
value:01:08
REGULAR TIME
value:07:48
value:07:47
value:07:50
value:01:08
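The output above skips empty cells, so the day each value belongs to is lost. A rough sketch of one way to keep it, assuming each data row holds one label cell followed by one cell per day, and that the position of a cell can serve as the Day value. The names `TimetableSketch` and `Parse` are made up for illustration, and the row label is stored as the Status unchanged apart from upper-casing:

```csharp
using System;
using System.Data;
using HtmlAgilityPack;

static class TimetableSketch
{
    public static DataTable Parse(HtmlDocument doc)
    {
        var table = new DataTable("Worksheet");
        table.Columns.Add("Day", typeof(int));
        table.Columns.Add("Time");
        table.Columns.Add("Status");

        HtmlNodeCollection rows = doc.DocumentNode.SelectNodes(
            "//tr[@class='tblDataGreyNH' or @class='tblDataWhiteNH']");
        if (rows == null)
            return table; // nothing matched

        foreach (HtmlNode row in rows)
        {
            HtmlNode label = row.SelectSingleNode("td[@class='tblHeader']");
            if (label == null)
                continue;
            string status = label.InnerText.Trim().ToUpperInvariant();

            // Direct child cells except the label: assumed to be one per day.
            HtmlNodeCollection cells = row.SelectNodes("td[not(@class='tblHeader')]");
            if (cells == null)
                continue;

            for (int i = 0; i < cells.Count; i++)
            {
                string time = HtmlEntity.DeEntitize(cells[i].InnerText).Trim();
                if (time.Length == 0)
                    continue; // empty cell: no entry for that day
                table.Rows.Add(i + 1, time, status); // Day is the 1-based cell position
            }
        }
        return table;
    }
}
```

Iterating by index instead of filtering in XPath is what preserves the day. Whether the cell position really matches the day depends on how HtmlAgilityPack parses the actual page, which is not verified here.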