How to scrape values from a web page with Html Agility Pack

Time: 2016-12-02 16:16:09

Tags: c# html web-scraping html-agility-pack

I need some values from a web page, so I am building a scraper with Html Agility Pack.

Below are the site's HTML and my C# code.

The site's HTML:

  <div class="box-overflow">
    <div class="box-overflow__in">
      <table class="table-main js-tablebanner-t js-tablebanner-ntb">
        <tr>
          <th class="h-text-left" colspan="2">17. Round</th>

          <th class="h-text-center">1</th>

          <th class="h-text-center">X</th>

          <th class="h-text-center">2</th>

          <th>&nbsp;</th>
        </tr>

        <tr>
          <td class="h-text-left"><a href=
          "/soccer/poland/ekstraklasa/lechia-gdansk-leczna/Kjnscb6D/" class=
          "in-match"><span>Lechia Gdansk</span> - <span>Leczna</span></a></td>

          <td class="h-text-center"><a href=
          "/soccer/poland/ekstraklasa/lechia-gdansk-leczna/Kjnscb6D/">3:0</a></td>

          <td class="table-matches__odds colored"></td>

          <td class="table-matches__odds" data-odd="4.04"></td>

          <td class="table-matches__odds" data-odd="6.29"></td>

          <td class="h-text-right h-text-no-wrap">28.11.2016</td>
        </tr>

        <tr>
          <td class="h-text-left"><a href=
          "/soccer/poland/ekstraklasa/plock-piast-gliwice/KrhILsqE/" class=
          "in-match"><span>Plock</span> - <span>Piast Gliwice</span></a></td>

          <td class="h-text-center"><a href=
          "/soccer/poland/ekstraklasa/plock-piast-gliwice/KrhILsqE/">0:0</a></td>

          <td class="table-matches__odds" data-odd="2.05"></td>

          <td class="table-matches__odds colored"></td>

          <td class="table-matches__odds" data-odd="3.50"></td>

          <td class="h-text-right h-text-no-wrap">27.11.2016</td>
        </tr>

        <tr>
          <td class="h-text-left"><a href=
          "/soccer/poland/ekstraklasa/slask-wroclaw-legia/bZjMK1bK/" class=
          "in-match"><span>Slask Wroclaw</span> - <span>Legia</span></a></td>

          <td class="h-text-center"><a href=
          "/soccer/poland/ekstraklasa/slask-wroclaw-legia/bZjMK1bK/">0:4</a></td>

          <td class="table-matches__odds" data-odd="4.53"></td>

          <td class="table-matches__odds" data-odd="3.64"></td>

          <td class="table-matches__odds colored"></td>

          <td class="h-text-right h-text-no-wrap">27.11.2016</td>
        </tr>
      </table>
    </div>
  </div>

My C# code:

        var url = "http://www.betexplorer.com/soccer/poland/ekstraklasa/results/";

        var web = new HtmlWeb();
        var doc = web.Load(url);

        Bets = new List<Bet>();



        // Read the rows
        var Rows = doc.DocumentNode.SelectNodes("//table");

        foreach (var row in Rows)
        {
            if (!row.GetAttributeValue("class", "").Contains("table-main js-tablebanner-t js-tablebanner-ntb"))
            {
                if (string.IsNullOrEmpty(row.InnerText))
                    continue;

                var rowBet = new Bet();
                foreach (var node in row.ChildNodes)
                {
                    var data_odd = node.GetAttributeValue("data-odd", "");

                    if (string.IsNullOrEmpty(data_odd))
                    {
                        if (node.GetAttributeValue("class", "").Contains("in-match"))
                        {
                            rowBet.Match = node.InnerText.Trim();
                            var matchTeam = rowBet.Match.Split(new[] { " - " }, StringSplitOptions.RemoveEmptyEntries);
                            rowBet.Home = matchTeam[0];
                            rowBet.Host = matchTeam[1];
                        }


                        if (node.GetAttributeValue("class", "").Contains("h-text-center"))
                        {
                            rowBet.Result = node.InnerText.Trim();
                            var matchPoints = rowBet.Result.Split(new[] { ':' }, StringSplitOptions.RemoveEmptyEntries);
                            int help;
                            if (int.TryParse(matchPoints[0], out help))
                            {
                                rowBet.HomePoints = help;
                            }
                            if (matchPoints.Length == 2 && int.TryParse(matchPoints[1], out help))
                            {
                                rowBet.HostPoints = help;
                            }

                        }


                        if (node.GetAttributeValue("class", "").Contains("h-text-right h-text-no-wrap"))
                            rowBet.Date = node.InnerText.Trim();

                    }
                    else
                    {
                        rowBet.Odds.Add(data_odd);
                    }
                }

                if (!string.IsNullOrEmpty(rowBet.Match))
                    Bets.Add(rowBet);
            }
        }

Some more details:

I need to extract the team names (e.g. Lechia Gdansk - Leczna),
the result (e.g. 3:0),
the data-odd values (e.g. 1.49, 4.04, 6.29),
and the match date (e.g. 28.11.2016).

If anyone needs more information, just ask and I will provide it. Thanks.

1 answer:

Answer 0: (score: 1)

I would do it like this:

var list =  doc.DocumentNode.SelectSingleNode("//table[@class='table-main js-tablebanner-t js-tablebanner-ntb']")
                .Descendants("tr")
                .Select(x => new
                {
                    Val1 = x.SelectSingleNode("td[@class='h-text-left']")?.InnerText,
                    Val2 = x.SelectSingleNode("td[@class='h-text-center']")?.InnerText
                })
                .Where(x => x.Val1!=null)
                .ToList();
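
The query above returns only the team cell and the score cell. A minimal sketch of how the same idea might be extended to all four values the question asks for (team names, result, data-odd values, date) follows. The `MatchRow` type and the `ResultsParser.Parse` helper are hypothetical names introduced for illustration, since the question's `Bet` class is not shown. The sketch also assumes the markup matches the HTML posted in the question, and it uses `contains(@class, ...)` rather than an exact class match so that cells with several classes are still found.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical result type; stands in for the question's Bet class, which is not shown.
public class MatchRow
{
    public string Home { get; set; }
    public string Away { get; set; }
    public string Result { get; set; }
    public List<string> Odds { get; set; }
    public string Date { get; set; }
}

public static class ResultsParser
{
    // Parses the results table from an HTML string shaped like the question's sample.
    public static List<MatchRow> Parse(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var matches = new List<MatchRow>();

        // Only rows that contain <td> cells; the header row has <th> cells only.
        var rows = doc.DocumentNode.SelectNodes(
            "//table[contains(@class,'table-main')]//tr[td]");
        if (rows == null)
            return matches; // SelectNodes returns null when nothing matches

        foreach (var row in rows)
        {
            // The two <span> elements inside the first cell hold the team names.
            var teams = row.SelectNodes("td[contains(@class,'h-text-left')]//span");
            if (teams == null || teams.Count < 2)
                continue;

            var score = row.SelectSingleNode("td[contains(@class,'h-text-center')]");
            var date = row.SelectSingleNode("td[contains(@class,'h-text-right')]");
            var oddsCells = row.SelectNodes("td[contains(@class,'table-matches__odds')]");

            matches.Add(new MatchRow
            {
                Home = HtmlEntity.DeEntitize(teams[0].InnerText).Trim(),
                Away = HtmlEntity.DeEntitize(teams[1].InnerText).Trim(),
                Result = score?.InnerText.Trim(),
                // In the posted HTML one odds cell per row has no data-odd attribute;
                // it is kept as an empty string so the 1/X/2 positions stay aligned.
                Odds = oddsCells == null
                    ? new List<string>()
                    : oddsCells.Select(c => c.GetAttributeValue("data-odd", "")).ToList(),
                Date = date?.InnerText.Trim()
            });
        }

        return matches;
    }
}
```

With the question's code this could be called as `ResultsParser.Parse(doc.DocumentNode.OuterHtml)`, or the loop body could be applied directly to the `doc` already loaded by `HtmlWeb`. If the odds appear in the posted HTML but not in what `HtmlWeb.Load` returns, the page may be filling them in with JavaScript, which Html Agility Pack does not execute.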