Selecting page data from dynamic columns with HtmlAgilityPack

Date: 2015-03-06 17:51:48

Tags: c# html-agility-pack

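I'm using HtmlAgilityPack to scrape data from this URL: http://www.myfitnesspal.com/food/diary/chuckgross

Basically, the only data I really need are calories, protein, fat, and carbs. The problem is that the order of these columns is set by the user (and the user can even choose not to display some of them!).

I'm trying to return that page data into a class: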

public class NutritionRecord
    {
        public string Calories { get; set; }
        public string Protein { get; set; }
        public string Fat { get; set; }
        public string Carbs { get; set; }
    }

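My idea is to scrape the row with the column names (it's the footer), then scrape the Totals row, combine the two into a new table, and then somehow figure out how to select a column of data. I haven't gotten that far. This is what I have so far, but I feel like I'm just flailing: http://pastebin.com/uYvMYuM3

That code returns an HTML table, and I can't figure out how to get the data out of a column. In plain English: give me the data in the cell whose column header == "protein".

The table looks like this: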

<table class='resultsTable'>
    <tr class='labels'>
        <th>Calories</th>

        <th>Protein</th>

        <th>Fat</th>

        <th>Carbs</th>

        <th>Fiber</th>
    </tr>

    <tr class='resultsTotals'>
        <td>2,386</td>

        <td>194</td>

        <td>109</td>

        <td>161</td>

        <td>38</td>
    </tr>
</table>

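1 answer:

Answer 0: (score: 1)

Try this. You don't need to scrape the totals; just compute them from the results below. This should take care of hidden and reordered columns.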

 public class NutritionRecord
    {
        public string Meal { get; set; }
        public string MealPart { get; set; }
        public string Calories { get; set; }
        public string Protein { get; set; }
        public string Fat { get; set; }
        public string Carbs { get; set; }
        public string Fiber { get; set; }
        public string Sugar { get; set; }
    }

And the scraping part:

        // Needs: using System; using System.Collections.Generic; using System.Linq; using System.Net;
        var html = new WebClient().DownloadString("http://www.myfitnesspal.com/food/diary/chuckgross");
        var doc = new HtmlAgilityPack.HtmlDocument();
        doc.LoadHtml(html);

        var list = new List<NutritionRecord>();
        // Column names in the order the user has arranged them (IndexOf below finds the first occurrence).
        var orderedColumnsList = doc.DocumentNode.SelectNodes("//tr[@class='meal_header']/td[@class='alt']").Select(td => td.InnerText.Trim()).ToList();

        // Reads the named column from a data row; returns null if the column is hidden or the cell is missing.
        Func<HtmlAgilityPack.HtmlNode, string, string> cell = (row, name) =>
        {
            var index = orderedColumnsList.IndexOf(name);
            if (index < 0) return null;
            // + 2: XPath positions are 1-based and the first cell holds the food name.
            var td = row.SelectSingleNode(string.Format("./td[not(contains(@class, 'delete'))][{0}]", index + 2));
            return td == null ? null : td.InnerText.Trim();
        };

        var trs = doc.DocumentNode.SelectNodes("//tr").ToList();
        for (var i = 0; i < trs.Count; i++)
        {
            var classAttribute = trs[i].Attributes["class"];
            if (classAttribute == null || classAttribute.Value != "meal_header") continue;

            var meal = WebUtility.HtmlDecode(trs[i].SelectSingleNode("./td[@class='first alt']").InnerText.Trim());
            // Data rows are the attribute-less siblings that follow the meal header.
            var dataRows = trs[i].SelectNodes("./following-sibling::*").TakeWhile(tr => !tr.HasAttributes)
                .Select(tr => new NutritionRecord() {
                    Meal = meal,
                    MealPart = WebUtility.HtmlDecode(tr.SelectSingleNode("./td[@class='first alt']").InnerText.Trim()),
                    Calories = cell(tr, "Calories"),
                    Protein = cell(tr, "Protein"),
                    Fat = cell(tr, "Fat"),
                    Carbs = cell(tr, "Carbs"),
                    Fiber = cell(tr, "Fiber"),
                });
            list.AddRange(dataRows);
        }
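
The totals can then be computed from the scraped rows instead of being scraped. A minimal sketch, assuming the values are whole numbers with optional thousands separators (such as "2,386") and that a missing or non-numeric value should count as zero. `NutritionTotals` and `ParseOrZero` are hypothetical helper names, the record class is trimmed to the four fields from the question, and the two sample rows are made up:

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

public class NutritionRecord
{
    public string Calories { get; set; }
    public string Protein { get; set; }
    public string Fat { get; set; }
    public string Carbs { get; set; }
}

public static class NutritionTotals
{
    // Parses values such as "2,386"; returns 0 for null, empty, or non-numeric text (an assumption).
    public static int ParseOrZero(string text)
    {
        int value;
        return int.TryParse(text, NumberStyles.Integer | NumberStyles.AllowThousands,
                            CultureInfo.InvariantCulture, out value) ? value : 0;
    }

    // Sums one column across all scraped rows.
    public static int Sum(IEnumerable<NutritionRecord> records, Func<NutritionRecord, string> column)
    {
        return records.Sum(r => ParseOrZero(column(r)));
    }
}

public static class Program
{
    public static void Main()
    {
        // Made-up sample rows standing in for the scraped list.
        var list = new List<NutritionRecord>
        {
            new NutritionRecord { Calories = "1,200", Protein = "90", Fat = "50", Carbs = "80" },
            new NutritionRecord { Calories = "1,186", Protein = "104", Fat = "59", Carbs = "81" },
        };

        Console.WriteLine(NutritionTotals.Sum(list, r => r.Calories)); // 2386
        Console.WriteLine(NutritionTotals.Sum(list, r => r.Protein));  // 194
    }
}
```

Because a hidden column comes back as null, its total is simply 0 here; whether that is the right behaviour depends on how you want to report missing columns.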

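Also, to get the column order, read the InnerText of the column headers in order, then use IndexOf to find the index of a given column name and use that index to fetch the value, e.g.: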

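var orderedColumnsList = doc.DocumentNode.SelectNodes("//tr[@class='labels']/th").Select(th => th.InnerText.Trim()).ToList();
// XPath positions are 1-based, hence the + 1. IndexOf returns -1 for a hidden column, so check before indexing.
var carbsIndex = orderedColumnsList.IndexOf("Carbs");
var carbsValue = carbsIndex < 0 ? null : doc.DocumentNode.SelectSingleNode(string.Format("//tr[@class='resultsTotals']/td[{0}]", carbsIndex + 1)).InnerText;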
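
For completeness, here is the same header lookup as a small stand-alone program run against a table shaped like the one in the question. Treat it as a sketch: `TotalsByHeader` and `GetTotal` are hypothetical names, the case-insensitive match is my own choice (the question asks for header == "protein"), and the sample table deliberately reorders the columns and leaves out Protein to mimic a user's settings. It needs the HtmlAgilityPack package.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class TotalsByHeader
{
    // Returns the Totals value under the given header, or null when that column is not displayed.
    public static string GetTotal(HtmlDocument doc, string header)
    {
        var headers = doc.DocumentNode.SelectNodes("//tr[@class='labels']/th");
        if (headers == null) return null; // SelectNodes returns null when nothing matches

        var index = headers.Select(th => th.InnerText.Trim()).ToList()
                           .FindIndex(h => string.Equals(h, header, StringComparison.OrdinalIgnoreCase));
        if (index < 0) return null;

        // XPath positions are 1-based.
        var cell = doc.DocumentNode.SelectSingleNode(
            string.Format("//tr[@class='resultsTotals']/td[{0}]", index + 1));
        return cell == null ? null : cell.InnerText.Trim();
    }

    public static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(
            "<table class='resultsTable'>" +
            "<tr class='labels'><th>Fat</th><th>Calories</th><th>Carbs</th></tr>" +
            "<tr class='resultsTotals'><td>109</td><td>2,386</td><td>161</td></tr>" +
            "</table>");

        Console.WriteLine(GetTotal(doc, "calories") ?? "(not displayed)");
        Console.WriteLine(GetTotal(doc, "Protein") ?? "(not displayed)");
    }
}
```

Returning null for a missing column lets the caller decide how to handle it, instead of failing with a NullReferenceException on InnerText.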