Using LINQ with HtmlAgilityPack to extract data from a website

Asked: 2015-08-21 20:36:50

Tags: linq c#-4.0 html-agility-pack data-extraction

I am using C# and HtmlAgilityPack to extract item names, prices, and currency symbols from a Chinese website: https://meadjohnson.world.tmall.com/search.htm?search=y&orderType=defaultSort&scene=taobao_shop. Here is the main part of the HTML:

<div class="SaleItems">
    <dl class="item ">
        <dt class="photo"></dt>
        <dd class="detail">
            <a class="item-name">iPad</a>
            <div class="price-area">
                <span class="symbol">USD</span>
                <span class="price">379</span>
            </div>
        </dd>
    </dl>
    <dl class="item ">
        <dt class="photo"></dt>
        <dd class="detail">
            <a class="item-name">iPod</a>
            <div class="price-area">
                <span class="symbol">CAD</span>
                <span class="price">139</span>
            </div>
        </dd>
    </dl>
</div>

This is what my program looks like so far:

ServicePointManager.SecurityProtocol = SecurityProtocolType.Tls
    | SecurityProtocolType.Tls11
    | SecurityProtocolType.Tls12
    | SecurityProtocolType.Ssl3;

var htmlDocument = htmlWeb.Load(html);
var sItems = doc.DocumentNode.Descendants("SaleItems"); 
foreach (var item in sItems)
{
  var data = new {
         Currency  = item["symbol"].InnerText,
         Price = item["price"].InnerText,
         };
}

This does not work. What am I doing wrong, and how can I fix it?

2 answers:

Answer 0 (score: 1)

You can extract the data this way:

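// Requires the HtmlAgilityPack NuGet package.
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

var input = @"<div class='SaleItems'>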
    <dl class='item '>
        <dt class='photo'></dt>
        <dd class='detail'>
            <a class='item-name'>iPad</a>
            <div class='price-area'>
                <span class='symbol'>USD</span>
                <span class='price'>379</span>
            </div>
        </dd>
    </dl>
    <dl class='item '>
        <dt class='photo'></dt>
        <dd class='detail'>
            <a class='item-name'>iPod</a>
            <div class='price-area'>
                <span class='symbol'>CAD</span>
                <span class='price'>139</span>
            </div>
        </dd>
    </dl>
</div>";
var html = new HtmlDocument();
html.LoadHtml(input);
var root = html.DocumentNode;
var list = new List<Data>();
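foreach (var node in root.Descendants("dl"))
{
    // Descendants() walks every nested node; take the first one whose class matches.
    // FirstOrDefault() returns null when nothing matches, so .InnerText throws
    // a NullReferenceException for an item that lacks the span.
    var currency = node.Descendants()
       .Where(n => n.GetAttributeValue("class", "").Equals("symbol")).FirstOrDefault().InnerText;
    var price = node.Descendants()
       .Where(n => n.GetAttributeValue("class", "").Equals("price")).FirstOrDefault().InnerText;
    list.Add(new Data { Currency = currency, Price = price });
}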

public class Data
{
    public string Currency { get; set; }
    public string Price { get; set; }
}

Alternatively, you can use a query expression in place of the foreach loop:

var list = (from node in root.Descendants("dl") 
            let currency = node.Descendants().Where(n => n.GetAttributeValue("class", "").Equals("symbol")).FirstOrDefault().InnerText 
            let price = node.Descendants().Where(n => n.GetAttributeValue("class", "").Equals("price")).FirstOrDefault().InnerText 
            select new Data {Currency = currency, Price = price}).ToList();
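
Both versions above call .InnerText directly on the result of FirstOrDefault(), so a single dl without a price block stops the whole run. As a rough sketch of a more defensive variant (the names SaleItemParser and TextByClass are made up for this illustration and are not part of HtmlAgilityPack), the lookup can be moved into a helper that returns null and the incomplete items skipped. It avoids newer language features to stay close to the c#-4.0 tag:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public class Data
{
    public string Currency { get; set; }
    public string Price { get; set; }
}

public static class SaleItemParser
{
    // Hypothetical helper: trimmed text of the first descendant whose class
    // attribute equals className, or null when no such node exists.
    private static string TextByClass(HtmlNode node, string className)
    {
        HtmlNode match = node.Descendants()
            .FirstOrDefault(n => n.GetAttributeValue("class", "") == className);
        return match == null ? null : match.InnerText.Trim();
    }

    public static List<Data> Parse(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var list = new List<Data>();
        foreach (HtmlNode dl in doc.DocumentNode.Descendants("dl"))
        {
            string currency = TextByClass(dl, "symbol");
            string price = TextByClass(dl, "price");
            if (currency == null || price == null)
                continue; // skip items without a complete price block
            list.Add(new Data { Currency = currency, Price = price });
        }
        return list;
    }

    public static void Main()
    {
        // Second item deliberately has no price block.
        string html = "<dl class='item '><dd><span class='symbol'>USD</span>"
                    + "<span class='price'>379</span></dd></dl>"
                    + "<dl class='item '><dd>no price here</dd></dl>";
        foreach (Data d in Parse(html))
            Console.WriteLine(d.Currency + " " + d.Price);
    }
}
```

Whether skipping incomplete items is the right behaviour depends on the page; logging them or throwing a descriptive exception may suit a scraper better.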

Answer 1 (score: 0)

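The exact error is that, inside the foreach block, "item" is a variable of type HtmlNode, but you are trying to "index" it. Note also that Descendants takes an element name, not a class name, so the class has to be checked separately. Instead, you should use either

item.Descendants("span").First(n => n.GetAttributeValue("class", "") == "symbol")

or

item.SelectSingleNode(".//span[@class='symbol']");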

Or you can use this code:

    // Requires: using System.Collections.Generic; using HtmlAgilityPack;
    var document = new HtmlWeb();
    var root = document.Load(url); // url: address of the page to scrape
    var data = new List<Item>();
    // Note: SelectNodes returns null when no node matches.
    foreach (var item in root.DocumentNode.SelectNodes("//dl")) {
        var name = item.SelectSingleNode(".//a[@class='item-name']").InnerText;
        var price = item.SelectSingleNode(".//span[@class='price']").InnerText;
        var symbol = item.SelectSingleNode(".//span[@class='symbol']").InnerText;
        data.Add(new Item { Name = name, Price = price, Symbol = symbol });
    }
    public class Item {
        public string Name;
        public string Price; // InnerText is a string; parse it if a number is needed
        public string Symbol;
    }
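
The XPath test @class='symbol' is an exact string comparison, so it would miss an attribute such as class="item " (trailing space, as on the dl elements in the question) or one with several class names. Here is a minimal sketch of the XPath approach that matches a single class token and converts the price to a number. PricedItem, XPathExtraction and HasClass are hypothetical names chosen for this example, and it assumes prices use a dot as the decimal separator:

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public class PricedItem // hypothetical type for this sketch
{
    public string Name;
    public decimal Price;
    public string Symbol;
}

public static class XPathExtraction
{
    // Builds an XPath 1.0 predicate that matches one class token even when the
    // attribute holds several tokens or stray whitespace, e.g. class="item ".
    private static string HasClass(string token)
    {
        return "contains(concat(' ', normalize-space(@class), ' '), ' " + token + " ')";
    }

    public static List<PricedItem> Parse(HtmlDocument doc)
    {
        var result = new List<PricedItem>();
        HtmlNodeCollection items =
            doc.DocumentNode.SelectNodes("//dl[" + HasClass("item") + "]");
        if (items == null)
            return result; // SelectNodes returns null when nothing matches

        foreach (HtmlNode item in items)
        {
            HtmlNode name = item.SelectSingleNode(".//a[" + HasClass("item-name") + "]");
            HtmlNode price = item.SelectSingleNode(".//span[" + HasClass("price") + "]");
            HtmlNode symbol = item.SelectSingleNode(".//span[" + HasClass("symbol") + "]");
            if (name == null || price == null || symbol == null)
                continue; // incomplete item, skip it

            decimal amount;
            if (!decimal.TryParse(price.InnerText.Trim(), NumberStyles.Number,
                                  CultureInfo.InvariantCulture, out amount))
                continue; // price text is not a plain number

            result.Add(new PricedItem
            {
                Name = name.InnerText.Trim(),
                Price = amount,
                Symbol = symbol.InnerText.Trim()
            });
        }
        return result;
    }

    public static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<dl class='item '><dd class='detail'><a class='item-name'>iPad</a>"
                   + "<span class='symbol'>USD</span><span class='price'>379</span></dd></dl>");
        foreach (PricedItem p in Parse(doc))
            Console.WriteLine(p.Name + ": " + p.Symbol + " " + p.Price);
    }
}
```

For the live page, the document would come from new HtmlWeb().Load(url) as in the answer above instead of LoadHtml.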