C# HtmlAgilityPack HTML parsing problem

Date: 2013-06-14 12:49:37

Tags: html-parsing

I have this HTML:

<div class="postrow firs">
        <h2 class="title icon">
            This is the title
        </h2>
        <div class="content">
            <div id="post_message_1668079">
                <blockquote class="postcontent restore ">
                <div>Category</div>
                                         <div>Authour: Kim</div>
                    line 1<br /> line2
                </blockquote>
            </div>
        </div>
    </div>      <div class="postrow">
        <h2 class="title icon">
            This is the title
        </h2>
        <div class="content">
            <div id="post_message_1668079">
                <blockquote class="postcontent restore ">
                <div>Category</div>
                    line 1<br /> line2
                </blockquote>
            </div>
        </div>
    </div>

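I want to extract the following from every div that has the class "postrow". Such a div may also carry another class, as in <div class="postrow first">. The "first" class does not matter to me; the class attribute only needs to start with "postrow".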

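  1. The title, inside the element with class "title"
  2. The content of ...
  3. The HTML from the "blockquote" tag, but without any of the div tags inside it

The code I tried: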

    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml("http://localhost/vanilla/");
    List<string> facts = new List<string>();
    foreach (HtmlNode li in doc.DocumentNode.SelectNodes("//div[@class='postrow']"))
    {
        facts.Add(li.InnerHtml);
        foreach (String s in facts)
        {
            textBox1.Text += s + "/n";
        }
    }
    

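1 answer:

Answer 0 (score: 1)

The problem with your code is that LoadHtml expects the HTML markup itself as a string, not a path or URL. This call therefore parses the literal text of the URL rather than the page behind it: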

doc.LoadHtml("http://localhost/vanilla/");

Instead, fetch the page first and pass the response body to LoadHtml:

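// Requires the System.IO and System.Net namespaces.
var request = (HttpWebRequest)WebRequest.Create("http://localhost/vanilla/");
string html;
using (var response = request.GetResponse())                         // returns a WebResponse, not a string
using (var reader = new StreamReader(response.GetResponseStream()))  // read the body as text
    html = reader.ReadToEnd();

doc.LoadHtml(html);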

Now iterate over the parsed HTML.
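
The iteration step is not spelled out in the answer, so here is a minimal sketch of one way it might look. It assumes the markup from the question: each post is a div whose class list contains "postrow", the title sits in an h2 with class "title", and the body is the blockquote with its direct child divs removed. The names PostExtractor and Extract are hypothetical and used only for illustration. The XPath pads the class attribute with spaces so that "postrow" is matched as a whole class name, which is one common way to handle extra classes such as "first".

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class PostExtractor
{
    // Returns one (title, body HTML) pair per div whose class list contains "postrow".
    public static List<KeyValuePair<string, string>> Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // html must be markup, not a URL

        var results = new List<KeyValuePair<string, string>>();

        // Match "postrow" as a whole class token, with or without other classes.
        HtmlNodeCollection rows = doc.DocumentNode.SelectNodes(
            "//div[contains(concat(' ', normalize-space(@class), ' '), ' postrow ')]");
        if (rows == null)
        {
            return results; // SelectNodes returns null when nothing matches
        }

        foreach (HtmlNode row in rows)
        {
            HtmlNode titleNode = row.SelectSingleNode(
                ".//h2[contains(concat(' ', normalize-space(@class), ' '), ' title ')]");
            HtmlNode quote = row.SelectSingleNode(".//blockquote");

            string title = titleNode == null ? "" : titleNode.InnerText.Trim();
            string body = "";
            if (quote != null)
            {
                // Remove the direct child divs (Category, Author) before reading the HTML.
                // ToList() copies the sequence so nodes can be removed while looping.
                foreach (HtmlNode div in quote.Elements("div").ToList())
                {
                    div.Remove();
                }
                body = quote.InnerHtml.Trim();
            }

            results.Add(new KeyValuePair<string, string>(title, body));
        }

        return results;
    }
}
```

In the form from the question, each pair returned by PostExtractor.Extract(html) could then be appended to textBox1.Text, using Environment.NewLine as the separator, since "/n" is written out as two literal characters rather than a line break.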