Reading multiple divs with HtmlAgilityPack

Asked: 2014-01-16 16:03:14

Tags: c#-4.0 web-scraping html-agility-pack

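I am trying to pull data from two different divs, but I only get data from the first one (Cities). I set my code up like the example on the wiki page, to select all li elements that follow the h2 headings with id = Cities and id = Other_destinations:
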
var xpathData = "//h2[span/@id='Cities' or @id='Other_destinations']" + "/following-sibling::ul[1]" + "/li";

Then I write whatever is in each li to a text file.

private void button1_Click(object sender, EventArgs e)
{
    List<string> destinations = new List<string>();
    var xpathData = "//h2[span/@id='Cities' or @id='Other destinations']" + "/following-sibling::ul[1]" + "/li";

    WebClient web = new WebClient();
    String html = web.DownloadString("http://wikitravel.org/en/Germany");

    hap.HtmlDocument doc = new hap.HtmlDocument();
    doc.LoadHtml(html);

    using (StreamWriter write = new StreamWriter(@"C:\path\testText.txt"))
    {
        foreach (hap.HtmlNode node in doc.DocumentNode.SelectNodes(xpathData))
        {
            string all = node.InnerText;

            //Writes to text file
            write.WriteLine(all);
        }
    }
}

A note on 'hap': because of some odd naming conflict, I had to add the alias using hap = HtmlAgilityPack;

Thanks for any help, suggestions, or direction!

2 Answers:

Answer 0 (score: 0)

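The second ID in your original code is mistyped: the predicate tests @id on the h2 itself instead of on its child span, and in the full listing the value is written 'Other destinations' with a space rather than an underscore. The corrected expression: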

var xpathData = "//h2[span/@id='Cities' or span/@id='Other_destinations']" + "/following-sibling::ul[1]" +
                        "/li";
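
To make the difference concrete, here is a minimal sketch that runs both expressions against a small, well-formed stand-in for the page. It uses System.Xml rather than HtmlAgilityPack so that it runs without the package or a network call; the class name XPathIdDemo, the helper Count, and the sample list items are made up for illustration and are not taken from the real page.

```csharp
using System;
using System.Xml;

public static class XPathIdDemo
{
    // Hypothetical, well-formed stand-in for the page markup (not the real page).
    public const string Sample =
        "<div>" +
        "<h2><span id='Cities'/></h2><ul><li>Berlin</li><li>Hamburg</li></ul>" +
        "<h2><span id='Other_destinations'/></h2><ul><li>Black Forest</li></ul>" +
        "</div>";

    // Counts the li nodes an XPath expression selects in the sample markup.
    public static int Count(string xpath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(Sample);
        return doc.SelectNodes(xpath).Count;
    }

    public static void Main()
    {
        const string tail = "/following-sibling::ul[1]/li";

        // @id is tested on the h2 itself, so the second heading never matches.
        Console.WriteLine(Count("//h2[span/@id='Cities' or @id='Other_destinations']" + tail)); // 2

        // Both ids are tested on the child span, so both headings match.
        Console.WriteLine(Count("//h2[span/@id='Cities' or span/@id='Other_destinations']" + tail)); // 3
    }
}
```

In the sample the h2 elements carry no id attribute of their own, so only the span/@id form reaches the second heading. XPath string comparison is also case-sensitive and exact, so 'Other destinations' with a space would not match either.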

Here is the code I used:

var destinations = new List<string>();
var xpathData = "//h2[span/@id='Cities' or span/@id='Other_destinations']" + "/following-sibling::ul[1]" +
                        "/li";

var webClient = new WebClient();
var html = webClient.DownloadString("http://wikitravel.org/en/Germany");

// to control the encoding 
var doc = new HtmlDocument
{
    OptionDefaultStreamEncoding = Encoding.UTF8
};

doc.LoadHtml(html);

using (var write = new StreamWriter("testText.txt"))
{
   foreach (var node in doc.DocumentNode.SelectNodes(xpathData))
   {
       var all = node.InnerText;

       //Writes to text file
       write.WriteLine(all);
   }

}       

Answer 1 (score: 0)

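Updated with a working solution
The remaining problem is that some country pages have odd markup. Most of the divs are laid out like this: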

<h2>
<span id="cities"></span>
</h2>
<ul>
<li>...</li>
<li>...</li>
...
</ul>
<h2>
...
</h2>

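However, as I mentioned in the comments, the script was only pulling the first li from the Other_destinations div. What happens is that the current script looks only at the first ul after the heading and then at the li elements inside it. The markup on that particular country page looks like this: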

<h2>
<span id="Other_destinations"></span>
</h2>
<ul>
<li>...</li>
<li>...</li>
...
</ul>
<h2>
<span id="Get_in"></span>
</h2>

Updated working code

var xpathData = "//ul[preceding-sibling::h2[span/@id='Cities' or span/@id='Other_destinations'] and following-sibling::h2[span/@id='Get_in']]" + "/li";
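
As a rough illustration of how the sibling-range expression differs from the earlier following-sibling::ul[1] form, the sketch below runs both against a small stand-in document in which Other_destinations is followed by two ul elements. That layout is an assumption made for the example, not a copy of the real page. System.Xml is used in place of HtmlAgilityPack so the sketch runs on its own, and the names SiblingRangeDemo and Select and the single-letter items are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class SiblingRangeDemo
{
    // Hypothetical markup: the second section is split across two ul elements.
    public const string Sample =
        "<div>" +
        "<h2><span id='Cities'/></h2><ul><li>A</li><li>B</li></ul>" +
        "<h2><span id='Other_destinations'/></h2><ul><li>C</li></ul><ul><li>D</li><li>E</li></ul>" +
        "<h2><span id='Get_in'/></h2><ul><li>F</li></ul>" +
        "</div>";

    public const string FirstUlOnly =
        "//h2[span/@id='Cities' or span/@id='Other_destinations']/following-sibling::ul[1]/li";

    public const string SiblingRange =
        "//ul[preceding-sibling::h2[span/@id='Cities' or span/@id='Other_destinations'] " +
        "and following-sibling::h2[span/@id='Get_in']]/li";

    // Returns the text of every li selected by the given XPath expression.
    public static List<string> Select(string xpath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(Sample);
        var items = new List<string>();
        foreach (XmlNode li in doc.SelectNodes(xpath))
        {
            items.Add(li.InnerText);
        }
        return items;
    }

    public static void Main()
    {
        // Only the first ul after each heading is visited, so D and E are missed.
        Console.WriteLine(string.Join(",", Select(FirstUlOnly)));

        // Every ul between the two headings and the Get_in heading is visited.
        Console.WriteLine(string.Join(",", Select(SiblingRange)));
    }
}
```

With this sample the first expression selects A, B and C, while the second also picks up D and E and still leaves out F. One thing to keep in mind: the range form takes every ul that sits after either heading and before Get_in, so a page with an unrelated section in between would have its lists included too.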

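The updated XPath query only works for pulling these two sections from pages whose HTML is laid out as shown above. One important note: the downloaded text has to be decoded with the right encoding, otherwise a typographic dash (such as U+2014) is written to the text file as garbled characters. I added this encoding to the web client: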

var web = new WebClient();
web.Encoding = System.Text.Encoding.UTF8;
String html = string.Empty;
html = //get URL's

And this encoding for the document:

var doc = new hap.HtmlDocument
{
    OptionDefaultStreamEncoding = Encoding.UTF8
};

doc.LoadHtml(html);
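
The effect of the encoding setting can be reproduced without any download. The sketch below is only an illustration: it encodes a string containing a dash as UTF-8, the way a UTF-8 server would send it, and then decodes the bytes once with the wrong encoding and once with UTF-8. The class name EncodingDemo, the helper Decode, and the sample text are hypothetical, and ISO-8859-1 is just one example of a mismatched encoding.

```csharp
using System;
using System.Text;

public static class EncodingDemo
{
    // Decodes raw response bytes with the given encoding. This stands in for
    // the bytes-to-string step that happens when the page is downloaded.
    public static string Decode(byte[] raw, Encoding encoding)
    {
        return encoding.GetString(raw);
    }

    public static void Main()
    {
        // "\u2014" is a typographic dash; UTF-8 stores it as three bytes.
        const string original = "Berlin \u2014 capital";
        byte[] raw = Encoding.UTF8.GetBytes(original);

        string wrong = Decode(raw, Encoding.GetEncoding("iso-8859-1"));
        string right = Decode(raw, Encoding.UTF8);

        // The mismatched decode turns each of the three bytes into its own character.
        Console.WriteLine(wrong.Length - original.Length);
        Console.WriteLine(wrong == original);
        Console.WriteLine(right == original);
    }
}
```

Because LoadHtml receives a string that has already been decoded, the decoding that matters here most likely happens in the web client. As far as I can tell, OptionDefaultStreamEncoding applies when HtmlAgilityPack loads a document from a stream or file, so setting it is harmless but may not be what fixes the output in this flow.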