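I'm trying to extract data from two different divs, but I only get data from the first one (Cities). I set up my code using a wiki page as the example, where the H2 headings have id = Cities and id = Other_destinations:

    var xpathData = "//h2[span/@id='Cities' or @id='Other_destinations']" + "/following-sibling::ul[1]" + "/li";

Then I write whatever is inside each li to a text document.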
    private void button1_Click(object sender, EventArgs e)
    {
        List<string> destinations = new List<string>();
        var xpathData = "//h2[span/@id='Cities' or @id='Other destinations']" + "/following-sibling::ul[1]" + "/li";
        WebClient web = new WebClient();
        String html = web.DownloadString("http://wikitravel.org/en/Germany");
        hap.HtmlDocument doc = new hap.HtmlDocument();
        doc.LoadHtml(html);
        using (StreamWriter write = new StreamWriter(@"C:\path\testText.txt"))
        {
            foreach (hap.HtmlNode node in doc.DocumentNode.SelectNodes(xpathData))
            {
                string all = node.InnerText;
                //Writes to text file
                write.WriteLine(all);
            }
        }
    }
A note on 'hap': because of some odd naming conflict, I had to use the alias using hap = HtmlAgilityPack;.

Thanks for any help/suggestions/direction!
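Answer 0 (score: 0)

Your original code has a mistake in the second ID test: the id has to be checked on the span (span/@id, not @id on the h2), and it is spelled with an underscore, Other_destinations: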
    var xpathData = "//h2[span/@id='Cities' or span/@id='Other_destinations']" + "/following-sibling::ul[1]" + "/li";
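To see why the original predicate only returns the Cities items, here is a minimal sketch. It assumes a made-up, well-formed stand-in for the page markup (the list items are placeholders, not the real page content) and uses the built-in System.Xml XPath engine instead of HtmlAgilityPack so it runs without extra packages. The class and method names are hypothetical.

```csharp
using System;
using System.Xml;

static class XPathPredicateDemo
{
    // Hypothetical, simplified stand-in for the wiki page markup.
    const string Sample = @"<body>
  <h2><span id='Cities'>Cities</span></h2>
  <ul><li>Berlin</li><li>Hamburg</li></ul>
  <h2><span id='Other_destinations'>Other destinations</span></h2>
  <ul><li>Black Forest</li></ul>
</body>";

    // Counts the nodes selected by the given XPath expression in the sample.
    public static int CountItems(string xpath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(Sample);
        return doc.SelectNodes(xpath).Count;
    }

    static void Main()
    {
        // @id is tested on the h2 itself, which has no id attribute,
        // so only the Cities heading matches.
        Console.WriteLine(CountItems(
            "//h2[span/@id='Cities' or @id='Other_destinations']/following-sibling::ul[1]/li"));

        // Both tests look at the child span, so both headings match.
        Console.WriteLine(CountItems(
            "//h2[span/@id='Cities' or span/@id='Other_destinations']/following-sibling::ul[1]/li"));
    }
}
```

On this sample the first query selects 2 items and the second selects 3. The same XPath strings can be passed to SelectNodes in HtmlAgilityPack.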
Here is the code I used:
    var destinations = new List<string>();
    var xpathData = "//h2[span/@id='Cities' or span/@id='Other_destinations']" + "/following-sibling::ul[1]" +
                    "/li";
    var webClient = new WebClient();
    var html = webClient.DownloadString("http://wikitravel.org/en/Germany");
    // to control the encoding
    var doc = new HtmlDocument
    {
        OptionDefaultStreamEncoding = Encoding.UTF8
    };
    doc.LoadHtml(html);
    using (var write = new StreamWriter("testText.txt"))
    {
        foreach (var node in doc.DocumentNode.SelectNodes(xpathData))
        {
            var all = node.InnerText;
            //Writes to text file
            write.WriteLine(all);
        }
    }
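Answer 1 (score: 0)

Updated with a working solution

So the problem now is that some country pages have odd markup. Most sections are laid out like this: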
    <h2>
      <span id="Cities"></span>
    </h2>
    <ul>
      <li>...</li>
      <li>...</li>
      ...
    </ul>
    <h2>
      ...
    </h2>
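However, as I mentioned in the comments, it was only pulling the first li from the Other_destinations section. What happens is that the current script only looks at the first ul after the heading, and then at the li elements inside it. On that particular country page, the markup looks like this: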
    <h2>
      <span id="Other_destinations"></span>
    </h2>
    <ul>
      <li>...</li>
      <li>...</li>
      ...
    </ul>
    <h2>
      <span id="Get_in"></span>
    </h2>
Updated working code

    var xpathData = "//ul[preceding-sibling::h2[span/@id='Cities' or span/@id='Other_destinations'] and following-sibling::h2[span/@id='Get_in']]" + "/li";
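As a rough illustration of the difference between the two queries, the sketch below runs both against a made-up sample in which the Other_destinations section is split across two ul elements. The sample markup, the placeholder items A to F, and the class and member names are all assumptions for the demo. It uses System.Xml rather than HtmlAgilityPack so it runs as-is.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

static class SiblingRangeDemo
{
    // Hypothetical markup: the second section spans two <ul> elements,
    // and one more list follows the Get_in heading.
    const string Sample = @"<body>
  <h2><span id='Cities'/></h2>
  <ul><li>A</li><li>B</li></ul>
  <h2><span id='Other_destinations'/></h2>
  <ul><li>C</li></ul>
  <ul><li>D</li><li>E</li></ul>
  <h2><span id='Get_in'/></h2>
  <ul><li>F</li></ul>
</body>";

    // Earlier approach: only the first <ul> after each matching heading.
    public const string FirstUlOnly =
        "//h2[span/@id='Cities' or span/@id='Other_destinations']/following-sibling::ul[1]/li";

    // Updated approach: every <ul> after a matching heading and before Get_in.
    public const string AllUlsUntilGetIn =
        "//ul[preceding-sibling::h2[span/@id='Cities' or span/@id='Other_destinations'] " +
        "and following-sibling::h2[span/@id='Get_in']]/li";

    // Returns the text of every node selected by the XPath expression.
    public static List<string> Items(string xpath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(Sample);
        var result = new List<string>();
        foreach (XmlNode li in doc.SelectNodes(xpath))
        {
            result.Add(li.InnerText);
        }
        return result;
    }

    static void Main()
    {
        Console.WriteLine(Items(FirstUlOnly).Count);      // misses D and E
        Console.WriteLine(Items(AllUlsUntilGetIn).Count); // includes D and E, excludes F
    }
}
```

One thing to keep in mind with this approach: it picks up every ul between the Cities heading and the Get_in heading, so any other section that sits in between would be included as well.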
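The updated XPath query is only meant to pull those two sections from pages that use the HTML layout above. One important note: the text has to be decoded as UTF-8, otherwise characters such as dashes come out garbled in the text file. I added this encoding for the web client: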
    var web = new WebClient();
    web.Encoding = System.Text.Encoding.UTF8;
    String html = string.Empty;
    html = //get URL's
And this encoding for the document:
    var doc = new hap.HtmlDocument
    {
        OptionDefaultStreamEncoding = Encoding.UTF8
    };
    doc.LoadHtml(html);
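For anyone wondering what the garbling looks like, here is a small sketch of the effect. It assumes the page bytes are UTF-8 and get decoded with a single-byte encoding such as ISO-8859-1. Which encoding WebClient actually falls back to depends on the runtime and the response headers, so treat this as an illustration rather than a description of what happened on my machine. The helper name is made up.

```csharp
using System;
using System.Text;

static class EncodingDemo
{
    // Encodes the text as UTF-8 bytes, then decodes those bytes with the given encoding.
    public static string DecodeAs(string text, Encoding encoding)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        return encoding.GetString(utf8Bytes);
    }

    static void Main()
    {
        Console.OutputEncoding = Encoding.UTF8;
        var latin1 = Encoding.GetEncoding("iso-8859-1");

        // "M\u00FCnchen" is "München"; the u-umlaut is two bytes in UTF-8.
        Console.WriteLine(DecodeAs("M\u00FCnchen", latin1));        // each byte becomes its own character
        Console.WriteLine(DecodeAs("M\u00FCnchen", Encoding.UTF8)); // round-trips unchanged
    }
}
```

Setting web.Encoding to UTF-8 before calling DownloadString is what keeps the second behavior.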