Question

I'm trying to write a small program that reads the content of a Wikipedia page. To get the HTML, I'm using this code, which I found elsewhere on SO:

        HtmlDocument doc = new HtmlDocument();
        StringBuilder output = new StringBuilder();

        doc.LoadHtml("http://en.wikipedia.org/wiki/The Metamorphosis of Prime Intellect");
        var text = doc.DocumentNode.SelectNodes("//body//text()").Select(node => node.InnerText);

        foreach (string line in text)
            output.AppendLine(line);

        string textOnly = HttpUtility.HtmlDecode(output.ToString());

        Console.WriteLine(textOnly);

However, I get the runtime error "ArgumentNullException was unhandled", with this line highlighted:

        var text = doc.DocumentNode.SelectNodes("//body//text()").Select(node => node.InnerText);

Can anyone see what the problem is?

Answer 1

doc.LoadHtml expects an HTML string, not a URL. To download the page, you can use the HtmlAgilityPack.HtmlWeb class:

var web = new HtmlAgilityPack.HtmlWeb();
var doc = web.Load("http://en.wikipedia.org/wiki/The Metamorphosis of Prime Intellect");

var text = doc.DocumentNode.SelectNodes("//body//text()").Select(node => node.InnerText);
var output = String.Join("\n", text);

In my test, SelectNodes returned 622 items.
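
As a side note on why the original code threw: when LoadHtml is handed the URL text itself, the parsed document has no body element, so SelectNodes finds nothing and returns null rather than an empty collection, and the LINQ Select call then throws ArgumentNullException. Below is a minimal sketch of a null guard, assuming the HtmlAgilityPack package is referenced; the TextExtractor class and BodyText method are hypothetical names used only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

static class TextExtractor
{
    // Hypothetical helper: returns the text nodes under <body>,
    // or an empty sequence when the XPath matches nothing.
    public static IEnumerable<string> BodyText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var nodes = doc.DocumentNode.SelectNodes("//body//text()");

        return nodes == null
            ? Enumerable.Empty<string>()
            : nodes.Select(node => node.InnerText);
    }
}
```

With this guard, passing a URL string by mistake yields an empty result instead of an exception, which makes the real problem (no HTML was downloaded) easier to spot.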

Answer 2

You need to do the download yourself.

For example, you can use the WebClient class from the System.Net namespace:

var pageUri = new Uri("http://en.wikipedia.org/wiki/The Metamorphosis of Prime Intellect");
var wc = new WebClient();
var html = wc.DownloadString(pageUri);

//Then do
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);

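There is also the HttpClient class, if you prefer.

The advantage of these over HtmlWeb is that you can use them with EAP (the event-based asynchronous pattern) and C# 5 async operations.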
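
For the async route, here is a rough sketch using HttpClient, assuming .NET 4.5 or later and a reference to HtmlAgilityPack. The PageLoader class and its Parse and LoadAsync methods are hypothetical names, not part of either library.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack;

static class PageLoader
{
    // A single shared HttpClient is reused across requests.
    private static readonly HttpClient Client = new HttpClient();

    // Parses an HTML string that has already been downloaded.
    public static HtmlDocument Parse(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        return doc;
    }

    // Downloads the page without blocking the caller, then parses it.
    public static async Task<HtmlDocument> LoadAsync(Uri pageUri)
    {
        string html = await Client.GetStringAsync(pageUri);
        return Parse(html);
    }
}
```

From an async method you would call it as `var doc = await PageLoader.LoadAsync(pageUri);` and then run the same SelectNodes query as before. Some servers may reject requests that carry no User-Agent header, so you might need to set one on the client.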

C# - Html Agility Pack - Can't read from the web
