Question

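I just want to download the content of a website. What is the best way to do that? I tried WebClient, but with it I also get all the tags. I only want the content.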

Here is my code:

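 using System;
 using System.Net;
 using System.Text;

 WebClient w = new WebClient();

 // Using DownloadString: returns the raw HTML, tags included
 string s = w.DownloadString("http://en.wikipedia.org/wiki/Main_Page");
 Console.WriteLine(s);

 // Using DownloadData: same markup as bytes. Wikipedia serves UTF-8,
 // so decode with UTF8 (ASCII would mangle non-ASCII characters).
 byte[] downloadedData = w.DownloadData("http://en.wikipedia.org/wiki/Main_Page");
 string data = Encoding.UTF8.GetString(downloadedData);
 Console.WriteLine(data);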

Any suggestions?

Answer 1

I take it you want to strip the tags from the downloaded HTML and keep only the text of the page?

For that purpose I have a static class (found on Stack Overflow):

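using System.Text.RegularExpressions;

public static class StringExtensions
{
    // Removes anything that looks like an HTML tag; the text between tags is kept.
    public static string StripHTML(this string htmlString)
    {
        if (string.IsNullOrEmpty(htmlString)) return htmlString;

        // Lazily match "<", any characters (including newlines), then ">".
        string pattern = @"<(.|\n)*?>";

        return Regex.Replace(htmlString, pattern, string.Empty);
    }
}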

You can use it like this:

string s = SomeDownloadFunction("http://en.wikipedia.org/wiki/Main_Page");
string content = s.StripHTML();
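
One caveat with a tag-only regex is that the contents of script and style elements, as well as entities such as &amp;, remain in the output. The following is a minimal sketch of one way to handle that, not part of the original answer: the class name HtmlText and the method ToPlainText are hypothetical, and regex-based cleanup is only an approximation that can fail on malformed markup.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlText
{
    // Hypothetical helper (an assumption for illustration, not from the answer).
    public static string ToPlainText(string html)
    {
        if (string.IsNullOrEmpty(html)) return html;

        // Drop <script> and <style> elements together with their contents;
        // stripping only the tags would leave their code in the result.
        string s = Regex.Replace(html, @"<(script|style)\b[^>]*>.*?</\1\s*>", " ",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        // Remove the remaining tags, leaving a space so adjacent words stay apart.
        s = Regex.Replace(s, @"<[^>]*>", " ");

        // Turn entities such as &amp; back into plain characters.
        s = WebUtility.HtmlDecode(s);

        // Collapse runs of whitespace into single spaces.
        return Regex.Replace(s, @"\s+", " ").Trim();
    }
}
```

It could be combined with the download step from the question, for example HtmlText.ToPlainText(new WebClient().DownloadString(url)), where url is whatever page you are fetching.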

Answer 2

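Stripping tags is easy enough with a regex, but retrieving only the actual content of a page (ignoring ads, navigation bars, and so on) is a much harder task. Fortunately, some very clever people have been happy to share their research in this area. Take a look at boilerpipe, which also has a demo.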

Download the content of a specified URL
