Question

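C# newbie here, but I've used Java for years. I tried googling this and got a few answers that weren't quite what I need. I'd like to fetch the (X)HTML from a website and then use the DOM (actually, CSS selectors would be preferable, but whatever works) to get a particular element. How exactly is this done in C#?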

Answer 1

To fetch the HTML, you can use the WebClient object.

To parse the HTML, you can use the HTML Agility Pack library.

Answer 2

        using System.IO;
        using System.Net;
        using System.Text;

        // used to build the entire response text
        StringBuilder sb = new StringBuilder();

        // buffer used while reading from the response stream
        byte[] buf = new byte[8192];

        // prepare the web page we will be asking for
        HttpWebRequest request = (HttpWebRequest)
            WebRequest.Create("http://www.stackoverflow.com");

        // execute the request
        HttpWebResponse response = (HttpWebResponse)request.GetResponse();

        // we will read data via the response stream
        Stream resStream = response.GetResponseStream();

        string tempString = null;
        int count = 0;
        do
        {
            // fill the buffer with data
            count = resStream.Read(buf, 0, buf.Length);

            // make sure we read some data
            if (count != 0)
            {
                // translate from bytes to ASCII text
                // (non-ASCII characters are not preserved by this encoding)
                tempString = Encoding.ASCII.GetString(buf, 0, count);

                // continue building the string
                sb.Append(tempString);
            }
        }
        while (count > 0); // any more data to read?

Then use an XQuery expression or a Regex to grab the elements you need.
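
As a rough illustration of the Regex route, here is a minimal sketch that pulls the text of the <title> element out of an HTML string. The names TitleExtractor and ExtractTitle and the sample markup are made up for this example. A regular expression like this is only reasonable for very simple, predictable markup; it is not a general HTML parser.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, named for illustration only.
public static class TitleExtractor
{
    // Returns the trimmed text inside the first <title> element,
    // or an empty string if no <title> element is found.
    public static string ExtractTitle(string html)
    {
        Match match = Regex.Match(
            html,
            @"<title[^>]*>(.*?)</title>",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        return match.Success ? match.Groups[1].Value.Trim() : string.Empty;
    }
}
```

For instance, TitleExtractor.ExtractTitle("<html><head><title> Example </title></head></html>") returns "Example". The string built up in sb above could be passed in the same way via sb.ToString().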

Answer 3

You can use System.Net.WebClient or System.Net.HttpWebRequest to fetch the page, but neither class supports parsing out elements.

Use HtmlAgilityPack (http://html-agility-pack.net/)

using HtmlAgilityPack;

string url = "http://www.stackoverflow.com"; // example URL, substitute your own

HtmlWeb htmlWeb = new HtmlWeb();
htmlWeb.UseCookies = true;

HtmlDocument htmlDocument = htmlWeb.Load(url);

// after getting the document node
// you can do something like this
foreach (HtmlNode item in htmlDocument.DocumentNode.Descendants("input"))
{
    // item matches your requirement
    // take the item.
}

Answer 4

I hear you want to use HtmlAgilityPack to work with HTML files. It gives you LINQ access, which is A Good Thing(tm). You can download the file with System.Net.WebClient.
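
To make the LINQ part concrete, here is a small sketch that downloads a page with WebClient and uses LINQ over HtmlAgilityPack nodes to collect link targets. It assumes the HtmlAgilityPack NuGet package is referenced; the names LinkScraper, ExtractLinks and ExtractLinksFromUrl are invented for this example.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Net;
using HtmlAgilityPack; // third-party NuGet package, not part of .NET itself

// Hypothetical helper, named for illustration only.
public static class LinkScraper
{
    // Parses an HTML string and returns the href value of every <a>
    // element that has a non-empty href attribute.
    public static List<string> ExtractLinks(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        return document.DocumentNode
            .Descendants("a")
            .Select(a => a.GetAttributeValue("href", ""))
            .Where(href => href.Length > 0)
            .ToList();
    }

    // Downloads the page with WebClient, then hands the markup to the parser.
    public static List<string> ExtractLinksFromUrl(string url)
    {
        using (var client = new WebClient())
        {
            return ExtractLinks(client.DownloadString(url));
        }
    }
}
```

Keeping the parsing step separate from the download means the same method works on markup from any source, such as a file on disk or a string in a unit test.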

Answer 5

You can use the Html Agility Pack to load the HTML and find the elements you need.

Answer 6

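To get you started, you can fairly easily use HttpWebRequest to fetch the contents of a URL. From there, you'll have to do something to parse the HTML. That's where it starts to get tricky. You can't use a normal XML parser, because many (most?) web pages are not 100% valid XML. Web browsers have specially implemented parsers that work around the invalid parts. In Ruby, I'd use something like Nokogiri to parse the HTML, so you might want to look for a .NET port of it, or another parser designed specifically to read HTML.

Edit:

Since the topic is likely to come up: WebClient vs. HttpWebRequest/HttpWebResponse

Also, thanks to the others who pointed out HtmlAgility. I didn't know it existed.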
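
The point about XML parsers can be seen with a short sketch using the built-in XmlDocument class. The class name XmlStrictnessDemo and the sample markup are made up for this example.

```csharp
using System;
using System.Xml;

// Hypothetical helper, named for illustration only.
public static class XmlStrictnessDemo
{
    // Returns true if the markup loads as well-formed XML, false otherwise.
    public static bool IsWellFormedXml(string markup)
    {
        try
        {
            new XmlDocument().LoadXml(markup);
            return true;
        }
        catch (XmlException)
        {
            // A strict XML parser rejects the whole document on the first error.
            return false;
        }
    }
}
```

Here IsWellFormedXml("<p>one<br>two</p>") returns false, because the unclosed <br> that browsers happily accept is an error in XML, while IsWellFormedXml("<p>one<br/>two</p>") returns true.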

Answer 7

Use the Html Agility Pack; it is one of the more common libraries for parsing HTML.

http://htmlagilitypack.codeplex.com/

Scraping content from a website in C#
