Question

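I am trying to parse the HTML of a website such as CNN.com, but whenever I navigate with a WebBrowser object, I get null or empty values back. Right after I call Navigate, mywebBrowser contains only null and empty members. How do I populate tagCollection? I also tried webClient.DownloadString just to get the full HTML of the page, but I can't use that on its own: I need to find all the tags, and doing that by hand on a raw string is very messy. I am not using HTML Agility Pack and cannot use it.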

        using (WebClient webClient = new WebClient())
        {
            webClient.Encoding = Encoding.UTF8;
            HtmlString = webClient.DownloadString(textBox1.Text);
        }

        WebBrowser mywebBrowser = new WebBrowser();
        Uri address = new Uri("http://www.cnn.com/");
        mywebBrowser.Navigate(address);

        //HtmlString does contain all the HTML from Page
        mywebBrowser.DocumentText = HtmlString; 
        //DocumentText only has "<HTML></HTML>" after assignment


        HtmlDocument doc = mywebBrowser.Document;
        HtmlElementCollection tagCollection;
        tagCollection = doc.GetElementsByTagName("<div");

Answer 1

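The WebBrowser class lets you do a lot without relying on any external library. What you are missing is the DocumentCompleted event, which is part of the basic WebBrowser API: until it fires, the page has not finished loading, so the document's contents are unreliable (or null). Also remember that GetElementsByTagName takes only the tag name, without the "<". Sample code: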

 WebBrowser mywebBrowser;
 private void Form1_Load(object sender, EventArgs e)
 {
     mywebBrowser = new WebBrowser();
     mywebBrowser.DocumentCompleted += new WebBrowserDocumentCompletedEventHandler(mywebBrowser_DocumentCompleted);

     Uri address = new Uri("http://www.cnn.com/");
     mywebBrowser.Navigate(address);
 }

 private void mywebBrowser_DocumentCompleted(Object sender, WebBrowserDocumentCompletedEventArgs e)
 {
     // The page is not completely loaded until this event fires
     HtmlDocument doc = mywebBrowser.Document;
     HtmlElementCollection tagCollection;
     tagCollection = doc.GetElementsByTagName("div");
 }
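
Once the handler runs, tagCollection can be walked like any other collection. The following is only a minimal sketch of that step, not part of the original answer: the ReadyState guard is an assumption on my part (DocumentCompleted may fire more than once on pages with frames, and this check is a common heuristic rather than a guarantee), and the Debug.WriteLine output is just a placeholder for whatever you want to do with each element.

```csharp
using System;
using System.Diagnostics;
using System.Windows.Forms;

public class Form1 : Form
{
    private WebBrowser mywebBrowser;

    public Form1()
    {
        Load += Form1_Load;
    }

    private void Form1_Load(object sender, EventArgs e)
    {
        mywebBrowser = new WebBrowser();
        mywebBrowser.DocumentCompleted += mywebBrowser_DocumentCompleted;
        mywebBrowser.Navigate(new Uri("http://www.cnn.com/"));
    }

    private void mywebBrowser_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
    {
        // Assumption: ignore firings that arrive before the browser reports
        // the whole page as complete (for example, for individual frames).
        if (mywebBrowser.ReadyState != WebBrowserReadyState.Complete)
            return;

        HtmlDocument doc = mywebBrowser.Document;
        if (doc == null)
            return;

        // Tag name only, without the "<".
        HtmlElementCollection tagCollection = doc.GetElementsByTagName("div");
        Debug.WriteLine("div count: " + tagCollection.Count);

        foreach (HtmlElement div in tagCollection)
        {
            // Id is null or empty when the element has no id attribute.
            string id = string.IsNullOrEmpty(div.Id) ? "(no id)" : div.Id;
            Debug.WriteLine(id + ": " + div.TagName);
        }
    }

    [STAThread]
    public static void Main()
    {
        Application.Run(new Form1());
    }
}
```

The same loop works for any other tag name passed to GetElementsByTagName; properties such as InnerText or OuterHtml on HtmlElement give you the content of each match.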

How to parse a website's HTML content
