Question

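Until now I have had a single-threaded application for crawling a website. Because I want to make it faster, I tried rebuilding it as a multithreaded application. This is what I did:
I have a Crawler class that contains a WebBrowser object. This is how I start the thread: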

 Crawler c1 = new Crawler();
 Thread t1 = new Thread(new ThreadStart(c1.Crawl));
 t1.SetApartmentState(ApartmentState.STA);
 t1.Start();

The thread reaches this function:

 void LogIn(bool isInit)
 {
   browser = new WebBrowser();
   NavigateAndWaitForLoad(browser, "http://www.someurl.com", 1000);
   HtmlElement elemEmail = (HtmlElement)browser.Document.GetElementById("email");
 }



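 void NavigateAndWaitForLoad(WebBrowser wb, string link, int waitTime)
 {
   wb.Navigate(link);
   int count = 0;
   // sleepTimeMiliseconds is not declared in the question; presumably a field of the class.
   while (wb.ReadyState != WebBrowserReadyState.Complete)
   {
     Thread.Sleep(sleepTimeMiliseconds);
     Application.DoEvents();
     count++;
     // Give up once roughly waitTime milliseconds have elapsed.
     if (count > waitTime / sleepTimeMiliseconds)
         break;
   }
 }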

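In the single-threaded version this works fine. In the multithreaded application, however, it crashes on this line: HtmlElement elemEmail = (HtmlElement)browser.Document.GetElementById("email");
It throws an InvalidCastException!
Any idea why?
Please help...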

Answer 1

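You're using a WebBrowser object, Application.DoEvents, and Thread.Sleep. Bad, bad, bad. You're asking for trouble here.

Suggestion:

If you're just building a web crawler, simply use WebClient to download the web page as a string. Then, if you need to parse it as an HTML document, use HtmlAgilityPack.

This way you avoid the WebBrowser UI control, you avoid Thread.Sleep, and you avoid Application.DoEvents, which can cause unexpected reentrancy.

Here's an example: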

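// Requires: using System; using System.Net; using System.Threading.Tasks;
public async Task DownloadWebPage(string address)
{
    using (var webClient = new WebClient())
    {
        // DownloadStringTaskAsync needs the address to download from.
        var webPageContents = await webClient.DownloadStringTaskAsync(address);

        // Woohoo, we have the contents of the web page. Do anything with it...
        Console.WriteLine(webPageContents);
    }
}

// Usage (from inside an async method):
await DownloadWebPage("http://www.google.com");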
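
To tie the two suggestions together, here is a rough sketch of how several pages might be downloaded concurrently and then searched for the "email" element from the question. It assumes the HtmlAgilityPack NuGet package is installed and a compiler that supports async Main (C# 7.1 or later). The class name CrawlSketch, the helper names DownloadAsync and FindById, and the list of addresses are made up for illustration; they are not part of the original code.

```csharp
using System;
using System.Linq;
using System.Net;
using System.Threading.Tasks;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is referenced

public static class CrawlSketch // hypothetical name, for illustration only
{
    // Downloads one page as a string. Each call creates its own WebClient,
    // because a single WebClient instance cannot run concurrent operations.
    public static async Task<string> DownloadAsync(string address)
    {
        using (var webClient = new WebClient())
        {
            return await webClient.DownloadStringTaskAsync(address);
        }
    }

    // Parses the HTML text and returns the element with the given id,
    // or null when no such element exists.
    public static HtmlNode FindById(string html, string id)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);
        return document.GetElementbyId(id); // note the lowercase "b" in the method name
    }

    public static async Task Main()
    {
        // Placeholder addresses; substitute the pages you actually want to crawl.
        string[] addresses = { "http://www.someurl.com", "http://www.google.com" };

        // Start every download first, then wait for all of them to complete.
        // Results come back in the same order as the addresses.
        string[] pages = await Task.WhenAll(addresses.Select(a => DownloadAsync(a)));

        for (int i = 0; i < pages.Length; i++)
        {
            HtmlNode email = FindById(pages[i], "email");
            Console.WriteLine(email == null
                ? addresses[i] + ": no element with id 'email'"
                : addresses[i] + ": found <" + email.Name + "> with id 'email'");
        }
    }
}
```

No extra threads, STA apartments, or message pumping are needed here, since the downloads overlap while each one is awaited. Note that Task.WhenAll throws if any single download fails, so a real crawler would likely want a try/catch around each download.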

WebBrowser multithreaded casting exception
