ASP.NET Crawler WebResponse Operation Timed Out

Date: 2010-05-18 05:46:27

Tags: c# asp.net web-crawler httpwebresponse

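Hi, I have built a simple thread-pool-based web crawler inside my web application. Its job is to crawl its own application space and build a Lucene index of every valid web page and its meta content. Here is the problem: when I run the crawler from a Visual Studio Express debug server instance and give it the IIS URL as the starting point, it works fine. However, when I do not give it the IIS instance and it starts the crawl from its own URL (that is, it crawls its own domain space), I get an "operation timed out" exception on the WebResponse statement. Could someone guide me on what I should or should not be doing here? Here is my code for fetching a page. It runs in a multithreaded environment.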

private static string GetWebText(string url)
{
    string htmlText = "";

    // Build the request; the cast is valid for http/https URLs.
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
    request.UserAgent = "My Crawler";

    // GetResponse() blocks the calling thread until the server answers
    // or the request times out; this is the call that throws.
    using (WebResponse response = request.GetResponse())
    using (Stream stream = response.GetResponseStream())
    using (StreamReader reader = new StreamReader(stream))
    {
        htmlText = reader.ReadToEnd();
    }
    return htmlText;
}

Here is my stack trace:

at System.Net.HttpWebRequest.GetResponse()
   at CSharpCrawler.Crawler.GetWebText(String url) in c:\myAppDev\myApp\site\App_Code\CrawlerLibs\Crawler.cs:line 366
   at CSharpCrawler.Crawler.CrawlPage(String url, List`1 threadCityList) in c:\myAppDev\myApp\site\App_Code\CrawlerLibs\Crawler.cs:line 105
   at CSharpCrawler.Crawler.CrawlSiteBuildIndex(String hostUrl, String urlToBeginSearchFrom, List`1 threadCityList) in c:\myAppDev\myApp\site\App_Code\CrawlerLibs\Crawler.cs:line 89
   at crawler_Default.threadedCrawlSiteBuildIndex(Object threadedCrawlerObj) in c:\myAppDev\myApp\site\crawler\Default.aspx.cs:line 108
   at System.Threading.QueueUserWorkItemCallback.WaitCallback_Context(Object state)
   at System.Threading.ExecutionContext.runTryCode(Object userData)
   at System.Runtime.CompilerServices.RuntimeHelpers.ExecuteCodeWithGuaranteedCleanup(TryCode code, CleanupCode backoutCode, Object userData)
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean ignoreSyncCtx)
   at System.Threading.QueueUserWorkItemCallback.System.Threading.IThreadPoolWorkItem.ExecuteWorkItem()
   at System.Threading.ThreadPoolWorkQueue.Dispatch()
   at System.Threading._ThreadPoolWaitCallback.PerformWaitCallback()

Thanks and cheers, Leon.

1 answer:

Answer 0 (score: 0)

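How many concurrent requests is your crawler making? You could easily be starving the thread pool, especially since the crawler runs inside the website's own code.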
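One way to check for starvation is to log how many pool threads are busy while the crawl runs. The following is a minimal sketch, not part of the original code: `ThreadPoolProbe` and `BusyWorkerThreads` are hypothetical names, and it assumes that both the crawler work items and the site's request handling draw on the same CLR thread pool.

```csharp
using System;
using System.Threading;

public static class ThreadPoolProbe
{
    // Returns how many worker threads of the CLR thread pool are
    // currently in use (pool maximum minus currently available).
    public static int BusyWorkerThreads()
    {
        int maxWorkers, maxIo, freeWorkers, freeIo;
        ThreadPool.GetMaxThreads(out maxWorkers, out maxIo);
        ThreadPool.GetAvailableThreads(out freeWorkers, out freeIo);
        return maxWorkers - freeWorkers;
    }
}
```

If the busy count climbs toward the pool maximum just before the timeouts appear, starvation is a likely cause.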

Each request your crawler makes uses 2 threads from the pool: one to process the request, and another to wait for the response.
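If starvation turns out to be the cause, one possible mitigation is to cap how many crawler requests are in flight at once, so that threads remain free to serve them. This is only a sketch under stated assumptions: `ThrottledFetcher`, `MaxConcurrentRequests`, and `FreeSlots` are hypothetical names, the limit of 2 is an arbitrary starting value to tune, and `SemaphoreSlim` requires .NET 4 or later.

```csharp
using System;
using System.IO;
using System.Net;
using System.Threading;

public static class ThrottledFetcher
{
    // Assumed limit; tune it against the size of the thread pool.
    public const int MaxConcurrentRequests = 2;

    private static readonly SemaphoreSlim Gate =
        new SemaphoreSlim(MaxConcurrentRequests, MaxConcurrentRequests);

    // Number of request slots that are currently unused.
    public static int FreeSlots
    {
        get { return Gate.CurrentCount; }
    }

    public static string GetWebText(string url)
    {
        // Block here until a slot is free, instead of piling up
        // requests that the site has no free thread to serve.
        Gate.Wait();
        try
        {
            HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
            request.UserAgent = "My Crawler";

            using (WebResponse response = request.GetResponse())
            using (Stream stream = response.GetResponseStream())
            using (StreamReader reader = new StreamReader(stream))
            {
                return reader.ReadToEnd();
            }
        }
        finally
        {
            // Always give the slot back, even when the request fails.
            Gate.Release();
        }
    }
}
```

The crawler threads still block while they wait for a slot, so this limits the pressure on the pool rather than removing it; running the crawl outside the web application would avoid the contention entirely.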