Question

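I'm writing a crawler in C# that starts from a set of known URLs stored in a file. I want to pull the pages down asynchronously. My question is what the best pattern for this would be: read the file into a List/Array of URLs and create a second array to store the completed URLs? Should I create a two-dimensional array to track the state of each thread and whether it has completed? Other considerations are retries (if the first request is slow or dead) and automatic restarts (after an application or system crash).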

Answer 1

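using System;
using System.IO;
using System.Net;

// Start one asynchronous download per URL listed in urls.txt (one URL per line)
foreach (var url in File.ReadAllLines("urls.txt"))
{
    var client = new WebClient();
    client.DownloadStringCompleted += (sender, e) =>
    {
        if (e.Error == null)
        {
            // e.Result contains the downloaded HTML
        }
        else
        {
            // An error occurred: inspect the e.Error property
        }
        client.Dispose(); // release the client once its download has finished
    };
    // Returns immediately; the handler above runs when the download completes
    client.DownloadStringAsync(new Uri(url));
}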
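
The question also asks about retrying slow or dead requests. Below is a minimal sketch of one way to do that, assuming Task-based async/await (C# 6 or later) rather than the event-based API used above. `RetryingFetcher`, `FetchWithRetryAsync`, and the `fetch` delegate are hypothetical names introduced for illustration. The download itself is passed in as a delegate so the retry logic stays independent of the HTTP client.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical helper, not part of any library.
public static class RetryingFetcher
{
    // Calls fetch(url) up to maxAttempts times, pausing `delay` between attempts.
    // Returns the first successful result; if every attempt fails,
    // the exception from the last attempt propagates to the caller.
    public static async Task<string> FetchWithRetryAsync(
        Func<string, Task<string>> fetch, string url, int maxAttempts, TimeSpan delay)
    {
        if (maxAttempts < 1) throw new ArgumentOutOfRangeException(nameof(maxAttempts));

        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return await fetch(url);
            }
            catch (Exception) when (attempt < maxAttempts)
            {
                // Not the last attempt yet: wait, then loop around and try again.
                await Task.Delay(delay);
            }
        }
    }
}
```

With an `HttpClient` instance named `client`, the delegate could be `u => client.GetStringAsync(u)`. Setting `client.Timeout` should make a slow request fail with an exception, which the loop then treats like any other failed attempt.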

Answer 2

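I suggest pulling from a queue and fetching each URL in a separate thread, peeling URLs off the queue until you reach the maximum number of simultaneous threads you want to allow. Each thread invokes a callback method that reports whether it completed successfully or ran into a problem.

As you start each thread, put its ManagedThreadId into a Dictionary, with the id as the key and the thread status as the value. The callback method should return its id and completion status. When a thread completes, remove it from the Dictionary and start the next waiting thread. If it did not complete successfully, add its URL back to the queue.

The Dictionary's Count property tells you how many threads are in flight, and the callback can also be used to update your UI or check for a pause or halt signal. If you need to persist your results in case of a system crash, you should consider using database tables in place of memory-resident collections, as manitra describes.

This approach has worked very well for me with large numbers of simultaneous threads.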
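
The queue-plus-Dictionary scheme described in this answer might look roughly like the sketch below. It is an illustration under assumptions, not the answerer's actual code. `CrawlScheduler`, its members, the `fetch` delegate (returns true on success), and the `maxAttempts` cap are all hypothetical. The Dictionary value here is the URL being fetched rather than a status object.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Hypothetical sketch: one thread per URL, capped at maxThreads in flight.
public class CrawlScheduler
{
    private readonly object _gate = new object();
    private readonly Queue<string> _pending;
    // Key: ManagedThreadId of a running worker. Value: the URL it is fetching.
    private readonly Dictionary<int, string> _inFlight = new Dictionary<int, string>();
    private readonly Dictionary<string, int> _attempts = new Dictionary<string, int>();
    private readonly Func<string, bool> _fetch;
    private readonly int _maxThreads;
    private readonly int _maxAttempts;

    public List<string> Succeeded { get; } = new List<string>();
    public List<string> GivenUp { get; } = new List<string>();

    public CrawlScheduler(IEnumerable<string> urls, Func<string, bool> fetch,
                          int maxThreads, int maxAttempts)
    {
        if (maxThreads < 1) throw new ArgumentOutOfRangeException(nameof(maxThreads));
        _pending = new Queue<string>(urls);
        _fetch = fetch;
        _maxThreads = maxThreads;
        _maxAttempts = maxAttempts;
    }

    // Blocks until the queue is empty and no worker is in flight.
    public void Run()
    {
        lock (_gate)
        {
            StartWorkers();
            while (_pending.Count > 0 || _inFlight.Count > 0)
                Monitor.Wait(_gate); // releases the lock while waiting
        }
    }

    // Must be called while holding _gate.
    private void StartWorkers()
    {
        while (_inFlight.Count < _maxThreads && _pending.Count > 0)
        {
            string url = _pending.Dequeue();
            var thread = new Thread(() => Work(url));
            _inFlight[thread.ManagedThreadId] = url; // register before starting
            thread.Start();
        }
    }

    private void Work(string url)
    {
        bool ok;
        try { ok = _fetch(url); }
        catch (Exception) { ok = false; } // treat any exception as a failed fetch
        OnCompleted(Thread.CurrentThread.ManagedThreadId, url, ok);
    }

    // The callback: reports the worker's id and completion status.
    private void OnCompleted(int threadId, string url, bool ok)
    {
        lock (_gate)
        {
            _inFlight.Remove(threadId);
            if (ok)
            {
                Succeeded.Add(url);
            }
            else
            {
                _attempts.TryGetValue(url, out int n); // n is 0 if the URL is new
                _attempts[url] = ++n;
                if (n < _maxAttempts) _pending.Enqueue(url); // put it back for a retry
                else GivenUp.Add(url);
            }
            StartWorkers();           // start the next waiting URL, if any
            Monitor.PulseAll(_gate);  // wake Run() so it can re-check its condition
        }
    }
}
```

A caller would construct it with the URL list and a fetch delegate, for example `new CrawlScheduler(urls, Download, 8, 3).Run()`, where `Download` is whatever method performs the request. The attempt cap is an addition to the answer's description. Without it, a permanently dead URL would be requeued forever.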

Answer 3

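Here is my opinion on storing the data

I suggest using a relational database to store the list of pages, because it makes several of your tasks easier:

Retrieving the pages to crawl (basically the N pages with the oldest LastSuccessfullCrawlDate)
Adding newly discovered pages
Marking a page as crawled (setting its LastSuccessfullCrawlDate)
If the program crashes, your data is already safe
You can add a column to store the number of retries, so that pages that have failed more than N times are discarded automatically

An example of the relational model: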

// this table would contain all the crawled pages
table Pages {
    Id bigint,
    Url nvarchar(2000),
    Created DateTime,
    LastSuccessfullCrawlDate DateTime,
    NumberOfRetry int,     // increment this when a failure occurs; if it reaches 10, set Ignored to True
    Title nvarchar(200),   // this is where you would put the page title
    Content nvarchar(max), // this is where you would put the HTML
    Ignored Bool           // set it to True to ignore this page
}
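
To make the operations listed above concrete, here is a minimal in-memory sketch that mirrors the Pages table as a C# class and uses LINQ where a real implementation would query the database. `Page`, `CrawlQueue`, and its methods are hypothetical names. Treating a null crawl date as "never crawled" and resetting the retry count after a success are assumptions, not part of the answer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Mirrors the columns of the Pages table above.
public class Page
{
    public long Id { get; set; }
    public string Url { get; set; }
    public DateTime Created { get; set; }
    public DateTime? LastSuccessfullCrawlDate { get; set; } // null: never crawled (assumption)
    public int NumberOfRetry { get; set; }
    public string Title { get; set; }
    public string Content { get; set; }
    public bool Ignored { get; set; }
}

// Hypothetical helper; each method stands in for one SQL statement.
public static class CrawlQueue
{
    public const int MaxRetries = 10; // threshold taken from the schema comment

    // The n non-ignored pages with the oldest crawl date, never-crawled pages first.
    public static List<Page> NextBatch(IEnumerable<Page> pages, int n) =>
        pages.Where(p => !p.Ignored)
             .OrderBy(p => p.LastSuccessfullCrawlDate ?? DateTime.MinValue)
             .Take(n)
             .ToList();

    // Record a successful crawl.
    public static void MarkCrawled(Page page, string title, string content, DateTime now)
    {
        page.Title = title;
        page.Content = content;
        page.LastSuccessfullCrawlDate = now;
        page.NumberOfRetry = 0; // assumption: a success clears earlier failures
    }

    // Record a failed crawl; ignore the page once it has failed MaxRetries times.
    public static void MarkFailed(Page page)
    {
        page.NumberOfRetry++;
        if (page.NumberOfRetry >= MaxRetries) page.Ignored = true;
    }
}
```

Against a real database, `NextBatch` would correspond to a query that filters on Ignored, orders by LastSuccessfullCrawlDate, and takes the first N rows. The two `Mark` methods would correspond to UPDATE statements.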

You could also handle referers with a table structured like this:

// this table would contain the links between pages: the page ParentId links to the page ChildId
table Referer {
    ParentId bigint,
    ChildId bigint
}

It would let you implement your own page rank :p
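
As a rough illustration of what the Referer table makes possible, the sketch below counts the distinct pages linking to each page. This is only a crude ranking signal, not the PageRank algorithm. `LinkRank` and `InboundCounts` are hypothetical names, and each tuple stands in for one Referer row.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: ranks pages by how many distinct pages link to them.
public static class LinkRank
{
    // Each tuple is one Referer row: the page ParentId links to the page ChildId.
    // Returns, for every page that has at least one inbound link,
    // the number of distinct parents pointing at it.
    public static Dictionary<long, int> InboundCounts(
        IEnumerable<(long ParentId, long ChildId)> referers) =>
        referers.Distinct()                 // drop duplicate rows
                .GroupBy(r => r.ChildId)
                .ToDictionary(g => g.Key, g => g.Count());
}
```

Pages with no inbound links do not appear in the result, so a caller would treat a missing key as a count of zero.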
