I have a list of the Alexa top 1 million sites. I want to find out which of these 1 million sites have a page at www.domain.com/pageNameUrl. This is what I tried:
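foreach ($sites as $site) {
    // get_headers() returns false on failure; @ suppresses the warning
    $file_headers = @get_headers($site);
    if ($file_headers !== false && strpos($file_headers[0], "200 OK") !== false) {
        $exists = true;
        // save site name code...
    } else {
        $exists = false;
    }
}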
But this code takes far too long. Going through all the sites would take a month or even longer. Is there a faster way?
Answer 0 (score: 0)
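I don't think PHP is the right tool for this job. You might consider something like Node.js, which is well suited to asynchronous work. Take a look at this (example from https://npmjs.org/package/crawler):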
var Crawler = require("crawler").Crawler;
var c = new Crawler({
    // here you can define how many pages you want to fetch in parallel
    "maxConnections": 10,
    // this will be called for each crawled page
    "callback": function (error, result, $) {
        // mark this page as available or not based on the response
        console.log(result.statusCode);
    }
});
// Queue all your URLs in a loop; they will all be pushed asynchronously to the crawler job
c.queue("http://www.google.de");
c.queue("http://www.amazon.de");
c.queue("http://www.facebook.de");
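If you would rather not depend on the crawler package, the same idea (many requests in flight at once, with a cap on how many) can be sketched with only Node's built-in http and https modules. This is a rough sketch, not a tested crawler: the names headStatus and checkAll, the 5-second timeout, and the choice of HEAD requests are my own assumptions, and it assumes every entry is a full URL such as http://www.domain.com/pageNameUrl. Like the PHP in the question, it only counts a direct 200 as "exists", so redirects are treated as missing, and some servers may not answer HEAD correctly.

```javascript
const http = require("http");
const https = require("https");

// Hypothetical helper: send a HEAD request and resolve to the HTTP status
// code, or to null if the request fails or times out.
function headStatus(url, timeoutMs = 5000) {
  return new Promise((resolve) => {
    const lib = url.startsWith("https:") ? https : http;
    const req = lib.request(url, { method: "HEAD", timeout: timeoutMs }, (res) => {
      res.resume(); // discard any body so the socket is released
      resolve(res.statusCode);
    });
    req.on("timeout", () => req.destroy()); // destroy() triggers "error"
    req.on("error", () => resolve(null));
    req.end();
  });
}

// Hypothetical helper: check every URL with at most `concurrency` requests
// in flight. `check` can be swapped out (for example in tests).
async function checkAll(urls, concurrency, check = headStatus) {
  const results = new Array(urls.length);
  let next = 0; // index of the next URL nobody has picked up yet

  async function worker() {
    while (next < urls.length) {
      const i = next++;
      const status = await check(urls[i]);
      results[i] = { url: urls[i], exists: status === 200 };
    }
  }

  const workers = Array.from({ length: Math.min(concurrency, urls.length) }, worker);
  await Promise.all(workers);
  return results;
}
```

You would call it as checkAll(urls, 50).then(function (results) { ... }) and save the entries where exists is true. The concurrency value is something to tune for your own bandwidth and file descriptor limits; 50 is only a placeholder.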