I built this web scraper.
https://github.com/shoutweb/WebsiteCrawlerEmailExtractor
// Regular expression function that scans an individual page for emails
function get_emails_from_webpage($url)
{
    $text = @file_get_contents($url);
    if ($text === false) {
        return null;
    }
    $res = preg_match_all("/[a-z0-9]+[_a-z0-9\.-]*[a-z0-9]+@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,4})/i", $text, $matches);
    if ($res) {
        return array_unique($matches[0]);
    }
    return null;
}
// URL array
$URLArray = array();
// Inputted URL. Right now it is pulled from a GET variable, but you can alter this any way you want.
$inputtedURL = $_GET['url'];
// Crawl the inputted domain to collect its URLs
$urlContent = file_get_contents("http://" . urldecode($inputtedURL));
$dom = new DOMDocument();
@$dom->loadHTML($urlContent);
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");
$scrapedEmails = array();
for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $url = $href->getAttribute('href');
    $url = filter_var($url, FILTER_SANITIZE_URL);
    // Keep only valid URLs that stay on the inputted domain
    if (filter_var($url, FILTER_VALIDATE_URL) !== false) {
        if (strpos($url, $inputtedURL) !== false) {
            array_push($URLArray, $url);
        }
    }
}
// Extract the emails from the URLs that were crawled
foreach ($URLArray as $key => $url) {
    $emails = get_emails_from_webpage($url);
    if ($emails != null) {
        foreach ($emails as $email) {
            if (!in_array($email, $scrapedEmails)) {
                array_push($scrapedEmails, $email);
            }
        }
    }
}
// Output the scraped emails along with the number of URLs crawled
foreach ($scrapedEmails as $value) {
    echo $value . " " . count($URLArray);
}
It basically goes to the domain you enter, fetches all its pages, and checks them for emails.
Each domain can take up to 30 seconds to crawl. I want to see if there is a way to speed this web crawler up. One approach I thought of is limiting it to contact pages only, but I couldn't find a clever way to do that.
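One hypothetical way to limit the crawl to contact pages is to keep only links whose path contains common contact-related keywords before scraping them. This is a sketch, not part of the original crawler; the function name and keyword list are my own assumptions and would need tuning for real sites:

```php
<?php
// Hypothetical filter: keep only URLs whose path suggests a contact page.
// The keyword list is an assumption; extend it for the sites you target.
function looks_like_contact_page($url)
{
    $keywords = array('contact', 'about', 'impressum', 'kontakt');
    // Cast to string so a missing/false path becomes '' instead of an error
    $path = strtolower((string) parse_url($url, PHP_URL_PATH));
    foreach ($keywords as $keyword) {
        if (strpos($path, $keyword) !== false) {
            return true;
        }
    }
    return false;
}

// Usage: narrow the crawled URL list before calling get_emails_from_webpage()
$URLArray = array(
    'http://example.com/contact-us',
    'http://example.com/blog/post-1',
    'http://example.com/about',
);
$contactPages = array_values(array_filter($URLArray, 'looks_like_contact_page'));
```

With a filter like this the email-extraction loop only hits a handful of pages per domain instead of every crawled URL.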
Answer (score: 0)
If your intentions aren't evil -
As mentioned in the comments, one way to achieve this is to run the crawlers in parallel (multiple processes) rather than crawling one domain at a time.
Something like:
exec('php crawler.php > /dev/null 2>&1 &');
exec('php crawler.php > /dev/null 2>&1 &');
exec('php crawler.php > /dev/null 2>&1 &');
exec('php crawler.php > /dev/null 2>&1 &');
exec('php crawler.php > /dev/null 2>&1 &');
On the server you can set up a CRON job that runs this automatically, so you don't have to start it manually.
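Another option, instead of shelling out to several `crawler.php` processes, is to fetch the crawled URLs concurrently inside one script with `curl_multi`. This is a rough sketch under my own naming, not code from the original crawler; each page it returns could then be fed through the same email regex:

```php
<?php
// Fetch several URLs concurrently with curl_multi instead of sequential
// file_get_contents() calls. The function name is illustrative.
function fetch_pages_in_parallel(array $urls)
{
    $multi = curl_multi_init();
    $handles = array();

    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 10);
        curl_multi_add_handle($multi, $ch);
        $handles[$url] = $ch;
    }

    // Drive all transfers until every handle has finished
    do {
        $status = curl_multi_exec($multi, $active);
        if ($active) {
            curl_multi_select($multi);
        }
    } while ($active && $status === CURLM_OK);

    // Collect each page body, keyed by its URL
    $pages = array();
    foreach ($handles as $url => $ch) {
        $pages[$url] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($multi, $ch);
        curl_close($ch);
    }
    curl_multi_close($multi);

    return $pages;
}
```

Since the 30 seconds per domain is mostly network wait, overlapping the requests this way tends to help more than any CPU-side change.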