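I have configured allow_url_fopen=0 to keep scraping tools out. The setting is applied globally, and I do not allow it to be overridden by a local php.ini file. However, I noticed that the flag can be bypassed if the scraping tool is based on cURL. See the page-copier function below: using it, I successfully copied pages from a server configured with allow_url_fopen=0.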
    public function handle()
    {
        try {
            if ( ini_get('allow_url_fopen') ) {
                Log::info('Flag allow_url_fopen is enabled');
                $html = new Htmldom('page_url_here');
            } else {
                Log::info('Flag allow_url_fopen is disabled trying with cURL');
                $webpage = EventCron::get_web_page('page_url_here');
                $html = new Htmldom($webpage['content']);
            }
            /* Doing some magical stuff with the site content */
            $agenda = $html->find('div.articles', 0);
            Log::info('success');
        } catch (\Exception $e) {
            Log::error('Event Cron Error '.$e->getMessage());
        }
    }
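    public static function get_web_page( $url, $cookiesIn = '' ){
        $options = array(
            CURLOPT_RETURNTRANSFER => true,     // return the response instead of printing it
            CURLOPT_HEADER         => true,     // include response headers in the output
            CURLOPT_FOLLOWLOCATION => true,     // follow redirects
            CURLOPT_ENCODING       => "",       // accept all supported encodings
            CURLOPT_AUTOREFERER    => true,     // set Referer on redirect
            CURLOPT_CONNECTTIMEOUT => 120,      // connect timeout, in seconds
            CURLOPT_TIMEOUT        => 120,      // overall timeout, in seconds
            CURLOPT_MAXREDIRS      => 10,       // stop after 10 redirects
            CURLINFO_HEADER_OUT    => true,     // track the request headers that were sent
            CURLOPT_SSL_VERIFYPEER => true,     // verify the peer's certificate
            CURLOPT_HTTP_VERSION   => CURL_HTTP_VERSION_1_1,
            CURLOPT_COOKIE         => $cookiesIn
        );
        $ch = curl_init( $url );
        curl_setopt_array( $ch, $options );
        $rough_content = curl_exec( $ch );
        $err     = curl_errno( $ch );
        $errmsg  = curl_error( $ch );
        $header  = curl_getinfo( $ch );
        curl_close( $ch );

        // curl_exec() returns false on failure; treat that as an empty response
        if ( $rough_content === false ) {
            $rough_content = '';
        }

        // Split headers from body by offset; str_replace() could also remove matching text inside the body
        $header_content = substr($rough_content, 0, $header['header_size']);
        $body_content   = trim(substr($rough_content, $header['header_size']));

        // Collect name=value pairs from every Set-Cookie header (header names are case-insensitive)
        $pattern = "#Set-Cookie:\\s+(?<cookie>[^=]+=[^;]+)#mi";
        preg_match_all($pattern, $header_content, $matches);
        $cookiesOut = implode("; ", $matches['cookie']);

        $page = array();
        $page['errno']   = $err;
        $page['errmsg']  = $errmsg;
        $page['headers'] = $header_content;
        $page['content'] = $body_content;
        $page['cookies'] = $cookiesOut;
        return $page;
    }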
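The question now is: how can I prevent the pages from being scraped? If there is nothing that lets us do so, it could be a security issue in PHP. I found one alternative, which is to prevent this by disabling the cURL library, but that is not a proper solution. Some of my hosted projects need cURL, since it is the most commonly used library and is popular among web developers.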
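
For illustration only, here is a rough sketch of what "disabling cURL" could look like in the global php.ini. It assumes the block is done through the `disable_functions` directive rather than by unloading the extension, and the function list is an assumption for the example, not a complete or recommended set:

```ini
; Global php.ini (disable_functions can only be set here, not via ini_set() or per-directory overrides)
allow_url_fopen = Off

; Assumed example list: blocks the cURL entry points used by the function above.
; This also breaks every hosted project that legitimately relies on cURL.
disable_functions = curl_init,curl_exec,curl_multi_exec
```

With something like this in place, the `get_web_page()` call above would fail on `curl_init()`, which is exactly the side effect that makes this approach unsuitable for my hosted projects.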