Here is my code:
<center>
<br/>
<form method="post" name="scrap_form" id="scrap_form" action="scrape_data.php">
<b>Enter Website URL To Scrape Data:</b>
<input type="text" name="website_url" id="website_url">
<input type="submit" name="submit" value="Submit" >
</form>
</center>
<?php
error_reporting(E_ALL ^ E_NOTICE );
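// Avoid an undefined-index error when the page is loaded without a POST.
$website_url = isset($_POST['website_url']) ? trim($_POST['website_url']) : '';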
$result = scrapeWebsiteData($website_url);
function scrapeWebsiteData($website_url){
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, $website_url);
curl_setopt($curl, CURLOPT_HEADER, 0);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
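// Note: CURLOPT_BINARYTRANSFER has had no effect since PHP 5.1.3;
// CURLOPT_RETURNTRANSFER already returns the raw response body.
curl_setopt($curl, CURLOPT_BINARYTRANSFER,1);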
$result = curl_exec($curl);
curl_close($curl);
return $result;
}
// The pattern needs an opening "/" delimiter to match the closing "/s".
$regextit = '/<div id="case_textlist">(.*?)<\/div>/s';
preg_match_all($regextit, $result, $list);
/* echo "<pre>";
print_r($list[1]); die; */
$regex = '/[\'" >\t^]([^\'" \n\r\t]+\.(jpe?g|bmp|gif|png))[\'" <\n\r\t]/i';
preg_match_all($regex, $result, $url_matches);
$count = count($url_matches[1]);
// set the local path of image
$local_path = 'C:\udeytech\htdocs\tests\images\\';
for($i=0; $i<$count; $i++)
{
preg_match_all('!.*?/!', $url_matches[1][$i], $matches);
$last_part = end($matches[0]);
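// Match the image file name that follows the last path segment
// (.jpg, .jpeg, .gif, .png, .bmp). preg_quote() escapes regex
// metacharacters in the URL, and the dot before the extension is escaped.
preg_match('!' . preg_quote($last_part, '!') . '(.*?\.(jpe?g|gif|png|bmp))!i', $url_matches[1][$i], $matche);
$secons_part = $matche[0];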
$info = pathinfo($secons_part);
$image_name = $info['basename'];
//save image url in a variable
$image_url = $url_matches[1][$i];
$image_path = scrapeWebsiteData($image_url);
$file_open = fopen($local_path.$image_name, 'w');
fwrite($file_open, $image_path);
fclose($file_open);
}
?>
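Answer 0 (score: 0)
Have you tried loading any of these sites in a browser and looking at the response?
nextdoorhub is using Angular, and atknsn looks heavy on jQuery. Long story short, these sites need to run JavaScript to render the full HTML you want to scrape.
PHP + cURL alone won't cut it. Take a look at the threads discussing scraping Angular, which will point you in the right direction. (Hint: you will need to scrape these sites with node.js.)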
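One quick way to confirm this from PHP is to inspect the raw HTML that cURL returns before running any regex on it. The following is only a rough sketch: `looksClientRendered()` is a hypothetical helper, not something from the question or from any library, and the markers it looks for (`ng-app`, `data-ng-app`, `<ng-view>`) are an assumption about how an AngularJS page is usually bootstrapped. It reports `true` when the element you want is absent from the raw markup and Angular markers are present.

```php
<?php
// Hypothetical helper (not part of the original post): guesses whether the
// element we want is only produced by JavaScript after the page loads.
function looksClientRendered(string $html, string $targetId): bool
{
    // If the target element is already in the raw HTML, cURL + regex can see it.
    $idPattern = '/\bid\s*=\s*["\']' . preg_quote($targetId, '/') . '["\']/i';
    if (preg_match($idPattern, $html) === 1) {
        return false;
    }

    // Assumed AngularJS markers: ng-app / data-ng-app attributes or <ng-view>.
    // This is a heuristic; other frameworks would need other markers.
    return preg_match('/\b(?:data-)?ng-app\b|<ng-view\b/i', $html) === 1;
}
```

With the code from the question, you could call `looksClientRendered($result, 'case_textlist')` right after `scrapeWebsiteData($website_url)`. If it returns `true`, the regexes have nothing to match against, and the page needs to be rendered by something that executes JavaScript first.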