Question

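I can't scrape data from a few websites using cURL.
I'm using cURL to scrape websites by URL. It works fine for about 80% of the URLs I use, but some URLs don't seem to be "scrapable". For example, when I try to scrape https://www.nextdoorhub.com/ and https://www.atknsn.com/, it doesn't work. The page just stays blank, and in the end no result is returned.

Here is my code: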

<center>
<br/>
    <form method="post" name="scrap_form" id="scrap_form" action="scrape_data.php">
         <b>Enter Website URL To Scrape Data:</b>
        <input type="text" name="website_url" id="website_url">
        <input type="submit" name="submit" value="Submit" >
    </form>
</center>
<?php
error_reporting(E_ALL ^ E_NOTICE );
  $website_url = $_POST['website_url'];
 $result =  scrapeWebsiteData($website_url);

 function scrapeWebsiteData($website_url){

    $curl = curl_init();
    curl_setopt($curl, CURLOPT_URL, $website_url);
    curl_setopt($curl, CURLOPT_HEADER, 0);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
    curl_setopt($curl, CURLOPT_BINARYTRANSFER,1);
    $result = curl_exec($curl);
    curl_close($curl);
    return $result;
 }
  // the pattern needs an opening "/" delimiter to match the closing one
  $regextit = '/<div id="case_textlist">(.*?)<\/div>/s';
   preg_match_all($regextit, $result, $list);
  /* echo "<pre>";
  print_r($list[1]); die; */
  $regex = '/[\'" >\t^]([^\'" \n\r\t]+\.(jpe?g|bmp|gif|png))[\'" <\n\r\t]/i'; 
  preg_match_all($regex, $result, $url_matches);
  $count = count($url_matches[1]);
  // set the local path of image 
  $local_path = 'C:\udeytech\htdocs\tests\images\\'; 
   for($i=0; $i<$count; $i++)
    {
     // take the image file name (last path segment) from the image URL
     $image_name = basename((string) parse_url($url_matches[1][$i], PHP_URL_PATH));
    //save image url in a variable
    $image_url = $url_matches[1][$i];
    $image_path = scrapeWebsiteData($image_url);

    $file_open = fopen($local_path.$image_name, 'w');
    fwrite($file_open, $image_path);
    fclose($file_open);      
   }

?>

Answer 1

Have you tried loading either of these sites in a browser and looking at the response?
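The same check can be made from PHP by having cURL report what actually came back instead of silently returning an empty result. The sketch below is only an illustration: the function names `fetchWithDiagnostics` and `describeFetch`, the 15-second timeout, and the user-agent string are assumptions and not part of the code in the question.

```php
<?php
// Hypothetical helper: fetch a URL and keep the details that can explain a blank page.
function fetchWithDiagnostics(string $url): array
{
    $curl = curl_init();
    curl_setopt($curl, CURLOPT_URL, $url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true); // follow HTTP redirects
    curl_setopt($curl, CURLOPT_TIMEOUT, 15);          // assumed limit, in seconds
    curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (compatible; ExampleScraper/1.0)'); // assumed value

    $body   = curl_exec($curl);                       // false on failure
    $status = (int) curl_getinfo($curl, CURLINFO_HTTP_CODE);
    $error  = curl_error($curl);                      // empty string when there was no error
    curl_close($curl);

    return ['body' => $body, 'status' => $status, 'error' => $error];
}

// Hypothetical helper: turn the result into a one-line summary.
function describeFetch(array $result): string
{
    if ($result['body'] === false) {
        return 'cURL failed: ' . $result['error'];
    }
    return sprintf('HTTP %d, %d bytes received', $result['status'], strlen($result['body']));
}
```

Calling `echo describeFetch(fetchWithDiagnostics($website_url));` shows whether the request failed outright or returned a page. A successful response whose HTML lacks the content you see in the browser suggests the content is added later by JavaScript.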

nextdoorhub is using Angular, and atknsn looks heavy on jQuery. Long story short, these sites need to run JavaScript to render the full HTML you want to scrape.

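PHP + cURL alone won't cut it. Look at the threads that discuss scraping Angular; they will point you in the right direction. (Hint: you will need to scrape these sites with node.js.)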
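A rough way to tell whether a page falls into this category is to inspect the HTML that cURL returned: a client-rendered page tends to arrive as a nearly empty shell containing framework markers and script tags. The following is a heuristic sketch, not a reliable detector. The function name `looksClientRendered`, the marker list, and the 200-character threshold are assumptions chosen for illustration, and a jQuery-heavy page that ships some static text may slip past it.

```php
<?php
// Hypothetical heuristic: does this HTML look like an empty shell that
// JavaScript is expected to fill in after the page loads?
function looksClientRendered(string $html, int $minTextLength = 200): bool
{
    // Markers commonly left by Angular/AngularJS applications (assumed list).
    $markers = ['ng-app', 'ng-view', 'ui-view', '<app-root'];
    foreach ($markers as $marker) {
        if (stripos($html, $marker) !== false) {
            return true;
        }
    }

    // Otherwise drop scripts and styles, strip the remaining tags,
    // and measure how much visible text is left.
    $withoutCode = preg_replace('#<(script|style)\b[^>]*>.*?</\1>#is', '', $html);
    $text = trim(preg_replace('/\s+/', ' ', strip_tags((string) $withoutCode)));

    // Scripts present but almost no text: probably rendered in the browser.
    return stripos($html, '<script') !== false && strlen($text) < $minTextLength;
}
```

If this returns true for the HTML of a given URL, PHP + cURL on its own is unlikely to see the content, and a tool that executes JavaScript is the more promising route.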
