Scraping more than one page

Time: 2019-03-04 19:43:06

Tags: php web curl screen-scraping

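I am trying to get data (name, varietal, format, and price) from this site: https://aabalat.com/wine/country/france. I created an array named $urls and pushed every link into it. Each new cURL session returns data for 20 more wines. First, I need to capture the format and push it into an array, as shown in the code below. When I print $french_wines_formats_matches, it works fine. But when I print $french_wines_format_array, it does not work well.

I am new to scraping and do not have much experience with it.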

    // Array contains 197 links
    $urls = array();
    array_push($urls, "https://aabalat.com/wine/country/france");

    // This loop adds the other page links
    for($i = 1; $i < 5; $i++)
    {
      $urls[] = "https://aabalat.com/wine/country/france?page=".$i;
    }

    // echo "<pre>";
    // print_r($urls);
    // echo "</pre>";

    $french_wines_array = array();
    $french_wines_title_array = array();
    $french_wines_varietal_array = array();
    $french_wines_format_array = array();
    $french_wines_price_array = array();

    // Run one cURL session per URL.
    foreach($urls as $url)
    {
      $curl = curl_init();
      curl_setopt($curl, CURLOPT_URL, $url);

      curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
      curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
      curl_setopt($curl, CURLOPT_VERBOSE, true);

      $output = curl_exec($curl);
      $info = curl_getinfo($curl);
      $err = curl_error($curl);
      $ern = curl_errno($curl);

      $french_wine_formats_pattern = '!<span class="wine-list-item-format">(.*)</span>!mi';
      preg_match_all($french_wine_formats_pattern, $output, $french_wines_formats_matches);

      foreach($french_wines_formats_matches[0] as $french_wines_formats_match)
      {
        $french_wines_format_array[] = $french_wines_formats_match;
      }

      echo "<pre>";
      print_r($french_wines_format_array);
      echo "</pre>";

      curl_close($curl);
      sleep(rand(2, 5));
    }

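1 answer:

Answer 0: (score: 0)

Your code and regex appear to work (I tried them). I could not reproduce your cURL call. Instead of just $output = curl_exec($curl), try the following and see whether you get any cURL errors: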

    if (!$output = curl_exec($curl)) {
        if (curl_error($curl)) {
            die(curl_error($curl));
        }
    }
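The same check can be wrapped in a small function so that every page in the loop is fetched the same way. This is only a sketch: `fetchPage()` is a hypothetical helper name, not something from the question, and treating any HTTP status of 400 or above as a failure is an assumption about what you want.

```php
<?php
// Hypothetical helper (not part of the question's code): fetch one page or throw.
function fetchPage(string $url): string
{
    $curl = curl_init();
    curl_setopt($curl, CURLOPT_URL, $url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);

    $output = curl_exec($curl);

    // curl_exec() returns false on failure. Compare strictly so that an
    // empty response body is not mistaken for an error.
    if ($output === false) {
        throw new RuntimeException(
            'cURL error ' . curl_errno($curl) . ': ' . curl_error($curl)
        );
    }

    // Assumption: an HTTP status of 400 or above should also count as a failure,
    // because the transfer itself can succeed and still return an error page.
    $status = curl_getinfo($curl, CURLINFO_HTTP_CODE);
    if ($status >= 400) {
        throw new RuntimeException("HTTP status $status for $url");
    }

    return $output;
}
```

Inside the `foreach` loop you could then write `$output = fetchPage($url);` in a `try`/`catch` block and log the message instead of calling `die()`, so one failed page does not stop the whole run.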

Finally, I tried a simple file_get_contents(), and it seems to work:

    $url = "https://aabalat.com/wine/country/france";
    $output = file_get_contents($url);
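Once $output holds the page, from either cURL or file_get_contents(), the format values can be pulled out with the pattern from the question. The sketch below is one possible way to do it, not a fix confirmed against the live site: `extractFormats()` is a hypothetical helper, the sample markup is made up, and the lazy `(.*?)` quantifier is my substitution for the original `(.*)`. It reads `$matches[1]`, the captured text, whereas the question's loop reads index `[0]`, which holds the full match including the `<span>` tags.

```php
<?php
// Hypothetical helper: return the text inside each format <span>.
function extractFormats(string $html): array
{
    // Pattern from the question, with a lazy quantifier (my change) so that
    // two spans on the same line are matched separately.
    $pattern = '!<span class="wine-list-item-format">(.*?)</span>!i';
    preg_match_all($pattern, $html, $matches);

    // $matches[0] holds the full matches including the tags;
    // $matches[1] holds only the captured text.
    return array_map('trim', $matches[1]);
}

// Made-up sample markup standing in for the fetched page.
$sample = '<span class="wine-list-item-format">750 ml</span>' . "\n"
        . '<span class="wine-list-item-format"> 1.5 L </span>';

print_r(extractFormats($sample));
```

To collect the values across pages, you could append each page's result with `array_merge()` and print the combined array once after the loop, rather than printing the growing array on every iteration.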