PHP cURL web scraping: I get a blank page when I try to extract specific content

Time: 2016-12-21 10:37:25

Tags: php html curl web-scraping web-crawler

I am scraping a web page with PHP and cURL and have run into a problem: when I try to extract specific content, I get a blank page back.

For example:

<!DOCTYPE html>
<html>
<head>
<title></title>
</head>
<body>

  <table>
    <tr>
       <td>test1</td>
       <td>test1</td>
       <td>test1</td>
    </tr>
  </table>

</body>
</html>

I only need the contents of <tr></tr>.

Here is my code snippet.

// Defining the basic cURL function
function curl($url) {
    // Assigning cURL options to an array
    $options = Array(
        CURLOPT_SSL_VERIFYPEER => false,
        // CURLOPT_CAINFO => 'cacert.pem',
        CURLOPT_RETURNTRANSFER => TRUE,  // Setting cURL's option to return the webpage data
        CURLOPT_FOLLOWLOCATION => TRUE,  // Setting cURL to follow 'location' HTTP headers
        CURLOPT_AUTOREFERER => TRUE, // Automatically set the referer when following 'location' HTTP headers
        CURLOPT_CONNECTTIMEOUT => 120,   // Setting the maximum amount of time (in seconds) to wait while connecting
        CURLOPT_TIMEOUT => 120,  // Setting the maximum amount of time (in seconds) the whole cURL request may take
        CURLOPT_MAXREDIRS => 10, // Setting the maximum number of redirections to follow
        CURLOPT_USERAGENT => "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1a2pre) Gecko/2008073000 Shredder/3.0a2pre ThunderBrowse/3.2.1.8",  // Setting the useragent
        CURLOPT_URL => $url, // Setting cURL's URL option with the $url variable passed into the function
    );

    $ch = curl_init();  // Initialising cURL 
    curl_setopt_array($ch, $options);   // Setting cURL's options using the previously assigned array data in $options
    $data = curl_exec($ch); // Executing the cURL request and assigning the returned data to the $data variable
    curl_close($ch);    // Closing cURL 
    return $data;   // Returning the data from the function 
}




// Defining the basic scraping function
function scrape_between($data, $start, $end){
    $data = stristr($data, $start); // Stripping all data from before $start
    $data = substr($data, strlen($start));  // Stripping $start
    $stop = stripos($data, $end);   // Getting the position of the $end of the data to scrape
    $data = substr($data, 0, $stop);    // Stripping all data from after and including the $end of the data to scrape
    return $data;   // Returning the scraped data from the function
}



$url = "https://www.weddingwire.com/c/ak-alaska/wedding-officiants/9-sca.html";      

$results_page = curl($url); // Downloading the results page using our curl() function

$results_page = scrape_between($results_page, '<div class="js-search-results">', '<div class="col-xs-12 testing-catalog-pagination-links">'); // Scraping out only the middle section of the results page that contains our results

This is the data that I need to parse!

Credit for the source code goes to: http://www.jacobward.co.uk/web-scraping-with-php-curl-part-1/

1 Answer:

Answer 0 (score: 0)

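Using plain string functions is one way to parse markup data. Regular expressions offer more flexibility, but solutions based on them are usually not robust against even small changes to the markup structure.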
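One way the string-based approach can fail silently: when the start marker does not occur in the downloaded page, stristr() returns false, and the substr() calls that follow reduce that to an empty result, which looks like a blank page. Whether this is what happens for the URL in the question is an assumption. The sketch below uses a hypothetical helper name, scrape_between_checked(), which is not part of the original code, and returns null when a marker is missing so the caller can tell the difference.

```php
<?php
// Hypothetical variant of scrape_between() that reports a missing marker
// instead of silently returning an empty string.
function scrape_between_checked(string $data, string $start, string $end): ?string
{
    $from = stripos($data, $start);      // case-insensitive search for the start marker
    if ($from === false) {
        return null;                     // start marker is not present in the page
    }
    $from += strlen($start);             // skip past the start marker itself
    $to = stripos($data, $end, $from);   // search for the end marker after the start marker
    if ($to === false) {
        return null;                     // end marker is not present after the start marker
    }
    return substr($data, $from, $to - $from);
}

// Illustrative input only: the markers from the question are absent here.
$html = '<p>no results container here</p>';
$section = scrape_between_checked(
    $html,
    '<div class="js-search-results">',
    '<div class="col-xs-12 testing-catalog-pagination-links">'
);
if ($section === null) {
    echo "Marker not found; inspect the raw response before scraping.\n";
}
```

A null result suggests dumping the raw cURL response first to check whether the expected markup is in it at all.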

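The best option is a DOM parser. DOM parsers are the right tool, designed specifically for this kind of task. You load the markup into a parser object, and can then navigate the structure and select and extract whatever data you need.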

Take a look at this simple example:

<?php
require_once 'simple_html_dom.php';

$markup = <<<EOT
<html>
  <body>
    <div>foo 1</div>
    <div class="category-landing-links">
      <div class="col-xs-4">bla 1</div>
      <div class="col-xs-4">bla 2</div>
      <div class="col-xs-4">bla 3</div>
    </div>
    <div>foo 2</div>
  </body>
</html>
EOT;

$htmlDom = str_get_html($markup);
$outerDivs = $htmlDom->find('div[class=category-landing-links]');

$finalData = [];
foreach ($outerDivs as $key=>$outerDiv) {
  foreach ($outerDiv->children() as $innerDiv) {
    $finalData[$key][] = $innerDiv->innertext;
  }
}
var_dump($finalData);

The output of the above is:

array(1) {
  [0] =>
  array(3) {
    [0] =>
    string(5) "bla 1"
    [1] =>
    string(5) "bla 2"
    [2] =>
    string(5) "bla 3"
  }
}

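This is an array containing one entry for each matching outer <div> tag, and each entry in turn holds the inner text of all of that tag's child <div> tags. Depending on your situation, you may need to adapt this to your own needs.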

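simple_html_dom.php is an implementation of a very simple DOM parser. It is a bit old, but it works well. It is free software available on SourceForge. The online documentation offers examples that are easy to understand and demonstrate most of its features.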
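If adding a third-party file is not an option, PHP's bundled DOM extension can do a similar job. The following is a minimal sketch, assuming the DOM extension (DOMDocument and DOMXPath) is enabled, applied to the table from the question. The variable names are illustrative, and the XPath expressions would need adjusting for a real page.

```php
<?php
// Sample markup taken from the question.
$markup = <<<EOT
<!DOCTYPE html>
<html>
<head>
<title></title>
</head>
<body>
  <table>
    <tr>
       <td>test1</td>
       <td>test1</td>
       <td>test1</td>
    </tr>
  </table>
</body>
</html>
EOT;

// Collect libxml parse warnings internally instead of printing them,
// since real-world HTML is rarely perfectly valid.
$previous = libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($markup);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($dom);
$rows = [];
// Visit every <tr> inside a <table>, then gather the text of its <td> cells.
foreach ($xpath->query('//table//tr') as $tr) {
    $cells = [];
    foreach ($xpath->query('./td', $tr) as $td) {
        $cells[] = trim($td->textContent);
    }
    $rows[] = $cells;
}
var_dump($rows);
```

Here $rows ends up with one entry per table row, each holding the text of that row's cells, which mirrors the nested array built in the simple_html_dom example.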