PHP cURL web scrape: I'm running into a problem. It returns a blank page when I try to extract specific content.
For example:
<!DOCTYPE html>
<html>
<head>
<title></title>
</head>
<body>
<table>
<tr>
<td>test1</td>
<td>test1</td>
<td>test1</td>
</tr>
</table>
</body>
</html>
I only need the <tr></tr> part.
Here is my code snippet.
// Defining the basic cURL function
function curl($url) {
// Assigning cURL options to an array
$options = Array(
CURLOPT_SSL_VERIFYPEER => false,
// CURLOPT_CAINFO => 'cacert.pem',
CURLOPT_RETURNTRANSFER => TRUE, // Setting cURL's option to return the webpage data
CURLOPT_FOLLOWLOCATION => TRUE, // Setting cURL to follow 'location' HTTP headers
CURLOPT_AUTOREFERER => TRUE, // Automatically set the referer where following 'location' HTTP headers
CURLOPT_CONNECTTIMEOUT => 120, // Setting the amount of time (in seconds) before the request times out
CURLOPT_TIMEOUT => 120, // Setting the maximum amount of time for cURL to execute queries
CURLOPT_MAXREDIRS => 10, // Setting the maximum number of redirections to follow
CURLOPT_USERAGENT => "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1a2pre) Gecko/2008073000 Shredder/3.0a2pre ThunderBrowse/3.2.1.8", // Setting the useragent
CURLOPT_URL => $url, // Setting cURL's URL option with the $url variable passed into the function
);
$ch = curl_init(); // Initialising cURL
curl_setopt_array($ch, $options); // Setting cURL's options using the previously assigned array data in $options
$data = curl_exec($ch); // Executing the cURL request and assigning the returned data to the $data variable
curl_close($ch); // Closing cURL
return $data; // Returning the data from the function
}
// Defining the basic scraping function
function scrape_between($data, $start, $end){
$data = stristr($data, $start); // Stripping all data from before $start
$data = substr($data, strlen($start)); // Stripping $start
$stop = stripos($data, $end); // Getting the position of the $end of the data to scrape
$data = substr($data, 0, $stop); // Stripping all data from after and including the $end of the data to scrape
return $data; // Returning the scraped data from the function
}
$url = "https://www.weddingwire.com/c/ak-alaska/wedding-officiants/9-sca.html";
$results_page = curl($url); // Downloading the results page using our curl() function
$results_page = scrape_between($results_page, '<div class="js-search-results">', '<div class="col-xs-12 testing-catalog-pagination-links">'); // Scraping out only the middle section of the results page that contains our results
I have to give credit to the source of this code: http://www.jacobward.co.uk/web-scraping-with-php-curl-part-1/
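Answer 0 (score: 0)
Plain string functions are one way to parse markup data. Regular expressions offer more flexibility, but solutions based on them are rarely robust against even minor changes in the markup structure.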
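To illustrate how brittle literal string matching can be, here is a minimal sketch. The function name `scrape_between_checked` and the sample markup are made up for this illustration; they are not part of the question's code. The function returns `null` as soon as a marker is not found literally, which can happen because of a single extra space in a tag:

```php
<?php
// Hypothetical helper for illustration: similar to scrape_between(), but it
// reports a missing marker with null instead of silently returning an empty result.
function scrape_between_checked(string $data, string $start, string $end): ?string
{
    $from = stripos($data, $start);
    if ($from === false) {
        return null; // start marker not present in the markup
    }
    $from += strlen($start);
    $to = stripos($data, $end, $from);
    if ($to === false) {
        return null; // end marker not present after the start marker
    }
    return substr($data, $from, $to - $from);
}

// Made-up sample markup; the second variant differs only by one extra space.
$exact   = '<div class="results"><p>hit</p></div><div class="pager">';
$changed = '<div  class="results"><p>hit</p></div><div class="pager">';

// Markers match literally: the markup between them is returned.
var_dump(scrape_between_checked($exact, '<div class="results">', '<div class="pager">'));
// One extra space in the page: the start marker no longer matches, so null.
var_dump(scrape_between_checked($changed, '<div class="results">', '<div class="pager">'));
```

A blank result from a string-based scraper is therefore often a sign that one of the markers simply does not occur verbatim in the downloaded page.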
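The best option is to use a DOM parser. DOM parsers are the right tool, designed specifically for this kind of task. You feed the markup into a parser object, and can then "navigate" the structure and select and extract whatever data you need.
Take a look at this simple example: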
<?php
require_once 'simple_html_dom.php';
$markup = <<<EOT
<html>
<body>
<div>foo 1</div>
<div class="category-landing-links">
<div class="col-xs-4">bla 1</div>
<div class="col-xs-4">bla 2</div>
<div class="col-xs-4">bla 3</div>
</div>
<div>foo 2</div>
</body>
</html>
EOT;
$htmlDom = str_get_html($markup);
$outerDivs = $htmlDom->find('div[class=category-landing-links]');
$finalData = [];
foreach ($outerDivs as $key=>$outerDiv) {
foreach ($outerDiv->children() as $innerDiv) {
$finalData[$key][] = $innerDiv->innertext;
}
}
var_dump($finalData);
The output of the above is:
array(1) {
[0] =>
array(3) {
[0] =>
string(5) "bla 1"
[1] =>
string(5) "bla 2"
[2] =>
string(5) "bla 3"
}
}
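This is an array holding one entry for each matching outer <div> tag, which in turn holds the inner text of all its child <div> tags. Depending on your situation, you may need to adapt this to your own needs.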
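simple_html_dom.php is an implementation of a very simple DOM parser. It is a bit old, but it works well. It is free software available on SourceForge. The online documentation offers examples that are easy to understand and demonstrate most of its features.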