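I'm trying to scrape some recipes from a page to use as samples for a school project, but the page just keeps loading a blank page.
I'm following this tutorial - here
Here is my code: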
<?php
function curl($url) {
$ch = curl_init(); // Initialising cURL
curl_setopt($ch, CURLOPT_URL, $url); // Setting cURL's URL option with the $url variable passed into the function
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE); // Setting cURL's option to return the webpage data
$data = curl_exec($ch); // Executing the cURL request and assigning the returned data to the $data variable
curl_close($ch); // Closing cURL
return $data; // Returning the data from the function
}
function scrape_between($data, $start, $end){
$data = stristr($data, $start); // Stripping all data from before $start
$data = substr($data, strlen($start)); // Stripping $start
$stop = stripos($data, $end); // Getting the position of the $end of the data to scrape
$data = substr($data, 0, $stop); // Stripping all data from after and including the $end of the data to scrape
return $data; // Returning the scraped data from the function
}
$continue = true;
$url = curl("https://www.justapinch.com/recipes/main-course/");
while ($continue == true) {
$results_page = curl($url);
$results_page = scrape_between($results_page,"<div id=\"grid-normal\">","<div id=\"rightside-content\"");
$separate_results = explode("<h3 class=\"tight-margin\"",$results_page);
foreach ($separate_results as $separate_result) {
if ($separate_result != "") {
$results_urls[] = "https://www.justapinch.com" . scrape_between($separate_result,"href=\"","\" class=\"");
}
}
// Commented out to test code above
// if (strpos($results_page,"Next Page")) {
// $continue = true;
// $url = scrape_between($results_page,"<nav><div class=\"col-xs-7\">","</div><nav>");
// if (strpos($url,"Back</a>")) {
// $url = scrape_between($url,"Back</a>",">Next Page");
// }
// $url = "https://www.justapinch.com" . scrape_between($url, "href=\"", "\"");
// } else {
// $continue = false;
// }
// sleep(rand(3,5));
print_r($results_urls);
}
?>
I'm using Cloud9, I have installed php5 cURL, and I'm running apache2. I would appreciate any help.
Answer 0 (score: 0)
This is the problem:
$results_page = curl($url);
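You are trying to fetch content not from a URL but from an HTML page, because before the while() loop you set $url to the result of fetching the page. In other words, $url already holds the downloaded HTML, and curl($url) then passes that HTML to cURL as if it were an address. I think you should do the following: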
$results_page = curl("https://www.justapinch.com/recipes/main-course/");
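An equivalent way to apply this fix is to keep $url as the address and only download inside the loop. The sketch below also stops the loop explicitly, because with the "Next Page" block commented out in the question nothing ever sets $continue to false. The function name fetch_first_page and the $fetch parameter are hypothetical, introduced only so the sketch can run offline with a stand-in for curl().

```php
<?php
// Hypothetical sketch. $fetch is any callable that takes a URL and returns
// the page body, for example the question's curl() function passed as 'curl'.
function fetch_first_page($startUrl, $fetch)
{
    $url = $startUrl;      // $url holds the address, never the downloaded HTML
    $continue = true;
    $pages = array();
    while ($continue) {
        $pages[] = call_user_func($fetch, $url);
        // The "Next Page" lookup is commented out in the question, so nothing
        // else would ever end the loop; stop explicitly after one page.
        $continue = false;
    }
    return $pages;
}

// Offline stand-in for curl() so the sketch can run without network access.
$fake = function ($url) { return 'BODY OF ' . $url; };
print_r(fetch_first_page('https://www.justapinch.com/recipes/main-course/', $fake));
```

Once the pagination code is restored, the explicit $continue = false would be replaced by the "Next Page" check from the question.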
Edit
You should change the way you query the HTML to use the DOM.
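As a rough illustration of the DOM approach, the sketch below uses PHP's DOMDocument and DOMXPath instead of string slicing. The helper name extract_recipe_links is hypothetical, and the h3 class "tight-margin" is taken from the question's code, so the query is an assumption about the page markup rather than something verified against the live site.

```php
<?php
// Hypothetical helper: collects the href of every link found inside an
// <h3 class="tight-margin"> element and prefixes it with $base.
function extract_recipe_links($html, $base)
{
    if ($html === '') {
        return array(); // loadHTML() rejects an empty string
    }
    $doc = new DOMDocument();
    // Collect parser warnings internally; real-world HTML is rarely valid.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Match "tight-margin" as a whole class token, not as a substring.
    $query = '//h3[contains(concat(" ", normalize-space(@class), " "), " tight-margin ")]//a[@href]';
    $links = array();
    foreach ($xpath->query($query) as $a) {
        $links[] = $base . $a->getAttribute('href');
    }
    return $links;
}

// Made-up sample markup shaped like what the question's code expects.
$sample = '<div id="grid-normal">'
        . '<h3 class="tight-margin"><a href="/recipes/a" class="x">A</a></h3>'
        . '<h3 class="tight-margin"><a href="/recipes/b" class="x">B</a></h3>'
        . '</div>';
print_r(extract_recipe_links($sample, 'https://www.justapinch.com'));
```

Unlike scrape_between(), this does not depend on the exact attribute order or spacing inside the tags, so small markup changes are less likely to break it.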
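Answer 1 (score: 0)
Why do people do this? The code has absolutely no error checking, and then they go to some forum and ask why is this code, which completely ignores any and all errors, not working?
I don't know, but at the very least you could add some error checking and run the code before asking. It's not just you, lots of people do this, and it's annoying; you should feel bad for doing it.
curl_setopt returns bool(false) if there was an error setting the option. curl_exec returns bool(false) if there was an error in the transfer. curl_init returns bool(false) if there was an error creating the curl handle. Use curl_error to extract the error description, and report it with \RuntimeException.
Now delete this thread and add some error checking. If the error checking does not reveal the problem, or it does but you're not sure how to fix it, then make a new thread.
Here are some error-checking function wrappers to get you started: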
function ecurl_setopt ( /*resource*/$ch , int $option , /*mixed*/ $value ):bool{
$ret=curl_setopt($ch,$option,$value);
if($ret!==true){
//option should be obvious by stack trace
throw new RuntimeException ( 'curl_setopt() failed. curl_errno: ' . return_var_dump ( curl_errno ($ch) ).'. curl_error: '.curl_error($ch) );
}
return true;
}
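function ecurl_exec ( /*resource*/$ch){
$ret=curl_exec($ch);
// curl_exec() returns false on failure; with CURLOPT_RETURNTRANSFER set it returns the response body as a string on success, so only false is treated as an error
if($ret===false){
throw new RuntimeException ( 'curl_exec() failed. curl_errno: ' . return_var_dump ( curl_errno ($ch) ).'. curl_error: '.curl_error($ch) );
}
return $ret;
}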
function return_var_dump(/*...*/){
$args = func_get_args ();
ob_start ();
call_user_func_array ( 'var_dump', $args );
return ob_get_clean ();
}
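Putting the checks described above together, here is a minimal sketch of how the question's curl() function could report failures instead of silently returning false. The name fetch_page and the 30-second timeout are assumptions made for illustration; type declarations are left out so it also runs on PHP 5.

```php
<?php
// Hypothetical replacement for the question's curl() function that throws
// a RuntimeException instead of silently returning false.
function fetch_page($url)
{
    $ch = curl_init();
    if ($ch === false) {
        throw new RuntimeException('curl_init() failed.');
    }
    $options = array(
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_TIMEOUT        => 30,   // assumption: 30 seconds is an acceptable limit
    );
    if (curl_setopt_array($ch, $options) !== true) {
        throw new RuntimeException('curl_setopt_array() failed: ' . curl_error($ch));
    }
    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException(
            'curl_exec() failed (' . curl_errno($ch) . '): ' . curl_error($ch)
        );
    }
    // The handle is released automatically when $ch goes out of scope.
    return $body;
}
```

Calling fetch_page() inside a try block and printing $e->getMessage() in the catch for RuntimeException should show the reason for a failed request instead of a blank page.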