(顺便说一句,我是在有问题的网站的许可下抓取这些东西的。)
非常简单的web scraper,当我手动加载所有链接时工作正常,但是当我尝试通过JSON和变量加载它们时(所以我可以使用一个脚本进行大量的抓取并使通过向JSON添加更多链接来处理更多模块化,它在无限循环上运行。
(页面已加载约15分钟)
这是我的JSON。只有一家商店在那里进行测试,但还有大约15家。
[
{
"store":"Incu Men",
"cat":"Accessories",
"general_cat":"Accessories",
"spec_cat":"accessories",
"url":"http://www.incuclothing.com/shop-men/accessories/",
"baseurl":"http://www.incuclothing.com",
"next_select":"a.next",
"prod_name_select":".infobox .fn",
"label_name_select":".infobox .brand",
"desc_select":".infobox .description",
"price_select":"#price",
"mainImg_select":"",
"more_imgs":".product-images",
"product_url":".hproduct .photo-link"
}
]
这是PHP scraper代码:
<?php
//Set infinite time limit
set_time_limit (0);
// Include simple html dom
include('simple_html_dom.php');
// Defining the basic cURL function
function curl($url) {
$ch = curl_init();
// Initialising cURL
curl_setopt($ch, CURLOPT_URL, $url);
// Setting cURL's URL option with the $url variable passed into the function
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
// Setting cURL's option to return the webpage data
$data = curl_exec($ch);
// Executing the cURL request and assigning the returned data to the $data variable
curl_close($ch);
// Closing cURL
return $data;
// Returning the data from the function
}
function getLinks($catURL, $prodURL, $baseURL, $next_select) {
$urls = array();
while($catURL) {
echo "Indexing: $url" . PHP_EOL;
$html = str_get_html(curl($catURL));
foreach ($html->find($prodURL) as $el) {
$urls[] = $baseURL . $el->href;
}
$next = $html->find($next_select, 0);
$url = $next ? $baseURL . $next->href : null;
echo "Results: $next" . PHP_EOL;
}
return $urls;
}
$string = file_get_contents("jsonWorkers/incuMens.json");
$json_array = json_decode($string,true);
foreach ($json_array as $value){
$baseURL = $value['baseurl'];
$catURL = $value['url'];
$store = $value['store'];
$general_cat = $value['general_cat'];
$spec_cat = $value['spec_cat'];
$next_select = $value['next_select'];
$prod_name = $value['prod_name_select'];
$label_name = $value['label_name_select'];
$description = $value['desc_select'];
$price = $value['price_select'];
$prodURL = $value['product_url'];
if (!is_null($value['mainImg_select'])){
$mainImg = $value['mainImg_select'];
}
$more_imgs = $value['more_imgs'];
$allLinks = getLinks($catURL, $prodURL, $baseURL, $next_select);
}
?>
为什么脚本无限运行并且没有返回任何内容/停止/打印任何屏幕的任何想法?我要让它一直运行直到它停止。当我手工做这件事时,它只需要一分钟左右,有时甚至更少,所以我确定这是我的变量/ json的问题,但我不能为我的生活看到问题所在。
任何人都可以快速查看并指出我正确的方向吗?
答案 0 :(得分:3)
您的while($catURL)
循环存在问题。你想让我做什么 ?
此外,您可以使用flush()
命令强制在浏览器上显示信息。