I am fetching data from an API with PHP cURL. The first URL returns 50 URLs; each of those 50 URLs returns 500 results per page, and each page links to the next one until there are no more results.
The code I am currently using takes more than 5 hours to finish, because the 5,000+ HTTP requests yield about 2 million records to insert into MySQL.
Currently I am using the ParallelCurl class from petewarden -> https://github.com/petewarden/ParallelCurl
Here is my complete code:
<?php
ini_set('memory_limit', '3000M');
ini_set('max_execution_time', 15000);

require_once('parallelcurl.php');

$host = "localhost";
$user = "user";
$pass = "pass";
$dbname = "db";

try {
    # MySQL with PDO_MYSQL; throw exceptions so insert errors are visible
    $DBH = new PDO("mysql:host=$host;dbname=$dbname", $user, $pass);
    $DBH->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
} catch (PDOException $e) {
    echo $e->getMessage();
    exit(1);
}

$table = "table";

function httpGet($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $output = curl_exec($ch);
    curl_close($ch);
    return $output;
}

// Build a "?,?,..." placeholder list for a multi-row prepared INSERT
function placeholders($text, $count = 0, $separator = ",")
{
    $result = array();
    if ($count > 0) {
        for ($x = 0; $x < $count; $x++) {
            $result[] = $text;
        }
    }
    return implode($separator, $result);
}

// This function gets called back for each request that completes
function on_request_done($content, $url, $ch)
{
    global $DBH, $parallel_curl, $table;

    $arr = json_decode($content, true);
    if (!isset($arr['productInfoList'])) {
        return; // malformed or empty response
    }

    $selarr = array(); // must be reset per response, not carried over
    foreach ($arr['productInfoList'] as $item) {
        $selarr[] = array(
            'title'     => $item['productBaseInfo'],
            'url'       => $item['productUrl'],
            'imgurl'    => $item['imageUrls']['275x275'],
            'price'     => $item['sellingPrice']['amount'],
            'productid' => $item['productId'],
        );
    }

    $datafields = array('title' => '', 'url' => '', 'imgurl' => '', 'price' => '', 'productid' => '');
    $insert_values = array();
    $question_marks = array(); // also reset per response
    foreach ($selarr as $d) {
        $question_marks[] = '(' . placeholders('?', sizeof($d)) . ')';
        $insert_values = array_merge($insert_values, array_values($d));
    }

    if (!empty($insert_values)) {
        $DBH->beginTransaction(); // also helps speed up the inserts
        $sql = "INSERT INTO $table (" . implode(",", array_keys($datafields)) . ") VALUES " . implode(',', $question_marks);
        $stmt = $DBH->prepare($sql);
        try {
            $stmt->execute($insert_values);
        } catch (PDOException $e) {
            echo $e->getMessage();
        }
        $DBH->commit();
    }

    // Queue the next page from inside the callback, instead of busy-waiting
    // on a global $nextUrl in the main loop
    if (!empty($arr['nextUrl'])) {
        $parallel_curl->startRequest($arr['nextUrl'], 'on_request_done');
    }
}

$html = httpGet("https://url1.com");
$arr = json_decode($html, true);

$curl_options = array(
    CURLOPT_RETURNTRANSFER => true,
);
$max_requests = 20;
$parallel_curl = new ParallelCurl($max_requests, $curl_options);

foreach ($arr['apiGroups'] as $key => $category) {
    $getUrl = $category['availableVariants']['get'];
    $parallel_curl->startRequest($getUrl, 'on_request_done');
}

$parallel_curl->finishAllRequests();
?>
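For reference, the batched insert in the code builds one `(?,?,...)` group per record and flattens all values into a single bound-parameter list. A standalone sketch of just that SQL-building step (no database needed; the sample rows are made up for illustration):

```php
<?php
// Standalone sketch of the multi-row INSERT construction: one placeholder
// group per record, all values flattened into one array for execute().

function placeholders($text, $count = 0, $separator = ",")
{
    return implode($separator, array_fill(0, max(0, $count), $text));
}

$table = "table";
$rows = array( // illustrative sample records
    array('title' => 'A', 'url' => 'u1', 'imgurl' => 'i1', 'price' => 10, 'productid' => 'p1'),
    array('title' => 'B', 'url' => 'u2', 'imgurl' => 'i2', 'price' => 20, 'productid' => 'p2'),
);

$question_marks = array();
$insert_values  = array();
foreach ($rows as $d) {
    $question_marks[] = '(' . placeholders('?', count($d)) . ')';
    $insert_values = array_merge($insert_values, array_values($d));
}

$sql = "INSERT INTO $table (title,url,imgurl,price,productid) VALUES "
     . implode(',', $question_marks);

echo $sql, "\n";              // one (?,?,?,?,?) group per row
echo count($insert_values);   // 10 bound values for 2 rows of 5 columns
```

Inserting many rows per statement like this is far faster than one `INSERT` per record, which matters when 2 million records are involved.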
Answer 0 (score: 0)
Spawn multiple instances of the same PHP script (run from the CLI) and use a centralized queue to hand out jobs, so the workers never overlap on URLs...
I usually use Redis as my queue; you could even consider a ready-made library like this.
It should be easy to implement...
// Pseudocode for each worker job: claim a batch, process it,
// then acknowledge it or requeue it on failure
$batch = $queue->getFirstAvailableBatchOfUrls();

// ... fetch the batch's URLs and insert the results here ...

// Put the batch back in the queue if errors occur
if ($status == PROCESS_DONE_OK) {
    $batch->markAsDone();      // remove the batch from the queue
} else {
    $batch->putAgainInQueue(); // requeue so another worker retries it
}
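The `$queue`/`$batch` API above is pseudocode, not a real library. A minimal in-memory sketch of the same claim/ack/requeue pattern (class and method names are illustrative; a production version would keep the batches in Redis so several CLI worker processes can share one queue):

```php
<?php
// Minimal in-memory sketch of the claim / ack / requeue queue pattern.
// Illustrative only: in production, back the pending/claimed sets with
// Redis so multiple worker processes can pull from the same queue.

class BatchQueue
{
    private $pending = array();   // batches waiting to be processed
    private $claimed = array();   // batches handed out but not yet acked

    public function push(array $urls)
    {
        $this->pending[] = $urls;
    }

    // Claim the first available batch, or null when the queue is drained
    public function getFirstAvailableBatchOfUrls()
    {
        $batch = array_shift($this->pending);
        if ($batch !== null) {
            $this->claimed[] = $batch;
        }
        return $batch;
    }

    public function markAsDone(array $batch)
    {
        $this->claimed = array_values(array_filter(
            $this->claimed,
            function ($b) use ($batch) { return $b !== $batch; }
        ));
    }

    // On failure, put the batch back so another worker can retry it
    public function putAgainInQueue(array $batch)
    {
        $this->markAsDone($batch);
        $this->pending[] = $batch;
    }

    public function pendingCount()
    {
        return count($this->pending);
    }
}

$queue = new BatchQueue();
$queue->push(array('https://url1.com/page1', 'https://url1.com/page2'));
$queue->push(array('https://url2.com/page1'));

$batch = $queue->getFirstAvailableBatchOfUrls(); // claim the first batch
$queue->putAgainInQueue($batch);                 // simulate a failed job
echo $queue->pendingCount(), "\n";               // 2: the retry plus the untouched batch
```

With this shape, each CLI worker loops on `getFirstAvailableBatchOfUrls()` until it returns null, so the 5,000+ requests get split across however many processes you launch.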