What is the fastest way to curl 5,000+ URLs and insert the results into a MySQL database with PHP?

Date: 2014-10-08 09:45:00

Tags: php mysql curl pdo

I am using PHP cURL to fetch data from an API.

The first URL returns 50 URLs, each of those 50 URLs returns 500 results, then a next URL, and so on until there are no more results.

The code I am currently using takes more than 5 hours to finish, because there are about 2 million records from 5,000+ HTTP requests to insert into MySQL.

At the moment I am using the ParallelCurl class from petewarden -> https://github.com/petewarden/ParallelCurl

Here is my full code:

<?php
ini_set('memory_limit', '3000M');
ini_set('max_execution_time', 15000);
require_once('parallelcurl.php');
$host = "localhost";
$user = "user";
$pass = "pass";
$dbname = "db";
try {
    # MySQL with PDO_MYSQL
    $DBH = new PDO("mysql:host=$host;dbname=$dbname", $user, $pass);
}
catch (PDOException $e) {
    echo $e->getMessage();
}

$table="table";
$nextUrl = 0;

function httpGet($url)
{
    $ch = curl_init();

    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $output = curl_exec($ch);

    curl_close($ch);
    return $output;
}

$html = httpGet("https://url1.com");
$arr = json_decode($html, true);
$curl_options = array(
    CURLOPT_RETURNTRANSFER => true,
);
$max_requests = 20;
$parallel_curl = new ParallelCurl($max_requests, $curl_options);
foreach ($arr['apiGroups'] as $key => $category) {

    $getUrl = $category['availableVariants']['get'];

    $parallel_curl->startRequest($getUrl, 'on_request_done');
    while ($nextUrl) {
        $parallel_curl->startRequest($nextUrl, 'on_request_done');
    }
}
function placeholders($text, $count = 0, $separator = ",") {
    $result = array();
    if ($count > 0) {
        for ($x = 0; $x < $count; $x++) {
            $result[] = $text;
        }
    }
    return implode($separator, $result);
}

// This function gets called back for each request that completes
function on_request_done($content, $url, $ch) {
    global $DBH, $parallel_curl, $table, $nextUrl;

    $arr = json_decode($content, true);

    $selarr = array();
    $j = 0;
    foreach ($arr['productInfoList'] as $key => $item) {

        $title = $item['productBaseInfo'];
        $url   = $item['productUrl'];
        $img   = $item['imageUrls']['275x275'];
        $price = $item['sellingPrice']['amount'];
        $pid   = $item['productId'];

        $selarr[$j] = array('title' => $title, 'url' => $url, 'imgurl' => $img, 'price' => $price, 'productid' => $pid);
        $j++;
    }

    $datafields = array('title' => '', 'url' => '', 'imgurl' => '', 'price' => '', 'productid' => '');

    $insert_values  = array();
    $question_marks = array();
    foreach ($selarr as $d) {
        $question_marks[] = '(' . placeholders('?', sizeof($d)) . ')';
        $insert_values = array_merge($insert_values, array_values($d));
    }

    $DBH->beginTransaction(); // also helps speed up your inserts
    $sql  = "INSERT INTO $table (" . implode(",", array_keys($datafields)) . ") VALUES " . implode(',', $question_marks);
    $stmt = $DBH->prepare($sql);
    try {
        $stmt->execute($insert_values);
    } catch (PDOException $e) {
        echo $e->getMessage();
    }
    $DBH->commit();

    if ($arr['nextUrl']) {
        $nextUrl = $arr['nextUrl'];
    } else {
        $nextUrl = 0;
    }
}

$parallel_curl->finishAllRequests();
?>

1 answer:

Answer 0: (score: 0)

Spawn multiple instances of the same PHP script (run from the CLI) and use a centralized queue to manage the jobs so the workers do not overlap on URLs...
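As a minimal sketch of the seeding step, the 50 category URLs from the question's first API call could be pushed into a shared Redis list once, before the workers start. This assumes the phpredis extension; the key name `url_queue` and the use of `file_get_contents` are illustrative choices, not part of the original code.

```php
<?php
// Seeder: run once to fill the shared queue with the initial URLs.
// Assumes the phpredis extension is installed; 'url_queue' is an
// arbitrary key name chosen for this example.
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

$html = file_get_contents("https://url1.com"); // the first API call from the question
$arr  = json_decode($html, true);

foreach ($arr['apiGroups'] as $category) {
    // each category URL becomes one independent job in the queue
    $redis->rPush('url_queue', $category['availableVariants']['get']);
}
```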

I usually use Redis as my queue; you could even consider using a ready-made library like this.

It should be easy to implement...

// pseudocode, in your jobs
$batch = $queue->getFirstAvailableBatchOfUrls();

// Do your stuff

// put the batch back in the queue if an error occurs...
if ($status == PROCESS_DONE_OK) {
    $batch->markAsDone();
} else {
    $batch->putAgainInQueue();
}