Using multi curl to scrape around 300,000 URLs stops XAMPP from responding (PHP & XML)

Asked: 2014-07-02 19:06:33

Tags: php xml curl

I have an XML document that contains over 300,000 URLs (locs).

fulltest.xml (up to 300,000 locs) - reduced to a single example:

    <?xml version="1.0" encoding="utf-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
    <url>
        <loc>http://url.com/122122-rob-jones?</loc>
        <lastmod>2014-05-05T07:12:41+08:00</lastmod>
        <changefreq>monthly</changefreq>
        <priority>0.9</priority>
    </url>
    </urlset>

Using these 300,000 URLs, I am trying to scrape data from each of them with multi_curl.

index.php (collects the URLs from the XML document, then scrapes data from them with multi curl):

<?php
ini_set('memory_limit', '-1');
include 'config.php';
include 'SimpleLargeXMLParser.class.php';

$xml = dirname(__FILE__)."/fulltest.xml"; // 26969 URLS
$parser = new SimpleLargeXMLParser();
$parser->loadXML($xml);
$parser->registerNamespace("urlset", "http://www.sitemaps.org/schemas/sitemap/0.9");
$array = $parser->parseXML("//urlset:url/urlset:loc");

$node_count = count($array);
$curl_arr = array();
$master = curl_multi_init();
// total: 26969
for($i = 0; $i < $node_count; $i++)
{
    $url = $array[$i];

    $curl_arr[$i] = curl_init($url);
    curl_setopt($curl_arr[$i], CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($curl_arr[$i], CURLOPT_HEADER, 0);
    curl_setopt($curl_arr[$i], CURLOPT_CONNECTTIMEOUT, 120);
    curl_multi_add_handle($master, $curl_arr[$i]);

}

do {
    curl_multi_exec($master, $running);
} while($running > 0);

for($i = 0; $i < $node_count; $i++)
{

    $results = curl_multi_getcontent($curl_arr[$i]);

    // Player ID

    $playeridTAG = '/<input type="checkbox" id="player-(.+?)" name="player" value="(.+?)" class="player-check" \/>/';
    preg_match($playeridTAG, $results, $playerID);

    // End Player ID

    // more values to be added once working.


    $query = $db->query('SELECT * FROM playerblank WHERE playerID = '.$playerID[1].'');
    if ($query->num_rows == 0) {
        $db->query('INSERT INTO playerblank SET playerID = '.$playerID[1].'') or die(mysqli_error($db));
    }
}
?>

This script works if I limit the URLs to around 1,000, so what is the best way to do what I'm attempting with this many URLs without XAMPP becoming unresponsive?

I have already changed memory_limit in php.ini to -1.

1 Answer:

Answer 0 (score: 2):

You can group your requests into chunks of 1,000 URLs using array_chunk:

...
$node_count = count($array);
$urls = array();
for($i = 0; $i < $node_count; $i++)
{
    $urls[] = $array[$i];
}

$urlChunks = array_chunk($urls, 1000);

foreach ($urlChunks as $urlChunk) {
    $curl_arr = array();
    $master = curl_multi_init();

    $chunkSize = sizeof($urlChunk);

    for($i = 0; $i < $chunkSize; $i++) {
        $curl_arr[$i] = curl_init($urlChunk[$i]);
        curl_setopt($curl_arr[$i], CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($curl_arr[$i], CURLOPT_HEADER, 0);
        curl_setopt($curl_arr[$i], CURLOPT_CONNECTTIMEOUT, 120);
        curl_multi_add_handle($master, $curl_arr[$i]);
    }

    do {
        curl_multi_exec($master, $running);
    } while($running > 0);

    for($i = 0; $i < $chunkSize; $i++) {
        $results = curl_multi_getcontent($curl_arr[$i]);

        // ...

        // Release each handle so its memory is freed before the next chunk.
        curl_multi_remove_handle($master, $curl_arr[$i]);
        curl_close($curl_arr[$i]);
    }

    curl_multi_close($master);
}
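One further refinement, offered as a minimal sketch rather than as part of the answer above: the `do { curl_multi_exec(...); } while ($running > 0);` loop polls in a tight loop and keeps a CPU core pinned while transfers are in flight, whereas curl_multi_select() lets the script sleep until a socket is ready. In the sketch, fetchChunk() and processResult() are hypothetical names, $array holds the URLs collected by the parser as in the question, and the 1,000-URL chunk size is carried over from the answer.

<?php
// Minimal sketch: download the URLs in chunks, sleeping on curl_multi_select()
// instead of busy-polling, and releasing every handle between chunks.
// fetchChunk() and processResult() are hypothetical names, not from the answer.
function fetchChunk(array $urls)
{
    $master  = curl_multi_init();
    $handles = array();

    foreach ($urls as $i => $url) {
        $handles[$i] = curl_init($url);
        curl_setopt($handles[$i], CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($handles[$i], CURLOPT_HEADER, 0);
        curl_setopt($handles[$i], CURLOPT_CONNECTTIMEOUT, 120);
        curl_setopt($handles[$i], CURLOPT_TIMEOUT, 300); // cap the whole transfer
        curl_multi_add_handle($master, $handles[$i]);
    }

    // Drive the transfers; curl_multi_select() blocks (here up to 1 second)
    // until there is socket activity, so the loop does not spin at 100% CPU.
    do {
        curl_multi_exec($master, $running);
        if ($running > 0) {
            curl_multi_select($master, 1.0);
        }
    } while ($running > 0);

    $results = array();
    foreach ($handles as $i => $handle) {
        $results[$i] = curl_multi_getcontent($handle);
        curl_multi_remove_handle($master, $handle); // free memory between chunks
        curl_close($handle);
    }
    curl_multi_close($master);

    return $results;
}

// $array holds the <loc> URLs collected by SimpleLargeXMLParser, as in the question.
foreach (array_chunk($array, 1000) as $chunk) {
    foreach (fetchChunk($chunk) as $html) {
        // processResult($html); // hypothetical: the regex match + DB insert from the question
    }
}

Note that on some older PHP/libcurl combinations curl_multi_select() can return -1 immediately instead of blocking; adding a short usleep() in that case keeps the loop from spinning.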