Question

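I am downloading files that are about 100 MB each, compressed as .gz. When I decompress them, they are about 350 MB.

In these files, each line starts with a language identifier, e.g. en for English and de for German.

Next to it are some numbers that I need to save. An example line from such a file looks like this: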

de Gemeinschaft_nicht-anerkannter_Staaten 1 10446

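As I said, the files are very large, and I have many of them (about 1 TB in total, roughly 300 MB each).

Right now I am doing the following:

Since the files contain other languages too, I first extract only the lines that start with de.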

preg_match_all('#^(de\s+.*)#m', file_get_contents('tmpfile.txt'), $matches);

Then I get the name, which is the second column of the line. In our example that is

Gemeinschaft_nicht-anerkannter_Staaten.
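A minimal sketch of the filter-and-split step described above, handling one line at a time instead of running a regex over the whole file. The function name `parseGermanLine` is hypothetical, and treating the third column as the view count follows the code in this question rather than any documented format.

```php
<?php
// Hypothetical helper: returns [articleName, views] for a line that
// starts with "de ", or null for any other line.
function parseGermanLine(string $line): ?array
{
    // Cheap prefix check first, so non-German lines are skipped
    // without splitting them.
    if (strncmp($line, 'de ', 3) !== 0) {
        return null;
    }

    // Columns are separated by single spaces: language, name, numbers.
    $parts = explode(' ', rtrim($line, "\r\n"));
    if (count($parts) < 3) {
        return null; // malformed line
    }

    // Assumption: column 3 is the view count, as in the question's code.
    return [urldecode($parts[1]), (int) $parts[2]];
}
```

Calling it with the example line `de Gemeinschaft_nicht-anerkannter_Staaten 1 10446` gives the name and the count 1. Lines for other languages return null.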

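Next, I check whether a database entry already exists for that name and the current date. If not, I save a new one to the database; otherwise I "update" it by increasing a specific number. However, I started this at 11 am this morning and there are "only" 21,572 entries so far, and it is far from finished. Getting all the data could take weeks. Is there a faster way than what I am doing? This is the code I use.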
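One possible way to cut down on database round trips for the check-then-save step is to sum the counts in memory first and then write each name once, using a single upsert statement inside a transaction. This is only a sketch under loud assumptions: it assumes MySQL accessed through PDO, and the table `page_statistic`, its column names, and the unique key on (name, language, date) are all hypothetical, since the question does not show the schema behind `Pagestatistic`.

```php
<?php
// Sum view counts per article so each name is written only once.
// $rows is a list of [articleName, views] pairs (hypothetical shape).
function aggregateViews(iterable $rows): array
{
    $totals = [];
    foreach ($rows as [$name, $views]) {
        $totals[$name] = ($totals[$name] ?? 0) + (int) $views;
    }
    return $totals;
}

// Assumption: MySQL and a table `page_statistic` with a UNIQUE key on
// (article_name, article_language, page_views_date). None of these
// names come from the question; adjust them to the real schema.
function upsertViews(PDO $pdo, array $totals, string $date): void
{
    // Insert a new row, or add to the existing counter if the key exists.
    $stmt = $pdo->prepare(
        'INSERT INTO page_statistic
             (article_name, article_language, page_views_date, article_page_views)
         VALUES (?, ?, ?, ?)
         ON DUPLICATE KEY UPDATE article_page_views = article_page_views + ?'
    );

    // One transaction per batch instead of one implicit commit per row.
    $pdo->beginTransaction();
    try {
        foreach ($totals as $name => $views) {
            // Numeric names become integer array keys, so cast back to string.
            $stmt->execute([(string) $name, 'de', $date, $views, $views]);
        }
        $pdo->commit();
    } catch (Throwable $e) {
        $pdo->rollBack();
        throw $e;
    }
}
```

With this shape, the separate lookup per line goes away and the database resolves "insert or increment" itself. Whether that is fast enough depends on the indexes and the storage engine, which the question does not state.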

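Note that, as I said, I download one file per hour starting from January 1. That means 24 files per day, each 100 MB (300 MB decompressed), for about 120 days (4 months), which makes 2,880 files that I have to go through.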

<?php
// some requires

// some settings
ignore_user_abort(true);
set_time_limit(0);
ini_set('memory_limit',-1);

// some information about the url and file type
$baseUrl        = 'http://dumps.wikimedia.org/other/pagecounts-raw/';
$baseName       = 'pagecounts-';
$fileName       = '.gz';

// i want to get a file for every hour since the 1st May
$begin      = new DateTime('2014-05-01 00:00:00');
$end        = new DateTime(date('Y-m-d H:i:s'));
$interval = new DateInterval('PT1H');
$dateRange = new DatePeriod($begin, $interval, $end);

// iterating through every hour
foreach ($dateRange as $date){
    /**
     * @var DateTime $date
     */

     // building the download url
    $url            = $baseUrl . $date->format('Y') . '/' . $date->format('Y-m') . '/' . $baseName . $date->format('Ymd-H0000') . $fileName;
    print 'Now doing file: ' . $url . '<br>';
    // downloading the file
    file_put_contents("tmpfile.gz", fopen($url , 'r'));
    // unzipping
    unzipfile('tmpfile.gz');
    // getting only german articles
    preg_match_all('#^(de\s+.*)#m', file_get_contents('tmpfile.txt'), $matches);
    $matches = $matches[0];
    foreach ($matches as $match){
        // getting the information
        $info = explode(' ', $match);

        //check if already exists in database
        $pageStatistic  = new Pagestatistic();
        $state          = $pageStatistic->loadFrom([
            'articleName'       => urldecode($info[1]),
            'articleLanguage'   => 'de',
            'pageViewsDate'     => $date->format('Y-m-d')
        ]);

        if ($state){
            $pageStatistic->articlePageViews = $pageStatistic->articlePageViews + $info[2];
        } else {
            $pageStatistic->articleName         = urldecode($info[1]);
            $pageStatistic->articleLanguage     = 'de';
            $pageStatistic->articlePageViews    = $info[2];
            $pageStatistic->pageViewsDate       = $date->format('Y-m-d');
        }
        // save/update
        $pageStatistic->save();
    }
}

//method to unzip files
function unzipfile($fileName){
    $buffer_size = 4096;
    $out_file_name = str_replace('.gz', '.txt', $fileName);
    $file = gzopen($fileName, 'rb');
    $out_file = fopen($out_file_name, 'wb');
    while(!gzeof($file)) {
        fwrite($out_file, gzread($file, $buffer_size));
    }
    fclose($out_file);
    gzclose($file);
}
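
As an alternative to unzipping to tmpfile.txt and loading it whole with file_get_contents, the compressed file can be read line by line. The sketch below uses the zlib functions `gzopen` and `gzgets`, which ship with PHP's zlib extension; the generator name `readGzLines` is hypothetical.

```php
<?php
// Hypothetical helper: yields the lines of a .gz file one at a time,
// so only the current line is held in memory.
function readGzLines(string $path): Generator
{
    $handle = gzopen($path, 'rb');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }

    try {
        // gzgets returns false at end of file.
        while (($line = gzgets($handle)) !== false) {
            yield rtrim($line, "\r\n");
        }
    } finally {
        // Runs when the loop finishes or the generator is destroyed.
        gzclose($handle);
    }
}
```

It could be used as `foreach (readGzLines('tmpfile.gz') as $line) { ... }`, with a check such as `strncmp($line, 'de ', 3) === 0` inside the loop to skip other languages early.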

Fastest way to get data from a large text file

0 answers: