Question

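I have no experience dealing with large files, so I am not sure what to do about this. I have attempted to read several large files using file_get_contents; the task is to clean and munge them using preg_replace().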

My code runs fine on small files; however, the large files (40 MB) trigger a memory exhausted error:

PHP Fatal error:  Allowed memory size of 16777216 bytes exhausted (tried to allocate 41390283 bytes)

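I was thinking of using fread() instead, but I am not sure that will work either. Is there a workaround for this problem?

Thanks for your input.

This is my code: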

<?php
error_reporting(E_ALL);

##get find() results and remove DOS carriage returns.
##The error is thrown on the next line for large files!
$myData = file_get_contents("tmp11");
$newData = str_replace("^M", "", $myData);

##cleanup Model-Manufacturer field.
$pattern = '/(Model-Manufacturer:)(\n)(\w+)/i';
$replacement = '$1$3';
$newData = preg_replace($pattern, $replacement, $newData);

##cleanup Test_Version field and create comma delimited layout.
$pattern = '/(Test_Version=)(\d).(\d).(\d)(\n+)/';
$replacement = '$1$2.$3.$4      ';
$newData = preg_replace($pattern, $replacement, $newData);

##cleanup occasional empty Model-Manufacturer field.
$pattern = '/(Test_Version=)(\d).(\d).(\d)      (Test_Version=)/';
$replacement = '$1$2.$3.$4      Model-Manufacturer:N/A--$5';
$newData = preg_replace($pattern, $replacement, $newData);

##fix occasional Model-Manufacturer being incorrectly wrapped.
$newData = str_replace("--","\n",$newData);

##fix 'Binary file' message when find() utility cannot id file.
$pattern = '/(Binary file).*/';
$replacement = '';
$newData = preg_replace($pattern, $replacement, $newData);
$newData = removeEmptyLines($newData);

##replace colon with equal sign
$newData = str_replace("Model-Manufacturer:","Model-Manufacturer=",$newData);

##file stuff
$fh2 = fopen("tmp2","w");
fwrite($fh2, $newData);
fclose($fh2);

### Functions.

##Data cleanup
function removeEmptyLines($string)
{
        return preg_replace("/(^[\r\n]*|[\r\n]+)[\s\t]*[\r\n]+/", "\n", $string);
}
?>

Answer 1

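Firstly, you should understand that when you use file_get_contents you are fetching the entire file as one string into a variable, and that variable is stored in the host's memory.

If that string is larger than the memory limit set for the PHP process, PHP halts and displays the error message above.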
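To make the limit concrete, here is a rough sketch that compares a file's size with the configured memory_limit before loading it whole. The helper names ini_size_to_bytes and fits_in_memory are hypothetical, and the check is only an estimate: it assumes the file needs at least its own size in free memory and ignores any copies made later by str_replace or preg_replace.

```php
<?php
// Hypothetical helper: converts a php.ini shorthand size ("16M", "1G", "-1") to bytes.
function ini_size_to_bytes($value)
{
    $value = trim((string) $value);
    $number = (int) $value; // leading digits only; "16M" becomes 16
    switch (strtolower(substr($value, -1))) {
        case 'g':
            $number *= 1024; // fall through
        case 'm':
            $number *= 1024; // fall through
        case 'k':
            $number *= 1024;
    }
    return $number;
}

// Hypothetical check: is there roughly enough headroom to load $file whole?
function fits_in_memory($file)
{
    $limit = ini_size_to_bytes(ini_get('memory_limit'));
    if ($limit < 0) {
        return true; // -1 means no limit
    }
    $size = filesize($file);
    return $size !== false && $size < $limit - memory_get_usage();
}
```

With the numbers from the question, a limit of "16M" is 16777216 bytes, so a file of about 41 MB cannot fit no matter how the rest of the script is written.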

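The way around this is to open the file as a pointer and then take one chunk at a time. That way, if you have a 500 MB file, you can read the first 1 MB of data, do what you want with it, release that 1 MB from memory and replace it with the next 1 MB. This lets you control how much data you put into memory.

An example of this can be seen below; I will create a function that acts like node.js: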

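function file_get_contents_chunked($file,$chunk_size,$callback)
{
    // fopen() returns false (with a warning) instead of throwing,
    // so check it explicitly before entering the read loop.
    $handle = fopen($file, "r");
    if ($handle === false)
    {
        trigger_error("file_get_contents_chunked::could not open " . $file,E_USER_NOTICE);
        return false;
    }

    try
    {
        $i = 0;
        while (!feof($handle))
        {
            // Pass the chunk, the file handle and the iteration count to the callback.
            call_user_func_array($callback,array(fread($handle,$chunk_size),&$handle,$i));
            $i++;
        }
    }
    catch(Exception $e)
    {
         trigger_error("file_get_contents_chunked::" . $e->getMessage(),E_USER_NOTICE);
         return false;
    }
    finally
    {
        // Always release the handle, even if the callback threw.
        fclose($handle);
    }

    return true;
}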

Then use it like so:

$success = file_get_contents_chunked("my/large/file",4096,function($chunk,&$handle,$iteration){
    /*
        * Do what you will with the {$chunk} here
        * {$handle} is passed in case you want to seek
        ** to different parts of the file
        * {$iteration} is the section of the file that has been read, so
        * ($iteration * 4096) is your current offset within the file.
    */

});

if(!$success)
{
    //It Failed
}

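One of the problems you will find is that you are running regular expressions several times over an extremely large chunk of data. Not only that, but your regexes are built to match against the entire file.

With the above method your regexes could become useless, because a chunk may contain only half of a record. What you should do instead is fall back to the native string functions such as

strpos
substr
trim
explode

to match the strings. I have added support in the callback for passing the handle and the current iteration, which lets you work with the file directly inside the callback, using functions like fseek, ftruncate and fwrite, for instance.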
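One possible way to deal with the "half a record" problem is to carry the incomplete tail of each chunk over into the next one, so that only complete lines are ever processed. The sketch below assumes line-oriented input; process_lines_chunked and the $on_line callback are hypothetical names, not part of the function above.

```php
<?php
// Hypothetical helper: reads $file in fixed-size chunks but only hands
// complete lines to $on_line, so a line is never split across chunks.
function process_lines_chunked($file, $chunk_size, callable $on_line)
{
    $handle = fopen($file, "r");
    if ($handle === false) {
        return false;
    }

    $carry = ""; // incomplete tail of the previous chunk
    while (!feof($handle)) {
        $chunk = fread($handle, $chunk_size);
        if ($chunk === false) {
            break;
        }
        $buffer = $carry . $chunk;

        // Everything after the last newline is incomplete; keep it for later.
        $last_newline = strrpos($buffer, "\n");
        if ($last_newline === false) {
            $carry = $buffer;
            continue;
        }
        $carry = (string) substr($buffer, $last_newline + 1);

        foreach (explode("\n", substr($buffer, 0, $last_newline)) as $line) {
            $on_line(rtrim($line, "\r")); // also drops DOS carriage returns
        }
    }
    fclose($handle);

    // Flush a final line that has no trailing newline.
    if ($carry !== "") {
        $on_line(rtrim($carry, "\r"));
    }
    return true;
}
```

Memory use stays near the chunk size plus the longest line. Patterns from the question that span two lines (such as Model-Manufacturer: followed by a newline) would still need the callback to remember the previous line.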

The way you are building your string manipulation is not efficient at all, and the method proposed above is a much better approach.

Hope this helps.

Answer 2

A pretty ugly solution that adjusts your memory limit depending on the file size:

$filename = "yourfile.txt";
ini_set ('memory_limit', filesize ($filename) + 4000000);
$contents = file_get_contents ($filename);

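The right solution would be to consider whether you can process the file in smaller chunks, or to use command line tools from PHP.

If your file is line-based, you can also use fgets to process it line by line.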
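As an illustration of the fgets approach, here is a minimal sketch that copies a file line by line, stripping carriage returns and dropping empty lines. The function name clean_file_line_by_line is hypothetical, and the cleanup rules are only a stand-in for whatever processing you actually need.

```php
<?php
// Hypothetical helper: copies $in_path to $out_path one line at a time,
// stripping carriage returns and dropping blank lines along the way.
function clean_file_line_by_line($in_path, $out_path)
{
    $in = fopen($in_path, "r");
    if ($in === false) {
        return false;
    }
    $out = fopen($out_path, "w");
    if ($out === false) {
        fclose($in);
        return false;
    }

    // fgets() returns one line (including its newline) or false at end of file.
    while (($line = fgets($in)) !== false) {
        $line = rtrim($line, "\r\n");
        if (trim($line) === "") {
            continue; // skip empty or whitespace-only lines
        }
        fwrite($out, $line . "\n");
    }

    fclose($in);
    fclose($out);
    return true;
}
```

Only one line is held in memory at a time, so the file size no longer matters. Replacements that span a line break would need the loop to keep the previous line around before writing it out.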

Answer 3

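To process only n rows at a time (n = 1000 here), we can use generators in PHP.

This is how it works: read n rows, process them, come back at row n + 1, read the next n rows, process them, come back and read the next n rows, and so on.

Here is the code for doing so.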

<?php
class readLargeCSV{

    public function __construct($filename, $delimiter = "\t"){
        $this->file = fopen($filename, 'r');
        $this->delimiter = $delimiter;
        $this->iterator = 0;
        $this->header = null;
    }

    public function csvToArray()
    {
        $data = array();
        while (($row = fgetcsv($this->file, 1000, $this->delimiter)) !== false)
        {
            $is_mul_1000 = false;
            if(!$this->header){
                $this->header = $row;
            }
            else{
                $this->iterator++;
                $data[] = array_combine($this->header, $row);
                if($this->iterator != 0 && $this->iterator % 1000 == 0){
                    $is_mul_1000 = true;
                    $chunk = $data;
                    $data = array();
                    yield $chunk;
                }
            }
        }
        fclose($this->file);
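        // Yield the final partial batch, if any. Checking $data directly also
        // avoids an undefined-variable notice when the file has no rows at all.
        if(!empty($data)){
            yield $data;
        }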
        return;
    }
}

To read the file, you can use it like this.

    $file = database_path('path/to/csvfile/XYZ.csv');
    $csv_reader = new readLargeCSV($file, ",");


    foreach($csv_reader->csvToArray() as $data){
     // you can do whatever you want with the $data.
    }

Here $data contains 1000 entries from the CSV, or n % 1000 entries in the case of the last batch.

A detailed explanation of this can be found at https://medium.com/@aashish.gaba097/database-seeding-with-large-files-in-laravel-be5b2aceaa0b
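The same batching idea, stripped of the CSV handling, can be sketched as a plain generator function. This is an illustration under assumptions rather than the code from the linked article: read_in_batches is a hypothetical name, and it batches raw lines instead of parsed rows.

```php
<?php
// Hypothetical generator: yields arrays of at most $n lines from $path,
// so only one batch is held in memory at any time.
function read_in_batches($path, $n)
{
    $handle = fopen($path, 'r');
    if ($handle === false) {
        return; // nothing to yield if the file cannot be opened
    }

    try {
        $batch = array();
        while (($line = fgets($handle)) !== false) {
            $batch[] = rtrim($line, "\r\n");
            if (count($batch) === $n) {
                yield $batch;
                $batch = array();
            }
        }
        // Yield the final partial batch, if any.
        if (!empty($batch)) {
            yield $batch;
        }
    } finally {
        fclose($handle);
    }
}
```

It is consumed with foreach, for example foreach (read_in_batches("tmp11", 1000) as $lines) { ... }, where "tmp11" is the input file name from the question.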

Answer 4

My advice would be to use fread. It may be a little slower, but you won't have to use all your memory. For instance:

// This uses filesize($oldFile) bytes of memory
file_put_contents($newFile, file_get_contents($oldFile));
// And this uses only 8192 bytes at a time
$pNew = fopen($newFile, 'w');
$pOld = fopen($oldFile, 'r');
while (!feof($pOld)) {
    fwrite($pNew, fread($pOld, 8192));
}
fclose($pOld);
fclose($pNew);

file_get_contents => PHP Fatal error: Allowed memory exhausted