Question

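I have two files: file a is about 5 MB and file b is about 66 MB. I need to find out whether any of the lines in file a occur in file b, and if so, write them to file c.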

This is how I'm currently handling it:

ini_set("memory_limit","1000M");
set_time_limit(0);
$small_list=file("a.csv");
$big_list=file_get_contents("b.csv");
$new_list="c.csv";
$fh = fopen($new_list, 'a');
foreach($small_list as $one_line)
{
    if(stristr($big_list, $one_line) != FALSE)
    {
        fwrite($fh, $one_line);
        echo "record found: " . $one_line ."<br>";
    }
}

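The problem is that it has been running (successfully) for over an hour and has gotten through maybe 3,000 of the 160,000 lines in the smaller file. Any ideas?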

Answer 1

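Try sorting the files first (especially the big one). Then you only need to check the first few characters of each line in b, and stop (moving on to the next line in a) once you have gone past that prefix. You could even build an index of where in the file each character first appears as the leading character (a starts at line 0, b starts at line 1337, c starts at line 13986, and so on).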
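One way to picture the sort-first idea, as a minimal sketch. It assumes exact, case-sensitive line matches (unlike the stristr substring search in the question) and swaps the prefix scan for a plain merge-style walk over both sorted lists. The function name intersectSorted is made up for illustration.

```php
<?php
// Sketch only: exact-line matching via sort plus a merge-style walk.
// intersectSorted() is a hypothetical helper, not part of the answer above.
function intersectSorted(array $a, array $b): array
{
    // Sort both lists as plain strings so strcmp() agrees with the sort order.
    sort($a, SORT_STRING);
    sort($b, SORT_STRING);

    $matches = [];
    $i = 0;
    $j = 0;
    $countA = count($a);
    $countB = count($b);

    while ($i < $countA && $j < $countB) {
        $cmp = strcmp($a[$i], $b[$j]);
        if ($cmp === 0) {
            // Line is in both lists: record it once, then skip its duplicates.
            $line = $a[$i];
            $matches[] = $line;
            while ($i < $countA && $a[$i] === $line) {
                $i++;
            }
            while ($j < $countB && $b[$j] === $line) {
                $j++;
            }
        } elseif ($cmp < 0) {
            $i++; // a's line sorts first, so it cannot appear later in b
        } else {
            $j++; // b's line sorts first, so advance b
        }
    }

    return $matches;
}
```

With the files from the question, the inputs could be loaded with file('a.csv', FILE_IGNORE_NEW_LINES) and file('b.csv', FILE_IGNORE_NEW_LINES). That keeps both files in memory, so it trades memory for a single pass after sorting.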

Answer 2

Try using ob_flush() and flush() inside the loop.

foreach($small_list as $one_line)
{
    if(stristr($big_list, $one_line) != FALSE)
    {
        fwrite($fh, $one_line);
        echo "record found: " . $one_line ."<br>";
    }
    @ob_flush();
    @flush();
    @ob_end_flush();
}

Answer 3

Build arrays that use hashes as the index:

Read file a.csv line by line and store a_hash[md5($line)] = array($offset, $length)
Read file b.csv line by line and store b_hash[md5($line)] = true

Because the hash is the index, duplicate entries are eliminated automatically.

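Then, for every hash that is an index in both a_hash and b_hash, read the file contents (using the offset and length stored in a_hash) to pull out the actual line text. If you are paranoid about hash collisions, store the offset/length for b_hash as well and verify with stristr.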

This will run much faster and use far, far, FAR less memory.
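The two-index approach above could look roughly like this. It is a sketch under stated assumptions: lines are compared exactly (byte for byte, ignoring the line ending), and findCommonLines is a hypothetical helper name chosen for illustration. The collision check mentioned above is left out.

```php
<?php
// Sketch only: md5 index with offsets into a, plus a presence-only index for b.
// findCommonLines() is a hypothetical helper, not from the answer above.
function findCommonLines(string $pathA, string $pathB): array
{
    $a = fopen($pathA, 'rb');
    $b = fopen($pathB, 'rb');
    if ($a === false || $b === false) {
        throw new RuntimeException('Could not open input files');
    }

    // Pass 1: a_hash maps md5(line) => [offset, length] of that line in a.
    $aHash = [];
    while (true) {
        $offset = ftell($a);
        $line = fgets($a);
        if ($line === false) {
            break; // end of file
        }
        $aHash[md5(rtrim($line, "\r\n"))] = [$offset, strlen($line)];
    }

    // Pass 2: b_hash maps md5(line) => true.
    $bHash = [];
    while (($line = fgets($b)) !== false) {
        $bHash[md5(rtrim($line, "\r\n"))] = true;
    }
    fclose($b);

    // Pass 3: for hashes present in both, seek back into a for the real text.
    $common = [];
    foreach (array_intersect_key($aHash, $bHash) as [$offset, $length]) {
        fseek($a, $offset);
        $common[] = rtrim(fread($a, $length), "\r\n");
    }
    fclose($a);

    return $common;
}
```

The result could then be written out with something like file_put_contents('c.csv', implode("\n", findCommonLines('a.csv', 'b.csv')) . "\n"). Only hashes and offsets are held in memory, not the line text itself.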

If you want to reduce the memory requirements further and don't mind checking for duplicates, then:

Read file a.csv line by line and store a_hash[md5($line)] = false
Read file b.csv line by line, hash each line, and check whether it exists in a_hash
If a_hash[md5($line)] == false, write the line to c.csv and set a_hash[md5($line)] = true

Some sample code for the second suggestion:

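$a_file = fopen('a.csv','r');
$b_file = fopen('b.csv','r');
$c_file = fopen('c.csv','w+');

if(!$a_file || !$b_file || !$c_file) {
    echo "Broken!<br>";
    exit;
}

// Index every line of a.csv by its md5 hash; false means "not yet written to c.csv".
$a_hash = array();

// Loop on fgets() itself: with feof(), the final fgets() returns false and gets hashed too.
while(($line = fgets($a_file)) !== false) {
    // Strip the line ending so a last line without a trailing newline still matches.
    $a_hash[md5(rtrim($line, "\r\n"))] = false;
}
fclose($a_file);

// Stream b.csv and write each matching line to c.csv exactly once.
while(($line = fgets($b_file)) !== false) {
    $line = rtrim($line, "\r\n");
    $hash = md5($line);
    if(isset($a_hash[$hash]) && !$a_hash[$hash]) {
        echo 'record found: ' . $line . '<br>';
        fwrite($c_file, $line . "\n");
        $a_hash[$hash] = true;
    }
}

fclose($b_file);
fclose($c_file);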
