Question

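I am trying to compare 2 CSV files in PHP by importing them into multidimensional arrays and using the array_diff function to find the differences.

The approach I use is:

1) Fetch each record of the expected CSV and dump it into arr1

2) Fetch each record of the actual CSV and dump it into arr2

3) Sort arr1 using array_multisort

4) Sort arr2 using array_multisort

5) Compare each record using the array_diff function (e.g. arr1[0][1] vs arr2[0][1])

My goal is to compare the files in the least possible time using a PHP script. I found the above approach to be the fastest (I initially tried dumping the CSV contents into MySQL and comparing with database queries, but for some unknown reason the queries ran so slowly that my Apache server crashed after timing out).

My CSV files can be as large as 300 MB, but typically they have 70k records with 20 columns and are about 10 MB in size.

I am pasting the code I have written (w.r.t. the steps above):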

     $header='';

    $file_handle = fopen($fileExp, "r");
    $k=0;

    while ($data=fgetcsv($file_handle,0,$_POST['dl1'])) {

        if(count($data)==1 && $data[0]=='')
            continue;
        else
        {
            $urarr1[$k]='';
            for($i=0;$i<count($data);$i++)
            {



                if(in_array($i,$exclude_cols,true))
                    $rarr1[$k][$i]='NTBT';
                else
                    $rarr1[$k][$i]=trim($data[$i]);

            }   


            $k++;
        }




    }

    fclose($file_handle);



    echo '<br>Exp Record count: '.count($rarr1);
    $header.='<br>Exp Record count: '.count($rarr1);

    $hrow=$rarr1[0];   //fetch header row and then unset it
    unset($rarr1[0]);

    array_multisort($rarr1);   //need to sort on all 20 columns asc

    $rarr1=array_values($rarr1); //re-number the array



       //writing the sorted o/p to file...debugging purposes
    $fp = fopen($_POST['op'].'/file1.csv', 'w');

    foreach ($rarr1 as $fields) {
        fputcsv($fp, $fields);
    }

    fclose($fp);


     //Repeat for actual .csv

    $file_handle = fopen($fileAct, "r");
    $k=0;

    while ($data=fgetcsv($file_handle,0,$_POST['dl2'])) {

        if(count($data)==1 && $data[0]=='')
            continue;
        else
        {
            for($i=0;$i<count($data);$i++)
            {


                if(in_array($i,$exclude_cols,true))
                    $rarr2[$k][$i]='NTBT';
                else
                    $rarr2[$k][$i]=trim($data[$i]);
            }   

            $k++;

        }

    }

    fclose($file_handle);

    unset($file_handle);


    echo '<br>Act Record count: '.count($rarr2);
    $header.='<br>Act Record count: '.count($rarr2);

    unset($rarr2[0]);

    array_multisort($rarr2);

    $rarr2=array_values($rarr2);

    $fp = fopen($_POST['op'].'/file2.csv', 'w');

    foreach ($rarr2 as $fields) {
        fputcsv($fp, $fields);
    }

    fclose($fp);


       ///Comparison logic

    $header.= '<br>';

    $header.= '<table>';
    $header.= '<th>RECORD_ID</th>';
    for($i=0;$i<count($hrow);$i++)
    {
        $header.= '<th>'.$hrow[$i].'_EXP</th>';
        $header.= '<th>'.$hrow[$i].'_ACT</th>';
    }

    $r=array();
    for($i=0;$i<count($rarr1);$i++)
    {

        if(array_diff($rarr1[$i],$rarr2[$i]) || array_diff($rarr2[$i],$rarr1[$i]))
        {

            $r[$i]=array_unique(array_merge(array_keys(array_diff($rarr1[$i],$rarr2[$i])),array_keys(array_diff($rarr2[$i],$rarr1[$i]))));


            foreach($r[$i] as $key=>$v)
            {
                if(in_array($v,$calc_cols))
                {
                    if(abs($rarr1[$i][$v]-$rarr2[$i][$v])<0.2)
                    {
                        unset($r[$i][$key]);
                    }   
                }
                elseif(is_numeric($rarr1[$i][$v]) && is_numeric($rarr2[$i][$v]) && !in_array($v,$calc_cols) && ($rarr1[$i][$v]-$rarr2[$i][$v])==0)
                {
                    unset($r[$i][$key]);
                }   
            }



            if(empty($r[$i]))
                unset($r[$i]);

            if(isset($r[$i]))
            {
                $header.= '<tr>';

                $header.= '<td>'.$i.'</td>';

                for($j=0;$j<count($rarr1[$i]);$j++)
                {

                    if(in_array($j,$r[$i]))
                    {
                        $header.= '<td style="color:orange">'.$rarr1[$i][$j].'</td>';
                        $header.= '<td style="color:orange">'.$rarr2[$i][$j].'</td>';
                    }
                    else
                    {
                        $header.= '<td >'.$rarr1[$i][$j].'</td>';
                        $header.= '<td >'.$rarr2[$i][$j].'</td>';
                    }
                }
                $header.= '</tr>';
            }
        }   

    }   
    $header.= '</table>';



//print_r($r);
    echo '<br>';
    // if(!isset($r))
        // $r[0]=0;

    echo 'Differences :'.count($r)  ;

    $header.= '<br>';
    $header.= 'Differences :'.count($r) ;




    $time_end = microtime(true);
    $execution_time = ($time_end - $time_start)/60; //dividing with 60 will give the execution time in minutes other wise seconds
    echo '<br><b>Total Execution Time:</b> '.$execution_time.' Mins'; //execution time of the script

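Although at first I found that this worked for most files, I later found that for some files, for an unknown reason, array_multisort sorts arr1 and arr2 differently even though the contents look identical. I am not sure whether this happens because of a data type mismatch; I tried type casting, but it still sorts the seemingly identical arrays differently.

Can someone suggest what might be wrong with the code above? Also, given the requirements I mentioned, is there a more convenient way to achieve this in PHP? Perhaps a PHP plugin for comparing .csv files, or something similar?

Edit: sample data as requested. This is just a snapshot; the real files have many more columns and rows. As mentioned above, the .csv files are well over 10 MB! File 1 and File 2: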

        236|INPQR|31-AUG-12|200     |INR|       664|AAAAAA,PPPPP                                                                                                                                                                                           |0                                       |38972944.8                              |0                                       |0                                       |38972944.8
        236|INPQR|31-AUG-12|200     |INR|       6653|AAAAAA,PPPPP                                                                                                                                                                                           |0                                       |0                                       |0                                       |0                                       |0
        236|INPQR|31-AUG-12|200     |USD|       6655|AAAAAA,PPPPP                                                                                                                                                                                           |0                                       |0                                       |0                                       |0                                       |0
        236|INPQR|31-AUG-12|200     |USD|       664|AAAAAA,PPPPP                                                                                                                                                                                           |0                                       |63919609.97                             |0                                       |0                                       |63919609.97
        225|INPZQ|31-AUG-12|200     |USD|       6653|AAAAAA,PPPPP                                                                                                                                                                                           |0                                       |0                                       |0                                       |0                                       |0
        225|INPZQ|31-AUG-12|200     |USD|       6655|AAAAAA,PPPPP                                                                                                                                                                                           |0                                       |0                                       |0                                       |0                                       |0
        225|INPZQ|31-AUG-12|200     |USD|       6652|AAAAAA,PPPPP                                                                                                                                                                                           |0                                       |38972944.8                              |0                                       |0                                       |38972944.8
        225|INPZQ|31-AUG-12|200     |INR|       6652|AAAAAA,PPPPP                                                                                                                                                                                           |0                                       |63919609.97                             |0                                       |0                                       |63919609.97
        225|INPZQ|31-AUG-12|200     |INR|       6654|AAAAAA,PPPPP                                                                                                                                                                                           |0                                       |0                                       |0                                       |0                                       |0
        225|INPZQ|31-AUG-12|200     |INR|       6654|AAAAAA,PPPPP  



        236|INPQR|31-AUG-12|200     |USD|       664|AAAAAA,PPPPP                                                                                                                                                                                           |0                                       |63919609.97                             |0                                       |0                                       |63919609.97
        225|INPZQ|31-AUG-12|200     |USD|       6653|AAAAAA,PPPPP                                                                                                                                                                                           |0                                       |0                                       |0                                       |0                                       |0
        225|INPZQ|31-AUG-12|200     |USD|       6655|AAAAAA,PPPPP 
        236|INPQR|31-AUG-12|200     |INR|       664|AAAAAA,PPPPP                                                                                                                                                                                           |0                                       |38972944.8                              |0                                       |0                                       |38972944.8
        236|INPQT|31-AUG-12|200     |INR|       6653|AAAAAA,PPPPP                                                                                                                                                                                           |0                                       |0                                       |0                                       |0                                       |0
        236|INPQR|31-AUG-12|200     |USD|       6655|AAAAAA,PPPPP                                                                                                                                                                                           |0                                       |0                                       |0                                       |0                                       |0
        225|INPZQ|31-AUG-12|200     |USD|       6652|AAAAAA,PPPPP                                                                                                                                                                                           |0                                       |38972944.8                              |0                                       |0                                       |38972944.8
        225|INPZQ|31-AUG-12|200     |INR|       6652|AAAAAA,PPPPP                                                                                                                                                                                           |0                                       |63919609.97                             |0                                       |0                                       |63919609.97
        225|INPZQ|31-AUG-12|200     |USD|       6654|AAAAAA,PPPPP                                                                                                                                                                                           |0                                       |0                                       |0                                       |0                                       |0
        225|INPZQ|31-AUG-12|200     |INR|       6654|AAAAAA,PPPPP

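Update: the 2 CSV files may contain different date formats, and each may represent numbers in a different format. For example, 1.csv might have 12-jan-2013 and 0.01 in row 1, while 2.csv would have 01/12/2013 and .01. Hence I don't think hashing would work.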

Answer 1

Are you sure the 2 files are different? I would use md5_file and compare the MD5 hashes of the two files to check whether they differ in any way.
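A minimal sketch of that first check. The helper name `filesAreIdentical` and the throwaway demo files are assumptions for illustration; `filesize` and `md5_file` are standard PHP functions.

```php
<?php
// Hypothetical helper: returns true when both files contain exactly the same bytes.
function filesAreIdentical(string $pathA, string $pathB): bool
{
    // Different sizes already prove a difference, so hashing can be skipped.
    if (filesize($pathA) !== filesize($pathB)) {
        return false;
    }
    return md5_file($pathA) === md5_file($pathB);
}

// Demo with two throwaway files (stand-ins for the real CSVs).
$tmpA = tempnam(sys_get_temp_dir(), 'csv');
$tmpB = tempnam(sys_get_temp_dir(), 'csv');
file_put_contents($tmpA, "1|a\n2|b\n");
file_put_contents($tmpB, "1|a\n2|c\n");

var_dump(filesAreIdentical($tmpA, $tmpA)); // same file compared with itself
var_dump(filesAreIdentical($tmpA, $tmpB)); // same size, different content

unlink($tmpA);
unlink($tmpB);
```

This only tells you whether the files are byte-for-byte identical; files holding the same rows in a different order will still be reported as different.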

If they do differ, I would do something like the following:

$csv_1_path = 'file_1.csv';
$csv_2_path = 'file_2.csv';
$fh_csv_1 = fopen($csv_1_path, 'r');
$fh_csv_2 = fopen($csv_2_path, 'r');
$md5_1 = array();
$md5_2 = array();
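// Hash every line of the first file.
while( ($line = fgets($fh_csv_1)) !== false ) {
  $md5_1[] = md5($line);
}
fclose($fh_csv_1);

// Hash every line of the second file (this loop must read from $fh_csv_2).
while( ($line = fgets($fh_csv_2)) !== false ) {
  $md5_2[] = md5($line);
}
fclose($fh_csv_2);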

$common_records = array_intersect($md5_1, $md5_2);

$records_diff_count = 0;
foreach($md5_1 as $row_index => $md5_rec_1) {
  if ( !in_array($md5_rec_1, $common_records) ) {
    print "Record in file $csv_1_path, row $row_index has no match.\n";
    $records_diff_count++;
  }
}

foreach($md5_2 as $row_index => $md5_rec_2) {
  if ( !in_array($md5_rec_2, $common_records) ) {
    print "Record in file $csv_2_path, row $row_index has no match.\n";
    $records_diff_count++;
  }
}

Once you have the row indexes for each file, you can do a more in-depth analysis of the differences between the files.
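With around 70k records per file, the `in_array` calls inside the loops above scan the whole list for every row. A sketch of the same check with keyed lookups, assuming a hypothetical helper `unmatchedRowIndexes` that takes the two lists of line hashes:

```php
<?php
// Hypothetical helper: returns the row indexes of $hashes whose value never occurs in $otherHashes.
function unmatchedRowIndexes(array $hashes, array $otherHashes): array
{
    // array_flip turns the hash values into keys, so isset() replaces a linear in_array() scan.
    $lookup = array_flip($otherHashes);
    $unmatched = array();
    foreach ($hashes as $rowIndex => $hash) {
        if (!isset($lookup[$hash])) {
            $unmatched[] = $rowIndex;
        }
    }
    return $unmatched;
}

// Stand-in values; in the answer's code these would be $md5_1 and $md5_2.
$md5_1 = array(md5("1|a\n"), md5("2|b\n"), md5("3|c\n"));
$md5_2 = array(md5("2|b\n"), md5("9|z\n"));

foreach (unmatchedRowIndexes($md5_1, $md5_2) as $rowIndex) {
    print "Record in file 1, row $rowIndex has no match.\n";
}
```

Calling the helper a second time with the arguments swapped gives the unmatched rows of the other file.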

Answer 2

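Quick Observations

- You don't need to sort to get the difference
- I'm not sure how effectively you are loading 300MB into a PHP application, but evidently you don't have memory issues; otherwise I would have advised you to use SQL or Map-Reduce instead.
- Your CSV is a bit odd; it's hard to tell whether space or | is the delimiter.
- Since you have 20 columns (wow), running the diff on a hash of each row would be better than on all 20 columns; you can then seek back to the position of the content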

A simple version of your code

$csvA = "a.log";
$csvB = "b.log";

echo PHP_EOL;

$hashA = readCSVFile($csvA);
$hashB = readCSVFile($csvB);

// Lines in A not in B
$hash = array_diff($hashA, $hashB);
if (($fp = fopen($csvA, "r")) !== FALSE) {
    foreach ( $hash as $p => $v ) {
        fseek($fp, $p);
        echo implode("|",array_map("trim", fgetcsv($fp, 2024, "|"))), PHP_EOL;
    }
    fclose($fp);
}

Output

236|INPQR|31-AUG-12|200|INR|6653|AAAAAA,PPPPP|0|0|0|0|0
225|INPZQ|31-AUG-12|200|USD|6655|AAAAAA,PPPPP|0|0|0|0|0
225|INPZQ|31-AUG-12|200|INR|6654|AAAAAA,PPPPP|0|0|0|0|0

Function used
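The `readCSVFile` function called in this answer's code is not defined anywhere in the text. A minimal sketch of what it might look like, assuming it maps the byte offset at which each row starts to a hash of that row's trimmed fields (which is what the later `fseek($fp, $p)` call relies on). The body, the `$delimiter` parameter and the choice of `sha1` are assumptions, not the author's code.

```php
<?php
// Hypothetical reconstruction: maps each row's starting byte offset to a hash of its trimmed fields.
function readCSVFile(string $file, string $delimiter = "|"): array
{
    $hash = array();
    if (($fp = fopen($file, "r")) === false) {
        return $hash;
    }
    $offset = ftell($fp);
    while (($data = fgetcsv($fp, 0, $delimiter, '"', "\\")) !== false) {
        // fgetcsv returns array(null) for a blank line; skip those.
        if ($data !== array(null)) {
            // Trimming first means padded and unpadded copies of a row hash the same.
            $hash[$offset] = sha1(implode($delimiter, array_map("trim", $data)));
        }
        $offset = ftell($fp); // the next row starts here
    }
    fclose($fp);
    return $hash;
}
```

With this shape, `array_diff($hashA, $hashB)` preserves the offsets as keys, so each unmatched row can be re-read from the file by seeking to its key.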

Answer 3

You can sort your data correctly using the following lines.

<?php
function readCSV($fileName, $delimiter, $exclude_cols = array()) {
    $data = array();
    $fh = fopen($fileName, 'r');
    while ($line = fgetcsv($fh, 0, $delimiter)) {
        if (count($line) == 1 && $line[0] == '') {
            continue;
        }

        for ($i = 0; $i < count($line); $i++) {
            $line[$i] = in_array($i, $exclude_cols, true) ? 'NTBT' : trim($line[$i]);
        }
        $data[] = $line;
    }
    fclose($fh);
    return $data;
}

function sort2dArray($data) {
    $tmp = array();
    $lineCount = count($data);
    foreach ($data as $lineNum => $lineData) {
        foreach ($lineData as $column => $value) {
            $tmp[$column][$lineNum] = $value;
        }
    }

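    // Build the argument list: each column (by reference) followed by its sort order.
    // Call-time pass-by-reference such as array_push($a, &$x) is a fatal error since PHP 5.4.
    $multiSortArgs = array();
    foreach ($tmp as $column => &$columnData) {
        $multiSortArgs[] = &$columnData;
        $multiSortArgs[] = SORT_ASC;
    }
    unset($columnData); // break the reference left over from the foreach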
    $multiSortArgs[] = &$data;
    call_user_func_array('array_multisort', $multiSortArgs);
    return $data;
}

// ========= Reading and sorting
// The expected data
$data_Exp = readCSV($fileExp, $_POST['dl1']);
$rarr1 = sort2dArray($data_Exp);

// The actual data
$data_Act = readCSV($fileAct, $_POST['dl2']);
$rarr2 = sort2dArray($data_Act);

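However, it will not solve your problem unless you expect the files to contain exactly the same set of rows, possibly in a shuffled order.

If your use case includes the possibility of rows being missing or entirely different, sorting is only half of the solution.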
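One way to cover the other half, sketched under the assumption that each row has already been joined into a single string: sort both lists with the same ordering and walk them in step, so that a missing or extra row advances only one side instead of shifting every later comparison. The name `diffSortedRows` is hypothetical.

```php
<?php
// Hypothetical helper: reports rows present in only one of two lists of row strings.
function diffSortedRows(array $rowsA, array $rowsB): array
{
    // Sort both lists with the same byte-wise string ordering that strcmp() uses below.
    sort($rowsA, SORT_STRING);
    sort($rowsB, SORT_STRING);

    $onlyInA = array();
    $onlyInB = array();
    $i = 0;
    $j = 0;
    $countA = count($rowsA);
    $countB = count($rowsB);

    while ($i < $countA && $j < $countB) {
        $cmp = strcmp($rowsA[$i], $rowsB[$j]);
        if ($cmp === 0) {        // same row on both sides
            $i++;
            $j++;
        } elseif ($cmp < 0) {    // row exists only in A
            $onlyInA[] = $rowsA[$i++];
        } else {                 // row exists only in B
            $onlyInB[] = $rowsB[$j++];
        }
    }
    // Whatever is left over on either side has no counterpart.
    while ($i < $countA) { $onlyInA[] = $rowsA[$i++]; }
    while ($j < $countB) { $onlyInB[] = $rowsB[$j++]; }

    return array('onlyInA' => $onlyInA, 'onlyInB' => $onlyInB);
}

// Each row is assumed to be already joined into one string, e.g. implode('|', $fields).
$result = diffSortedRows(
    array('1|a', '2|b', '3|c', '4|d', '5|e'),
    array('1|a', '2|b', '3|c', '4|d', '1|e')
);
print_r($result);
```

Forcing `SORT_STRING` keeps numeric-looking fields from being compared as numbers, so both files are ordered by the same rule.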

Answer 4

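There are two different ways to compare two CSV files. I use a method that checks for differing rows in both files, and I took into account that you want to drop certain columns from the rows.

I did not use sorting, because I check whether a row exists in the other file, not whether it sits at the same position. The reason is simple: if one row does not match and sorts near the start of the file, every row after it will be reported as different.

Example: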

file1:  file2:
1|a     1|a
2|b     2|b
3|c     3|c
4|d     4|d
5|e     1|e

After sorting

file1:  file2:
1|a     1|a
2|b     1|e
3|c     2|b
4|d     3|c
5|e     4|d

Now the rows 2, 3, 4, and 5 are all marked as different, because they do not match if you check per line. But in fact only 1 row is different.

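In the code below you will find comments explaining why I do certain things. I also tested the code on several large CSV files (~45 MB and 100,000 rows), and each check returned its count of differing rows in under 10 seconds.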

<?php
set_time_limit(0);

//create a function to create the CSV arrays.
//If you create the code twice like you did, you are bound to make a mistake or change something in one place and not the other.
//Obviously that could lead to sorting two equal files differently.
function CsvToArray($file) 
{
  $exclude_cols = array(2); //you didn't provide it, so for testing I remove the date col because it's always the same

  //load file contents into variable and trim it
  $data = trim(implode('', file($file)));

  //strip \r new line to make sure only \n is used
  $data = str_replace("\r", "", $data);
  //strip all spaces from |
  $data = preg_replace('/\s\s+\|/', '|', $data);
  $data = preg_replace('/\|\s\s+/', '|', $data);
  //strip all spaces from each line
  $data = preg_replace('/\s\s+\n/', "\n", $data);
  $data = preg_replace('/\n\s\s+/', "\n", $data);

  //each line to separate row
  $data = explode("\n", $data);

  //each col to separate record
  //This is only needed for the comparison if you want to remove certain cols
  //if that's not needed, you can skip this part
  foreach($data as $k=>$v)
    $data[$k] = explode('|', $v);

  //get the header. Its always the first row
  //array_shift will return the first element and remove it from the dataset
  $header = array_shift($data);

  //turn the array around, by making the row the key and counting how many times it occurs
  $ar = array();
  foreach ($data as $row) {
    //remove unwanted cols
    //if you don't want to remove certain cols, skip this and the implode part and use $row itself as the key
    foreach($exclude_cols as $c)
      $row[$c] = '';
    //implode the remaining
    $key = implode('', $row);

    //you can use strtolower($key) for case-insensitive matching
    //initialise the counter on first sight; ++ on a missing key raises an undefined-key notice/warning
    $ar[$key] = isset($ar[$key]) ? $ar[$key] + 1 : 1;
  }

  return $ar;
}

function CompareTwoCsv($file1, $file2)
{
  $start = microtime(true);

  $ar1 = CsvToArray($file1);
  $ar2 = CsvToArray($file2);

  //check for differences.
  $diff = 0;
  foreach($ar1 as $k=>$v) {
    //the second array doesnt contain the key (is row) so there is a difference
    if (!array_key_exists($k, $ar2)) {
      $diff+=$v; //all rows that are in the first array are different
      continue;
    }
    $c2 = $ar2[$k];

    if ($v == $c2) //row is in both file an equal number of times
      continue;

    $diff += max($v, $c2) - min($v, $c2); //add the number of different rows
  }

  $ar1_count = count($ar1);
  $ar2_count = count($ar2);

  //if ar2 has more records. Every row that is more, is different.
  if ($ar2_count>$ar1_count)
    $diff += $ar2_count - $ar1_count;

  $end = microtime(true);
  $difftime = $end - $start;

  //debug output
  echo "We found ".$diff." differences in the files. it took ".$difftime." seconds<hr>";
}

//test and test2 are two files with ~100.000 rows based on the data you supplied.
//They have many equal rows in the files, so the array returned from CsvToArray is small
CompareTwoCsv("test.txt", "test.txt");
//We found 0 differences in the files. it took 5.6848769187927 seconds

CompareTwoCsv("test.txt", "test2.txt");
//We found 17855 differences in the files. it took 6.6002569198608 seconds

CompareTwoCsv("test2.txt", "test.txt");
//We found 17855 differences in the files. it took 7.5223989486694 seconds


//randomly generated files with 100.000 rows. Very little duplicate data;

CompareTwoCsv("largescv1.txt", "largescv2.txt");
//We found 98250 differences in the files. it took 5.4302139282227 seconds

?>
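The comparison loop above walks only the keys of the first file, and the final adjustment compares the number of distinct rows rather than individual rows. As a complementary sketch (not the author's method), the surplus rows can be counted separately for each direction, assuming arrays shaped like the return value of `CsvToArray`. The helper name `rowCountDifferences` is hypothetical.

```php
<?php
// Hypothetical helper: $counts1 and $counts2 map a row key to how often that row occurs.
function rowCountDifferences(array $counts1, array $counts2): array
{
    $onlyInFirst = 0;
    $onlyInSecond = 0;

    foreach ($counts1 as $key => $n) {
        $m = isset($counts2[$key]) ? $counts2[$key] : 0;
        if ($n > $m) {
            $onlyInFirst += $n - $m;   // surplus copies in the first file
        } elseif ($m > $n) {
            $onlyInSecond += $m - $n;  // surplus copies in the second file
        }
    }
    // Rows that never occur in the first file were not visited above.
    foreach ($counts2 as $key => $m) {
        if (!isset($counts1[$key])) {
            $onlyInSecond += $m;
        }
    }

    return array('onlyInFirst' => $onlyInFirst, 'onlyInSecond' => $onlyInSecond);
}

print_r(rowCountDifferences(
    array('1a' => 1, '5e' => 1),
    array('1a' => 1, '1e' => 1)
));
```

Reporting the two directions separately makes it possible to tell a replaced row (one surplus on each side) from a row that is simply missing from one file.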

Result:

array_multisort sorting 2 identical multi-dimensional arrays differently

4 answers: