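I have read two CSV files, each with about 25k records. One is the old CSV and one is the new CSV. I need to find the records in the new CSV whose 'primary_contact' field differs from the matching record in the old CSV, where two records match if their 'name', 'state' and 'city' fields are the same.
New CSV: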
Array(
[0] => Array
(
[0] => ID
[1] => NAME
[2] => STATE
[3] => CITY
[4] => COUNTY
[5] => ADDRESS
[6] => PHONE
[7] => PRIMARY CONTACT
[8] => POSITION
[9] => EMAIL
)
[1] => Array
(
[0] => 2002
[1] => Abbeville Christian Academy
[2] => Alabama
[3] => Abbeville
[4] => Henry
[5] => Po Box 9 Abbeville, AL 36310-0009
[6] => (334) 585-5100
[7] => Ashley Carlisle
[8] => Athletic Director
[9] => acarlisle@acagenerals.org
)
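)
The problem is that I compare them with two nested foreach loops. That works fine for a small number of records, but when I run it on old and new CSV files with 25k records each, the process takes forever to finish.
Both CSVs contain some duplicates, so I remove those first: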
function multi_unique($data){
$data = array_reverse($data);
$result = array_reverse( // Reverse array to the initial order.
array_values( // Get rid of string keys (make array indexed again).
array_combine( // Create array taking keys from column and values from the base array.
array_column($data, 1),
$data
)
)
);
return $result;
}
$old_csv=multi_unique($old_csv);
$new_csv=multi_unique($new_csv);
This is my comparison code. I need something faster:
$name_index_no = 1;
$state_index_no = 2;
$city_index_no = 3;
$country_index_no = 4;
$address_index_no = 5;
$primary_contact_index_no = 7;
$new_export_records[] = $old_csv[0];
foreach($new_csv as $key=>$value){
foreach($old_csv as $key1=>$value1){
if( $old_csv[$key1][$state_index_no] == $new_csv[$key][$state_index_no] &&
$old_csv[$key1][$city_index_no] == $new_csv[$key][$city_index_no] &&
$old_csv[$key1][$name_index_no] == $new_csv[$key][$name_index_no] ){
if($old_csv[$key1][$primary_contact_index_no] !=
$new_csv[$key][$primary_contact_index_no]){
$new_export_records[] = $new_csv[$key];
}
unset($old_csv[$key1]);
break;
}
}
}
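Answer 0: (score: 3)
As Michael pointed out, your current solution runs n * m iterations. With 25k records in each file, that is up to 625 million comparisons, which is far too many. However, if you first pass over the old data to build an index, and then pass over the new data and check each row against that index, you are done in m + n iterations (about 50k here).
An example would be: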
$name_index_no = 1;
$state_index_no = 2;
$city_index_no = 3;
$country_index_no = 4;
$address_index_no = 5;
$primary_contact_index_no = 7;
$genKey = function ($row, $glue = '|') use ($state_index_no, $city_index_no, $name_index_no) {
return implode($glue, [
$row[$state_index_no],
$row[$city_index_no],
$row[$name_index_no],
]);
};
// create an index using the old data
$t = microtime(true);
$index = [];
foreach ($old_csv as $row) {
$index[$genKey($row)] = $row;
}
printf("index generation: %.5fs\n", microtime(true) - $t);
// collect changed/new entries
$t = microtime(true);
$changed = [];
$new = [];
foreach ($new_csv as $row) {
$key = $genKey($row);
// key doesn't exist => new entry
if (!isset($index[$key])) {
$new[] = $row;
}
// primary contact differs => changed entry
elseif ($row[$primary_contact_index_no] !== $index[$key][$primary_contact_index_no]) {
$changed[] = $row;
}
}
printf("comparison: %.5fs\n", microtime(true) - $t);
print_r($changed);
print_r($new);
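To try the idea without the real files, here is a minimal self-contained sketch of the same index-then-lookup approach. The function names `makeKey` and `diffContacts` are hypothetical, and the sample rows are invented for illustration (only the Abbeville row borrows values from the question). It assumes the column layout shown in the question: name at index 1, state at 2, city at 3, primary contact at 7.

```php
<?php
// Hypothetical helper: builds a lookup key from state, city and name.
// Assumes the column positions from the question's CSV layout.
function makeKey(array $row, string $glue = '|'): string
{
    return implode($glue, [$row[2], $row[3], $row[1]]);
}

// Hypothetical helper: one pass over $old to build the index, one pass
// over $new to look each row up, so m + n iterations in total.
function diffContacts(array $old, array $new): array
{
    $index = [];
    foreach ($old as $row) {
        // Only the primary contact is needed for the comparison.
        $index[makeKey($row)] = $row[7];
    }

    $changed = [];
    $added = [];
    foreach ($new as $row) {
        $key = makeKey($row);
        if (!array_key_exists($key, $index)) {
            $added[] = $row;      // no matching name/state/city in the old data
        } elseif ($row[7] !== $index[$key]) {
            $changed[] = $row;    // same record, different primary contact
        }
    }
    return ['changed' => $changed, 'added' => $added];
}

// Invented sample data; unused columns are left empty.
$header = ['ID', 'NAME', 'STATE', 'CITY', 'COUNTY', 'ADDRESS', 'PHONE', 'PRIMARY CONTACT', 'POSITION', 'EMAIL'];
$old_sample = [
    $header,
    ['2002', 'Abbeville Christian Academy', 'Alabama', 'Abbeville', 'Henry', '', '', 'Old Contact', '', ''],
];
$new_sample = [
    $header,
    ['2002', 'Abbeville Christian Academy', 'Alabama', 'Abbeville', 'Henry', '', '', 'Ashley Carlisle', '', ''],
    ['9999', 'Example School', 'Alabama', 'Mobile', 'Mobile', '', '', 'Jane Doe', '', ''],
];

$result = diffContacts($old_sample, $new_sample);
print_r($result['changed']);
print_r($result['added']);
```

The header row appears in both inputs with identical values, so it lands in neither list. The sketch uses `array_key_exists` rather than `isset` so that a row whose contact value happens to be null would still count as present in the index.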