Question

我有两个列表A和B，B = A + C - D.所有元素都是唯一的，没有重复。我如何获得以下列表：
（1）新增项目，C
（2）拆除旧物品，D

C和D不超过10000个元素。

修改

废话，对不起家伙 - 忘了重要的细节 - 这些都是文本文件，而不是内存元素。

Answer 1

我认为数组的大小是无关紧要的，除非你真的想要关注这个操作的性能如何，即你需要每单位时间执行一定数量的执行。

如果你只是需要这样做才能完成它，使用array_diff()

对我来说似乎很微不足道

$a = array( 1, 2, 3, 4 );
$b = array( 1, 3, 5, 7 ); // 2 and 4 removed, 5 and 7 added

$c = array_diff( $b, $a ); // [5, 7]
$d = array_diff( $a, $b ); // [2, 4]

Answer 2

执行此操作的最有效方法是首先对列表进行排序，并尽可能少地访问阵列的元素。一个例子：

<?php

sort($a, SORT_NUMERIC);
sort($b, SORT_NUMERIC);
$c = array();
$d = array();
while (($currA = array_pop($a)) !== null) {
        while (($currB = array_pop($b)) !== null) {
                if ($currB == $currA) {
                        // exists in both, skip value
                        continue 2;
                }
                if ($currA > $currB) {
                        // exists in A only, add to D, push B back on to stack
                        $d[] = $currA;
                        $b[] = $currB;
                        continue 2;
                }
                // exists in B only, add to C
                $c[] = $currB;
        }
        // exists in A only, for values of A < all of B
        $d[] = $currA;
}

即使对于长度只有几百个元素的列表，这也会比对array_diff的2次调用执行快几个数量级。

Answer 3

你说你已经有两个文件A和B.

假设您在Unix系统上运行，这是最简单，最快速的解决方案。

system("comm -13 A B > C");
system("comm -23 A B > D");

//read C and D in PHP

Answer 4

function diffLists($listA,$listB) {

  $resultAdded = array();
  $resultRemoved = array();
  foreach($listB AS $item) {
    if (!in_array($item,$listA)) {
       $resultAdded[] = $item;
    }
  }
  foreach($listA AS $item) {
    if (!in_array($item,$listB)) {
      $resultRemoved[] = $item;
    }
  }
  return array($resultAdded,$resultRemoved);
}



$myListA = array('item1','item2','item3');
$myListB = array('item1','item3','item4');
print_r(diffLists($myListA,$myListB));

这应该输出一个包含2个元素的数组。第一个元素是列表B中添加的项目列表，第二个元素是列表B中已删除的项目列表。

Answer 5

如果您希望更有效地使用Levenshtein算法，可能需要尝试，

http://en.wikipedia.org/wiki/Levenshtein_distance

Answer 6

在B中搜索A的每个值（反之亦然）具有O（n ^ 2）复杂度。

对于大量数据，您可能最好对每个列表O（n log n）进行排序，然后通过排序列表单次传递计算添加/删除的元素。（相对容易，因为你知道没有重复。）

PHP代码比较两个大文本文件与~300,000个条目和输出差异

修改

6 个答案: