Question

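I have a large list of integers (thousands), and I want to extract the first N (on the order of 10-20) unique elements from it. Each integer in the list occurs roughly three times.

Writing an algorithm to do this is trivial, but I wonder what the most speed- and memory-efficient way to do it is.

There are some additional constraints and information in my case:

In my use case I extract my uniques multiple times on the array, each time skipping some elements from the beginning. The number of elements I skip is not known during unique extraction. I don't even have an upper bound. Therefore sorting is not speed-efficient (I have to preserve the order of the array).
The integers are all over the place, so a bit array as a lookup solution is not feasible.
I want to avoid temporary allocations during the search at all costs.

My current solution looks roughly like this: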

  int num_uniques = 0;
  int uniques[16];
  int startpos = 0;

  while ((num_uniques != N) && (startpos < array_length))
  {
    // a temporary used later.
    int insert_position;

    // Get next element.
    int element = array[startpos++];

    // Check if the element exists. If the element is not found,
    // return the position where it could be inserted while keeping
    // the array sorted.

    if (!binary_search (uniques, element, num_uniques, &insert_position))
    {

      // Insert the new unique element while preserving
      // the order of the array.

      insert_into_array (uniques, element, insert_position);

      num_uniques++;
    }
  }

The binary_search / insert_into_array algorithm gets the job done, but the performance is not great. The insert_into_array call moves elements around a lot, and this slows everything down.
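The two helpers are not shown in the question. A minimal sketch of what they might look like follows; the bodies are assumptions, and insert_into_array here takes an extra count argument (not present in the call above) so that it knows how many elements to shift.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: search sorted[0..count) for value.
   Returns 1 if found (position in *pos); otherwise returns 0 and
   stores in *pos the index where value would keep the array sorted. */
static int binary_search(const int *sorted, int value, int count, int *pos)
{
    int lo = 0, hi = count;            /* search window is [lo, hi) */
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        if (sorted[mid] == value) { *pos = mid; return 1; }
        if (sorted[mid] < value) lo = mid + 1;
        else hi = mid;
    }
    *pos = lo;
    return 0;
}

/* Hypothetical helper: open a gap at pos and store value there.
   count is the number of elements held before the insert; the caller
   must guarantee room for one more. This shift is the costly part. */
static void insert_into_array(int *sorted, int count, int value, int pos)
{
    assert(pos >= 0 && pos <= count);
    memmove(&sorted[pos + 1], &sorted[pos],
            (size_t)(count - pos) * sizeof sorted[0]);
    sorted[pos] = value;
}
```

Note that keeping uniques sorted loses the order of first appearance, which matters only if the caller needs that order back.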

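Any ideas?

EDIT

Great answers, guys! Everyone deserves an accepted answer, but I can only give one. I'll implement a bunch of your ideas and do a performance shootout with some typical data. The one whose idea leads to the fastest implementation gets the accepted answer.

I'll run the code on a modern PC and on an embedded Cortex-A8 CPU, and I'll weight the results somehow. I will post the results as well.

EDIT: Results of the shootout

Timings on a Core-Duo, 100 iterations over a 160kb test dataset.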

Bruteforce (Pete):            203 ticks
Hash and Bruteforce (Antti):  219 ticks
Inplace Binary Tree (Steven): 390 ticks
Binary-Search (Nils):         438 ticks

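http://torus.untergrund.net/code/unique_search_shootout.zip (C source and test data)

Additional notes:

The Inplace Binary Tree does very well on truly random distributions (my test data has a tendency to be ascending).
Binary-Search works very well on my test data for more than 32 uniques. It performs almost linearly.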

Answer 1

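Why not just start inserting the array elements into a std::set and stop when the set has N elements? Sets are guaranteed not to contain duplicates. They are also guaranteed to be sorted, so if you traverse a set from begin() to end(), you will do so in sorted order according to operator<.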

Answer 2

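I would try storing the uniques in an unbalanced binary tree. This saves you the cost of rearranging the uniques list, and if the source list is sufficiently random, the inserts won't unbalance the tree too badly. (You can do the search and the insert-if-not-present in a single pass with a binary tree.) If it does become unbalanced, the worst case is the same as iterating over a 16-element list instead of doing a binary search.

You know the maximum size of the binary tree, so you can preallocate all the necessary memory ahead of time; that shouldn't be a problem. You could even use the "I've run out of memory for nodes" condition to tell you when you're done.

(EDIT: Apparently people think I'm advocating the use of exceptions here. I'm not. I might advocate actual Common Lisp-style conditions, but not the escaping-continuation-style exceptions found in most languages. Besides, it looks very much like he wants to do this in C.)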
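A minimal sketch of that idea in C, assuming a caller-supplied node pool so that nothing is allocated during the search. The names first_uniques_tree, struct node, and MAX_UNIQUES are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_UNIQUES 16   /* assumed upper bound on N */

struct node {
    int value;
    struct node *left, *right;
};

/* Collects up to n distinct values from src[0..len) into a fixed pool.
   Returns how many were found. pool[0..result) holds them in order of
   first appearance, because nodes are handed out sequentially. */
static int first_uniques_tree(const int *src, int len, int n,
                              struct node *pool)
{
    int used = 0;
    struct node *root = NULL;
    assert(n >= 0 && n <= MAX_UNIQUES);

    for (int i = 0; i < len && used < n; i++) {
        struct node **link = &root;
        /* Walk down until we hit the value or an empty link. */
        while (*link != NULL && (*link)->value != src[i])
            link = (src[i] < (*link)->value) ? &(*link)->left
                                             : &(*link)->right;
        if (*link == NULL) {           /* not present: take next pool node */
            struct node *fresh = &pool[used++];
            fresh->value = src[i];
            fresh->left = fresh->right = NULL;
            *link = fresh;
        }
    }
    return used;                       /* pool exhausted or input consumed */
}
```

Running out of pool nodes is simply the loop condition here, which is the "out of memory for nodes" signal without any exception machinery.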

Answer 3

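For an array that small (if you want the first 20 elements, there are on average 10 elements to check for equality), a linear scan will often outperform a binary search, even if you don't have to insert elements.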
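A minimal sketch of the brute-force scan, assuming the caller provides an output buffer with room for n values; the function name is made up.

```c
#include <assert.h>

/* Copies up to n distinct values from src[0..len) into out, in order of
   first appearance, and returns how many were stored. */
static int first_uniques_linear(const int *src, int len, int n, int *out)
{
    int count = 0;
    assert(n >= 0);
    for (int i = 0; i < len && count < n; i++) {
        int k = 0;
        /* Compare against every unique found so far. */
        while (k < count && out[k] != src[i])
            k++;
        if (k == count)                /* reached the end: value is new */
            out[count++] = src[i];
    }
    return count;
}
```

Skipping elements from the beginning is just a matter of passing src + skip and len - skip.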

Answer 4

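The fastest time complexity you can achieve given your constraints is O(n), by using a dictionary with O(1) lookup instead of a binary tree for the unique integers. Why bother searching for them when you can look them up in constant time?

Since you are only dealing with "thousands of records", anything else is a negligible addition.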
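C has no built-in dictionary, so here is one possible sketch: a small open-addressing table on the stack. The table size, the hash multiplier, and all names are assumptions chosen for illustration, not something the answer specifies.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_UNIQUES 16   /* assumed upper bound on N */
#define TABLE_SIZE  64   /* power of two, 4x MAX_UNIQUES (assumed) */

/* Copies up to n distinct values from src[0..len) into out, in order of
   first appearance, using a stack-allocated hash table for lookups. */
static int first_uniques_hashed(const int *src, int len, int n, int *out)
{
    unsigned char slots[TABLE_SIZE];   /* 0 = empty, else index in out + 1 */
    int count = 0;
    assert(n >= 0 && n <= MAX_UNIQUES);
    memset(slots, 0, sizeof slots);

    for (int i = 0; i < len && count < n; i++) {
        uint32_t h = (uint32_t)src[i] * 2654435761u;  /* multiplicative hash */
        unsigned pos = h >> 26;                       /* top 6 bits: 0..63 */
        /* Linear probing: stop at the value itself or at an empty slot.
           The table never fills, since it holds at most 16 of 64 slots. */
        while (slots[pos] != 0 && out[slots[pos] - 1] != src[i])
            pos = (pos + 1) & (TABLE_SIZE - 1);
        if (slots[pos] == 0) {
            out[count] = src[i];
            slots[pos] = (unsigned char)(count + 1);
            count++;
        }
    }
    return count;
}
```

Storing an index into out rather than the value itself avoids needing a reserved "empty" integer, which matters when the inputs can be any int.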

Answer 5

Instead of storing the unique integers in an array, use an actual binary tree. That saves you from repeatedly moving array elements around.

Answer 6

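Use an array representation of a binary tree. The array can be of size 3N. Basically:

arr[i] = value

arr[i + 1] = array index of the left child

arr[i + 2] = array index of the right child

Walk the "tree" each time you insert k. If k is not found, update its parent's [i + 1] or [i + 2] and add k at the next empty index. When you run out of space in the array, you have your answer.

E.g.

Find the first 3 uniques of 4 2 2 4 3 1 2 3: array size = 3 * 3 = 9.

In the table below, "v" is the value, "l" is the left child index, and "r" is the right child index.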

 v  l  r  v  l  r  v  l  r
 _________________________
-1 -1 -1 -1 -1 -1 -1 -1 -1
 4 -1 -1 -1 -1 -1 -1 -1 -1
 4  3 -1  2 -1 -1 -1 -1 -1
 4  3 -1  2 -1 -1 -1 -1 -1
 4  3 -1  2 -1 -1 -1 -1 -1
 4  3 -1  2 -1  6  3 -1 -1

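You're out of space.

The array entries at indices that are 0 mod 3 are your answer.

You can preserve the order by using groups of 4:

array[i] = value

array[i + 1] = position in the original array

array[i + 2] = left child index

array[i + 3] = right child index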
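A minimal sketch of the 3-ints-per-node layout in C. It tracks the next free index in a counter instead of testing for a -1 value, which is an assumption made here so that -1 can still appear as data; the function name and MAX_UNIQUES are made up.

```c
#include <assert.h>

#define MAX_UNIQUES 16   /* assumed upper bound on N */

/* tree holds 3 ints per node: value, left child index, right child index.
   A child index points at the first int of the child node; -1 means none.
   Returns the number of ints used, so result / 3 is the unique count. */
static int first_uniques_flat_tree(const int *src, int len, int n, int *tree)
{
    int used = 0;                      /* next free index in tree */
    assert(n >= 0 && n <= MAX_UNIQUES);

    for (int i = 0; i < len && used < 3 * n; i++) {
        int k = src[i];
        int link = -1;                 /* child slot to patch on insert */
        int cur = (used > 0) ? 0 : -1; /* the root lives at index 0 */
        while (cur != -1 && tree[cur] != k) {
            link = cur + (k < tree[cur] ? 1 : 2);
            cur = tree[link];
        }
        if (cur == -1) {               /* k not found: append a new node */
            tree[used] = k;
            tree[used + 1] = -1;
            tree[used + 2] = -1;
            if (link != -1)
                tree[link] = used;
            used += 3;
        }
    }
    return used;
}
```

For the input 4 2 2 4 3 1 2 3 with n = 3 this ends with the same nine entries as the last row of the table: 4 3 -1 2 -1 6 3 -1 -1.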

Answer 7

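If you have thousands of integers and each appears about three times, your algorithm should find a set of N unique integers fairly quickly, in roughly N(1 + e) steps for some small e (assuming the integers are ordered relatively randomly).

This means that your algorithm will do N inserts of random integers into the uniques array. Inserting number K shifts K/2 elements in the array on average, yielding (N^2)/4 move operations. Your binary search will take roughly N * (log(N) - 1) steps. This gives your algorithm a total complexity of (N^2)/4 + N(log(N) - 1) + N(1 + e).

I think you can do better, for example as follows: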

/* Needs <stdint.h> for uint32_t and <string.h> for memset. */
int num_uniques = 0, startpos = 0, k, element;
int uniques[16];

/* Allocate and clear a bit table of 32 * 32 = 1024 bits. */
uint32_t bit_table[32], hash;
memset(bit_table, 0, sizeof(bit_table));

while (num_uniques < N && startpos < array_length) {
  element = array[startpos++];

  /* Hash the element quickly to a number from 0..1023 */
  hash = element ^ (element >> 16);
  hash *= 0x19191919;
  hash >>= 22;
  hash &= 1023;

  /* Map the hash value to a bit in the bit table.
     Use the low 5 bits of 'hash' to index bit_table
     and the other 5 bits to get the actual bit. */
  uint32_t slot = hash & 31;
  uint32_t bit = (1u << (hash >> 5));

  /* If the bit is NOT set, this element is guaranteed unique. */
  if (!(bit_table[slot] & bit)) {
    bit_table[slot] |= bit;
    uniques[num_uniques++] = element;
  } else { /* Otherwise it can still be unique with probability
              num_uniques / 1024. */
    for (k = 0; k < num_uniques; k++) { if (uniques[k] == element) break; }
    if (k == num_uniques) uniques[num_uniques++] = element;
  }
}

This algorithm will run in an expected time of N + N^2/128, because the probability of running the inner loop (with the index variable k) is low.

Answer 8

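Given a list of integers of size N called L:

Iterate over L once, finding the largest and smallest values in the array.

Allocate (1 allocation) an array of integers of size (small...big) called A. Initialize this array to zero.

Iterate over L, using L(i) as a subscript into A, incrementing the integer found there.

Then do your processing. Pick your starting point in L, and walk forward through the list, looking at A(i). Pick whatever set of A(i) > 2 you want.

When done, discard A.

If you are very short on space, use 2 bits instead of an integer, with the following interpretation: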

00 count = 0
01 count = 1
10 count = 2
11 count > 2
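The 2-bit encoding above can be sketched as a pair of helpers over a packed byte array. This is only an illustration: the names are made up, and idx is assumed to be the value minus the smallest value in L, so that it starts at 0.

```c
#include <assert.h>
#include <stdint.h>

/* Read the 2-bit counter at idx. Four counters are packed per byte. */
static unsigned counter_get(const uint8_t *packed, uint32_t idx)
{
    return (packed[idx / 4] >> ((idx % 4) * 2)) & 3u;
}

/* Increment the counter at idx, saturating at 3 (binary 11, "count > 2").
   slots is the number of counters the packed array can hold. */
static void counter_bump(uint8_t *packed, uint32_t slots, uint32_t idx)
{
    unsigned shift = (idx % 4) * 2;
    unsigned c;
    assert(idx < slots);
    c = counter_get(packed, idx);
    if (c < 3u) {
        unsigned cleared = packed[idx / 4] & ~(3u << shift);
        packed[idx / 4] = (uint8_t)(cleared | ((c + 1u) << shift));
    }
}
```

The packed array needs (big - small + 1 + 3) / 4 bytes, which is still large when the integers are spread over a wide range.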

Extracting the first N unique integers from an array
