Question

I have an array of hashes, and each hash contains a large array of more than 100,000 elements.

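I wrote this method to remove duplicates across the inner arrays, keeping only one copy of each element. Unfortunately, it is unusable on arrays this large because the array `-` operator is so expensive.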

The data structure I am trying to reduce looks like this:

[{regex: "st.+", results: ["string1", "string2", "strong"]}, {regex: "string.+", results: ["string1", "string2"]}]

To clarify, :regex is a regular expression used to find strings in a large array. That is why similar regexes can produce duplicate values across the arrays.

def uniqify(arr)
  # This loops over an array of hashes and subtracts each :results
  # array from every later one, so each value survives only once.
  i = 0
  while i < arr.length
    a = arr[i][:results]
    j = i + 1
    while j < arr.length
      b = arr[j][:results]
      arr[j][:results] = b - a
      j += 1
    end
    i += 1
  end
  arr
end

The expected output for my sample data should be:

[{regex: "st.+", results: ["string1", "string2", "strong"]}, {regex: "string.+", results: []}]

How can I make this loop perform better?

Answer 1

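I think part of your problem is that you are doing O(n^2) array subtractions: every array is subtracted from every array that follows it, which is a lot of wasted effort. One improvement would be to keep a set containing everything you have seen so far. That way each array only has to be processed once, and checking whether the set already contains an element is cheap.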

require 'set'

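def uniquify!(arrays)
  seen = Set.new # every value encountered so far, across all arrays
  arrays.each do |array|
    i = 0
    while i < array.length
      current = array[i]
      if seen.include? current
        # Already seen: remove it and stay at the same index,
        # since the next element has shifted into position i.
        array.delete_at(i)
      else
        seen.add(current)
        i += 1
      end
    end
  end
end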

This modifies the argument arrays in place (which is why I added the trailing ! to the name).
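The method above takes an array of arrays, while the question's data is an array of hashes with a :results key. As a rough sketch of how the same seen-set idea could be applied to that structure, the variant below uses `Array#reject!` together with `Set#add?`. The name `uniquify_results!` is hypothetical and not part of the original answer. Note one difference from the question's `b - a` approach: this version also drops repeats within a single :results array.

```ruby
require 'set'

# Hypothetical helper (an assumption, not from the original answer).
# Removes, in place, every value in each hash's :results array that
# has already appeared earlier, either in a previous hash or earlier
# in the same array.
def uniquify_results!(hashes)
  seen = Set.new
  hashes.each do |hash|
    # Set#add? returns nil when the value is already in the set,
    # so reject! keeps only the first occurrence of each value.
    hash[:results].reject! { |value| seen.add?(value).nil? }
  end
  hashes
end

data = [
  { regex: "st.+",     results: ["string1", "string2", "strong"] },
  { regex: "string.+", results: ["string1", "string2"] }
]

uniquify_results!(data)
```

Using `reject!` filters each array in a single pass, whereas repeated `delete_at` calls shift the remaining elements on every deletion, which may matter for arrays with many duplicates.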
