Question

I have a list of arrays, and I want to count the occurrences of duplicates.

For example, if I have this:

{{1,2,3},
 {1,0,3},
 {1,2,3},
 {5,2,6},
 {5,2,6},
 {5,2,6}}

I want a map (or any relevant collection) like this:

{ {1,2,3} -> 2,
  {1,0,3} -> 1,
  {5,2,6} -> 3 }

I can even lose the array values; I'm only interested in the cardinalities (e.g. 2, 1 and 3 here).

My solution

I use the following algorithm:

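First, hash each array and check whether the hash is in a HashMap<Integer, ArrayList<int[]>>, let's name it distinctHash, where the key is the hash and the value is an ArrayList, let's name it rowList, containing the different arrays that share this hash (to handle collisions).
If the hash is not in distinctHash, put the array with the value 1 into another HashMap<int[], Long> that counts each occurrence; let's call it distinctElements.
If the hash is in distinctHash, check whether the corresponding array is contained in rowList. If it is, increment the value in distinctElements associated with the identical array found in rowList. (If you used the new array as the key, you would create another key, since the two references are different.)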
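The remark about references can be checked with a small sketch. The class name ArrayKeyDemo and its helper method are hypothetical and only serve as illustration: Java arrays inherit identity-based equals and hashCode from Object, so two arrays with the same contents act as two different HashMap keys.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical demo class, not part of the original post.
class ArrayKeyDemo {
    // Inserts both arrays as keys into a HashMap<int[], Long> and
    // returns how many distinct keys the map ends up with.
    static int distinctKeysFor(int[] a, int[] b) {
        Map<int[], Long> counts = new HashMap<>();
        counts.merge(a, 1L, Long::sum);
        counts.merge(b, 1L, Long::sum);
        return counts.size();
    }

    public static void main(String[] args) {
        int[] first = {1, 2, 3};
        int[] second = {1, 2, 3};
        System.out.println(Arrays.equals(first, second));   // true: contents are compared
        System.out.println(first.equals(second));           // false: references are compared
        System.out.println(distinctKeysFor(first, second)); // 2: equal contents, two keys
        System.out.println(distinctKeysFor(first, first));  // 1: same reference, one key
    }
}
```

This is why the algorithm has to look up the previously stored array instance before incrementing its count.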

Here is the code. The returned boolean tells whether a new distinct array was found, and I apply this function sequentially to all of my arrays:

    HashMap<int[], Long> distinctElements;
    HashMap<Integer, ArrayList<int[]>> distinctHash;

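    private boolean addRow(int[] row) {
        // `hash` is the hash of `row`; its computation is not shown in the post
        // (the question mentions Guava's murmur3_128).
        if (distinctHash.containsKey(hash)) {
            for (int[] previousRow: distinctHash.get(hash)) {
                if (Arrays.equals(previousRow, row)) {
                    // Increment the count under the stored instance that matched,
                    // not under the first array of the collision list.
                    distinctElements.put(
                            previousRow,
                            distinctElements.get(previousRow) + 1
                    );
                    return false;
                }
            }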
            distinctElements.put(row, 1L);

            ArrayList<int[]> rowList = distinctHash.get(hash);
            rowList.add(row);
            distinctHash.put(hash, rowList);

            return true;

        } else {
            distinctElements.put(row, 1L);

            ArrayList<int[]> newValue = new ArrayList<>();
            newValue.add(row);
            distinctHash.put(hash, newValue);

            return true;
        }
    }

Problem

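The problem is that my algorithm is too slow for my needs (40 s for 5,000,000 arrays and 2h-3h for 20,000,000 arrays). Profiling with NetBeans told me that hashing takes 70% of the runtime (using the Google Guava murmur3_128 hash function).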

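Is there another algorithm that could be faster? As I said, I'm not interested in the array values, only in the number of their occurrences. I'm ready to sacrifice precision for speed, so a probabilistic algorithm is fine.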

Answer 1

Wrap the int[] in a class that implements equals and hashCode, then build a Map from the wrapper class to the instance count.

class IntArray {
    private int[] array;
    public IntArray(int[] array) {
        this.array = array;
    }
    @Override
    public int hashCode() {
        return Arrays.hashCode(this.array);
    }
    @Override
    public boolean equals(Object obj) {
        return (obj instanceof IntArray && Arrays.equals(this.array, ((IntArray) obj).array));
    }
    @Override
    public String toString() {
        return Arrays.toString(this.array);
    }
}

Test

int[][] input = {{1,2,3},
                 {1,0,3},
                 {1,2,3},
                 {5,2,6},
                 {5,2,6},
                 {5,2,6}};
Map<IntArray, Long> map = Arrays.stream(input).map(IntArray::new)
        .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
map.entrySet().forEach(System.out::println);

Output

[1, 2, 3]=2
[1, 0, 3]=1
[5, 2, 6]=3

Note: the above solution is faster and uses less memory than the solution by Ravindra Ranwala, but it does require creating an extra class, so it is debatable which is better.

For smaller arrays, use the simpler solution by Ravindra Ranwala, shown below.
For larger arrays, the above solution is likely better.

 Map<List<Integer>, Long> map = Stream.of(input)
         .map(a -> Arrays.stream(a).boxed().collect(Collectors.toList()))
         .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
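The question applies its function to one row at a time, so the wrapper idea can also be used without streams. The following is a minimal sketch under that assumption; RowCounter, Key, countOf and distinctRows are hypothetical names, and Key simply repeats the content-based equals and hashCode of the wrapper class. Map.merge both inserts and increments, and its return value tells whether the row was new.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of the original answer.
class RowCounter {
    // Same idea as the wrapper class: equality and hash are based on contents.
    static final class Key {
        private final int[] array;
        Key(int[] array) { this.array = array; }
        @Override public int hashCode() { return Arrays.hashCode(array); }
        @Override public boolean equals(Object obj) {
            return obj instanceof Key && Arrays.equals(array, ((Key) obj).array);
        }
    }

    private final Map<Key, Long> counts = new HashMap<>();

    // Returns true if the row was not seen before, false if it is a duplicate.
    boolean addRow(int[] row) {
        return counts.merge(new Key(row), 1L, Long::sum) == 1L;
    }

    // Number of times a row with these contents has been added so far.
    long countOf(int[] row) {
        return counts.getOrDefault(new Key(row), 0L);
    }

    // Number of distinct rows seen so far.
    int distinctRows() {
        return counts.size();
    }
}
```

The key wraps the array without copying it, so a row must not be modified after it has been added; otherwise its hash changes and lookups will miss it.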

Answer 2

You can do it like this:

Map<List<Integer>, Long> result = Stream.of(source)
        .map(a -> Arrays.stream(a).boxed().collect(Collectors.toList()))
        .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));

Here is the output:

{[1, 2, 3]=2, [1, 0, 3]=1, [5, 2, 6]=3}

Answer 3

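If all duplicates of an array have their elements in the same order and each array is short, you can map each array to a single int and then reuse the last part of your method. This reduces the hashing time, but it relies on assumptions that may not hold in your case.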
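One way to read this suggestion is as a positional encoding. The sketch below is an interpretation, not the answerer's code. It assumes that all arrays have the same length, that every value lies in [0, base), and that base raised to the array length fits in a long (a long is used instead of an int for extra headroom). PackedKeyCounter, pack and count are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper illustrating the "map each array to a number" idea.
class PackedKeyCounter {
    // Encodes a row as a base-`base` number.
    // Assumption: every value is in [0, base) and the result fits in a long.
    static long pack(int[] row, int base) {
        long key = 0;
        for (int value : row) {
            if (value < 0 || value >= base) {
                throw new IllegalArgumentException("value out of range: " + value);
            }
            // The exact methods throw ArithmeticException on overflow instead of wrapping.
            key = Math.addExact(Math.multiplyExact(key, (long) base), (long) value);
        }
        return key;
    }

    // Counts occurrences per packed key; the array contents are not kept.
    static Map<Long, Long> count(int[][] rows, int base) {
        Map<Long, Long> counts = new HashMap<>();
        for (int[] row : rows) {
            counts.merge(pack(row, base), 1L, Long::sum);
        }
        return counts;
    }
}
```

With base 10, the row {1,2,3} from the question packs to 123, {1,0,3} to 103 and {5,2,6} to 526, so the three counts end up under those keys. If the arrays had different lengths, rows such as {0,1} and {1} would collide, which is why the fixed length is part of the assumptions.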

Count occurrences of each distinct array in a list of arrays with duplicates
