Best search technique for a large number of possibilities

Time: 2012-11-22 18:26:09

Tags: algorithm search set

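I am trying to search a text file for a large number of possible values.

For example, I want to search a text file that contains unique names. If I find name X, I want to store X in another file.

The problem is that I have more than 1,000 unique names, and I don't want to write 1,000 search calls and if statements, one for each name.

Is there a better way to do this in Java / JavaScript / PHP?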

1 answer:

Answer 0 (score: 2)

You have one set of names, and you want to find which of them also appear in another set of names.

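// readFile is a helper (not shown) that loads the names in a file into a Set.
Set<String> namesFromFile = readFile(filename);
Set<String> namesToMatch = readFile(matchingNames);
// Keep only the names that also occur in the file; namesToMatch is modified in place.
namesToMatch.retainAll(namesFromFile);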

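With hash-based sets, retainAll is an O(n) operation, where n is the size of the set it is called on (here namesToMatch, the smaller set), because each of its elements is looked up once in the other set. In Java, a retainAll on a set of 1,000 values should take a few milliseconds at most.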
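To connect this back to the question (find the names, then store the matches in another file), here is a minimal end-to-end sketch. It rests on assumptions the answer does not state: both files hold one name per line, and the class name `NameMatcher`, the helper `readNames`, and the command-line argument order are made up for illustration. `readNames` stands in for the `readFile` helper that the answer leaves undefined.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

public class NameMatcher {

    // Hypothetical helper: assumes one name per line; trims whitespace, skips blank lines.
    static Set<String> readNames(Path file) throws IOException {
        Set<String> names = new HashSet<>();
        for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
            String name = line.trim();
            if (!name.isEmpty()) {
                names.add(name);
            }
        }
        return names;
    }

    // Returns the names present in both sets; neither argument is modified.
    static Set<String> matches(Set<String> namesFromFile, Set<String> namesToMatch) {
        Set<String> result = new HashSet<>(namesToMatch);
        result.retainAll(namesFromFile);
        return result;
    }

    public static void main(String[] args) throws IOException {
        // Assumed usage: java NameMatcher <text file> <names to match> <output file>
        if (args.length != 3) {
            System.err.println("usage: NameMatcher <textFile> <namesFile> <outputFile>");
            return;
        }
        Set<String> found = matches(readNames(Paths.get(args[0])), readNames(Paths.get(args[1])));
        // Sort before writing so the output file is stable between runs.
        Files.write(Paths.get(args[2]), new TreeSet<>(found), StandardCharsets.UTF_8);
    }
}
```

If the text file is free-form text rather than one name per line, `readNames` would have to split each line into candidate names first; the set intersection itself stays the same.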

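Set.retainAll() does the following:

Retains only the elements in this set that are contained in the specified collection (optional operation). In other words, removes from this set all of its elements that are not contained in the specified collection. If the specified collection is also a set, this operation effectively modifies this set so that its value is the intersection of the two sets.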
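Because retainAll changes the set it is called on, it can help to see the in-place behaviour next to a copying variant. The sketch below is illustrative only: the class `RetainAllDemo` and the helper `intersection` are hypothetical names, and copying the smaller set first is a design choice of this sketch, not something the documentation requires.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class RetainAllDemo {

    // Hypothetical helper: intersection that leaves both inputs untouched.
    // It copies the smaller set, so the in-place retainAll walks fewer elements.
    static <T> Set<T> intersection(Set<T> a, Set<T> b) {
        Set<T> smaller = a.size() <= b.size() ? a : b;
        Set<T> larger = (smaller == a) ? b : a;
        Set<T> result = new HashSet<>(smaller);
        result.retainAll(larger);
        return result;
    }

    public static void main(String[] args) {
        Set<String> wanted = new HashSet<>(Arrays.asList("X", "Y", "Z"));
        Set<String> seen = new HashSet<>(Arrays.asList("W", "X", "Z"));

        // Copying variant: 'wanted' still has all three names afterwards.
        Set<String> both = intersection(wanted, seen);
        System.out.println(both.size() + " " + wanted.size()); // prints "2 3"

        // In-place variant: returns true because "Y" was removed from 'wanted'.
        boolean changed = wanted.retainAll(seen);
        System.out.println(changed + " " + wanted.size()); // prints "true 2"
    }
}
```

The copying variant is the safer default when the list of names to match is reused across several files.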


A set of 1,000 is too small to time accurately, so in this test I use a set 10 times larger: 10,000 elements intersected with a set of 100,000 elements.

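import java.util.HashSet;
import java.util.Set;

public class IntersectBenchmark {
    public static void main(String... args) {
        Set<String> names1 = generateStrings(100000, 2);
        Set<String> names2 = generateStrings(10000, 3);
        // Repeat the measurement; the first few runs are typically slower while the JVM warms up.
        for (int i = 0; i < 10; i++) {
            long start = System.nanoTime();
            // Copy the smaller set, because retainAll modifies the set it is called on.
            Set<String> intersect = new HashSet<String>(names2);
            intersect.retainAll(names1);
            long time = System.nanoTime() - start;
            System.out.printf("The intersect of %,d and %,d elements has %,d and took %.3f ms%n",
                    names1.size(), names2.size(), intersect.size(), time / 1e6);
        }
    }

    // Builds a set of `number` distinct strings: the binary forms of 0, multiple, 2 * multiple, ...
    private static Set<String> generateStrings(int number, int multiple) {
        Set<String> set = new HashSet<String>();
        for (int i = 0; i < number; i++)
            set.add(Integer.toBinaryString(i * multiple));
        return set;
    }
}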

This prints:

The intersect of 100,000 and 10,000 elements has 5,000 and took 21.173 ms
The intersect of 100,000 and 10,000 elements has 5,000 and took 10.785 ms
The intersect of 100,000 and 10,000 elements has 5,000 and took 9.597 ms
The intersect of 100,000 and 10,000 elements has 5,000 and took 3.414 ms
The intersect of 100,000 and 10,000 elements has 5,000 and took 2.791 ms
The intersect of 100,000 and 10,000 elements has 5,000 and took 2.629 ms
The intersect of 100,000 and 10,000 elements has 5,000 and took 2.689 ms
The intersect of 100,000 and 10,000 elements has 5,000 and took 2.753 ms
The intersect of 100,000 and 10,000 elements has 5,000 and took 2.704 ms
The intersect of 100,000 and 10,000 elements has 5,000 and took 2.645 ms