Question

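I'm looking for a way to index a very large number of strings, say 100,000,000 (possibly more), with an average length of 50 bytes each (= 5,000,000,000 bytes = 5 GB of data; more than that once you account for UTF-16 and .NET memory allocation).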
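The size estimate above can be worked through as a rough sketch. It assumes one UTF-16 char per source byte (which holds only for ASCII-only data) and a per-string object overhead of 24 bytes. That overhead is an illustrative assumption, not a measured figure. The names `SizeEstimate`, `RawBytes`, and `Utf16Bytes` are made up for this example.

```csharp
using System;

public static class SizeEstimate
{
    // Raw payload in bytes: number of strings times average encoded length.
    public static long RawBytes(long count, int avgBytes) => count * avgBytes;

    // Rough in-memory size if every source byte becomes one UTF-16 char (2 bytes),
    // plus an ASSUMED fixed per-string object overhead in bytes.
    public static long Utf16Bytes(long count, int avgChars, int perStringOverhead)
        => count * (2L * avgChars + perStringOverhead);

    public static void Main()
    {
        long count = 100_000_000;
        Console.WriteLine(RawBytes(count, 50));        // 5000000000  (5 GB raw)
        Console.WriteLine(Utf16Bytes(count, 50, 0));   // 10000000000 (UTF-16 chars only)
        Console.WriteLine(Utf16Bytes(count, 50, 24));  // 12400000000 (with assumed overhead)
    }
}
```

Even before counting the hash table's own buckets, the in-memory form is likely to be at least twice the raw size, which is why a disk-based structure is of interest here.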

I then want to use the index to let other processes query whether a string exists in the index, as fast as possible.

I have run some simple tests with a large in-memory HashSet holding about 1,000,000 strings; looking up, for example, 50,000 strings in it takes only a few milliseconds.
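For reference, an in-memory test of this kind might look roughly like the sketch below. It is not the exact test I ran: the synthetic `"item-N"` keys, the class name `InMemoryBaseline`, and the helper names are assumptions for illustration, and the timing will vary by machine.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

public static class InMemoryBaseline
{
    // Builds a HashSet containing `count` synthetic strings: "item-0", "item-1", ...
    public static HashSet<string> Build(int count)
    {
        var set = new HashSet<string>(StringComparer.Ordinal);
        for (int i = 0; i < count; i++)
            set.Add("item-" + i);
        return set;
    }

    // Counts how many probes are present in the set and reports the elapsed time.
    public static int CountHits(HashSet<string> set, IEnumerable<string> probes, out TimeSpan elapsed)
    {
        var sw = Stopwatch.StartNew();
        int hits = 0;
        foreach (var p in probes)
            if (set.Contains(p)) hits++;
        sw.Stop();
        elapsed = sw.Elapsed;
        return hits;
    }

    public static void Main()
    {
        var set = Build(1_000_000);

        // 50,000 probes; keys at or above "item-1000000" are deliberate misses.
        var probes = new List<string>(50_000);
        for (int i = 0; i < 50_000; i++)
            probes.Add("item-" + (i * 40));

        TimeSpan elapsed;
        int hits = CountHits(set, probes, out elapsed);
        Console.WriteLine(hits + " hits in " + elapsed.TotalMilliseconds + " ms");
    }
}
```

This is the lookup speed a disk-based index would ideally come close to.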

Here is some pseudocode for what I want to achieve:

// 1) Create a huge disk-based HashSet / index / lookup
//    (DiskBasedHashSet<T> does not exist; it is the kind of type I am looking for)
using (var hs = new DiskBasedHashSet<string>(@"c:\index.bin", FileMode.Create)) {
    foreach (var s in lotsOfStringsToIndex) {
        hs.Add(s);
    }
}

// 2) Use the index to check whether items exist - this needs to be fast
public static class Query {
    static readonly DiskBasedHashSet<string> hs =
        new DiskBasedHashSet<string>(@"c:\index.bin", FileMode.Open);

    // Callable from anywhere, and really fast
    public static bool QueryItem(string s) {
        return hs.Contains(s);
    }
}

foreach (var s in checkForThese) {
    var result = Query.QueryItem(s);
}

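I have tried SQL Server, Lucene.NET, and B+ trees, with and without partitioning the data. All of these solutions are too slow and, I think, overkill for this task. Imagine the overhead of creating a SQL query or a Lucene filter just to check whether a string is in a set.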

Fast disk-based HashSet / index for a large collection of strings

0 answers: