Question

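Suppose I have an entire book (or two) in memory and I want to count the number of unique words in it. How would I do that? My naive approach for small strings is: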

Create a simple hash of each word, use it as an index into a fixed-size array, and increment that array element.
All words whose array element ends up at 1 are unique.

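For my very long strings I would like a better approach. I am coding in C. One approach I can think of is to use worker threads that each process a chunk and then combine the results. Is there a better algorithm?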

Answer 1

As @user3386109 already mentioned, a trie would be the best solution. The basic idea is to build a tree of characters. For example:

                             a
                            / \
                           /   \
                          b     c
                         /     / \
                        /     /   \
                       d     a     b

would contain the words "a", "ab", "abd", "ac", "aca" and "acb". Simply extend this approach to a tree map that maps each word to its respective count; the whole lookup then becomes linear and can be done word by word:

trie lookup
trienode node = lookup.root

for char c in input:
    if c == ' ':
        //end of word, increment count (skip empty words from repeated spaces)
        if node != lookup.root:
            node.count += 1

        //start with root again
        node = lookup.root
    else:
        //go to matching node in the trie, creating it if necessary
        if !node.hasChild(c):
            node.insertChild(c)

        node = node.childForChar(c)

if node != lookup.root:
    //increment count for last word, if the last char wasn't a space
    node.count += 1

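All that remains is to analyze the trie built this way. This is easily done by filtering for all nodes with a count greater than 0 and listing the paths to those nodes together with their respective counts.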
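A minimal C sketch of the whole procedure, building the trie and then walking it, might look like the following. The names TrieNode, trie_new, trie_add_text, trie_report and trie_free, as well as the 256-way child array, are assumptions made for illustration. Like the pseudocode, it splits words on space characters only.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

#define FANOUT 256  /* assumption: one child slot per possible byte value */

typedef struct TrieNode {
    struct TrieNode *child[FANOUT];
    size_t count;                 /* how many words end at this node */
} TrieNode;

/* Allocate a zero-initialised node (no children, count 0). */
TrieNode *trie_new(void) {
    return calloc(1, sizeof(TrieNode));
}

/* Release a node and everything below it. */
void trie_free(TrieNode *node) {
    if (node == NULL)
        return;
    for (int c = 0; c < FANOUT; c++)
        trie_free(node->child[c]);
    free(node);
}

/* Insert every space-separated word of text.
   Returns 0 on success, -1 if memory runs out. */
int trie_add_text(TrieNode *root, const char *text) {
    assert(root != NULL && text != NULL);
    TrieNode *node = root;
    for (const unsigned char *p = (const unsigned char *)text; *p != '\0'; p++) {
        if (*p == ' ') {
            if (node != root)     /* ignore runs of spaces */
                node->count++;
            node = root;
        } else {
            if (node->child[*p] == NULL) {
                node->child[*p] = trie_new();
                if (node->child[*p] == NULL)
                    return -1;
            }
            node = node->child[*p];
        }
    }
    if (node != root)             /* last word, if the text did not end in a space */
        node->count++;
    return 0;
}

/* Depth-first walk. buf holds the path from the root, depth is its length,
   cap is the size of buf. Each node with count > 0 is a word: it is printed
   to out (if out is not NULL; words longer than cap are printed truncated)
   and counted. Returns the number of distinct words below node. */
size_t trie_report(const TrieNode *node, char *buf, size_t depth, size_t cap, FILE *out) {
    size_t distinct = 0;
    if (node->count > 0) {
        distinct = 1;
        if (out != NULL)
            fprintf(out, "%.*s %zu\n", (int)(depth < cap ? depth : cap), buf, node->count);
    }
    for (int c = 0; c < FANOUT; c++) {
        if (node->child[c] != NULL) {
            if (depth < cap)
                buf[depth] = (char)c;
            distinct += trie_report(node->child[c], buf, depth + 1, cap, out);
        }
    }
    return distinct;
}
```

For a trie built from "a ab abd ab", calling trie_report(root, buf, 0, sizeof buf, stdout) would list "a", "ab" and "abd" with their counts and return 3. Note that a 256-pointer array per node is fast but memory-hungry; that trade-off is part of the design question discussed next.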

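You may want to add filtering for punctuation characters, digits and so on. If the lookup of child nodes is designed well, this approach scans the entire text in O(n). Even with a HashTree as the lookup table for child nodes, each lookup can still be performed in logarithmic time, giving O(n log n) overall, where n is the length of the input text in characters.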
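One possible way to combine the filtering with a cheap child lookup is sketched below: every non-letter is treated as a word separator, case is folded, and each node has a plain 26-slot array, so a child lookup is a single index operation. The names LetterNode and count_distinct_words are hypothetical, and the code assumes an ASCII-compatible character set in which 'a' to 'z' are contiguous.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

#define LETTERS 26  /* assumption: input alphabet restricted to a-z after case folding */

typedef struct LetterNode {
    struct LetterNode *child[LETTERS];
    size_t count;                 /* how many words end at this node */
} LetterNode;

static LetterNode *letter_node_new(void) {
    return calloc(1, sizeof(LetterNode));
}

static void letter_trie_free(LetterNode *node) {
    if (node == NULL)
        return;
    for (int i = 0; i < LETTERS; i++)
        letter_trie_free(node->child[i]);
    free(node);
}

/* Count the distinct words in text. Letters are lower-cased; every other
   byte (spaces, punctuation, digits, ...) ends the current word.
   Returns the number of distinct words, or -1 if memory runs out. */
long count_distinct_words(const char *text) {
    assert(text != NULL);
    LetterNode *root = letter_node_new();
    if (root == NULL)
        return -1;

    LetterNode *node = root;
    long distinct = 0;
    const unsigned char *p = (const unsigned char *)text;
    for (;;) {
        int c = tolower(*p);
        if (c >= 'a' && c <= 'z') {
            int i = c - 'a';      /* constant-time child lookup */
            if (node->child[i] == NULL) {
                node->child[i] = letter_node_new();
                if (node->child[i] == NULL) {
                    letter_trie_free(root);
                    return -1;
                }
            }
            node = node->child[i];
        } else {
            /* a separator or the terminating NUL closes the current word;
               the word is new if its count was still 0 */
            if (node != root && node->count++ == 0)
                distinct++;
            node = root;
            if (*p == '\0')
                break;
        }
        p++;
    }
    letter_trie_free(root);
    return distinct;
}
```

With this filter, "The cat, the DOG; the cat!" contains three distinct words. A side effect of splitting on every non-letter is that "don't" is counted as the two words "don" and "t", so the separator rule may need adjusting for real text.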

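Thanks to @PaulHankin for doing the benchmarking. The results are basically this: depending on how far we can restrict the input alphabet, the trie performs either better or worse than a HashTable (the approach proposed by @PaulHankin). If the input is restricted to lowercase letters, the trie outperforms the HashTable by a factor of 2.6. If we allow all 256 byte values and use an array as the lookup table, the advantage drops to 1.3 times the performance of the HashTable. Using a HashMap as the lookup table for the child nodes degrades the trie-based algorithm further, to about twice the runtime of the HashTable. In the end, the speed of this algorithm really depends on how much you are willing to restrict the size of the input alphabet.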
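For comparison, a hash-table counter of the general kind referred to above could be sketched as follows. This is not @PaulHankin's benchmark code, which is not shown here; the names WordTable, table_add_text and table_count, the fixed bucket count and the choice of 32-bit FNV-1a as the hash function are all assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define BUCKETS 65536  /* assumption: fixed bucket count, no resizing */

typedef struct Entry {
    struct Entry *next;   /* next entry in the same bucket (chaining) */
    size_t count;         /* occurrences of this word */
    size_t len;           /* length of word in bytes */
    char word[];          /* the word itself, not NUL-terminated */
} Entry;

typedef struct {
    Entry *bucket[BUCKETS];
    size_t distinct;      /* number of different words stored */
} WordTable;

WordTable *table_new(void) {
    return calloc(1, sizeof(WordTable));
}

void table_free(WordTable *t) {
    if (t == NULL)
        return;
    for (size_t i = 0; i < BUCKETS; i++) {
        for (Entry *e = t->bucket[i]; e != NULL; ) {
            Entry *next = e->next;
            free(e);
            e = next;
        }
    }
    free(t);
}

/* 32-bit FNV-1a hash over len bytes. */
static uint32_t hash_bytes(const char *s, size_t len) {
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < len; i++) {
        h ^= (unsigned char)s[i];
        h *= 16777619u;
    }
    return h;
}

/* Find the entry for word[0..len), or NULL if it is not stored. */
static Entry *table_find(const WordTable *t, const char *word, size_t len) {
    for (Entry *e = t->bucket[hash_bytes(word, len) % BUCKETS]; e != NULL; e = e->next)
        if (e->len == len && memcmp(e->word, word, len) == 0)
            return e;
    return NULL;
}

/* Record every space-separated word of text. Returns 0, or -1 if out of memory. */
int table_add_text(WordTable *t, const char *text) {
    assert(t != NULL && text != NULL);
    const char *p = text;
    while (*p != '\0') {
        while (*p == ' ')                 /* skip separators */
            p++;
        const char *start = p;
        while (*p != '\0' && *p != ' ')   /* scan one word */
            p++;
        size_t len = (size_t)(p - start);
        if (len == 0)
            continue;
        Entry *e = table_find(t, start, len);
        if (e == NULL) {                  /* first occurrence: insert at bucket head */
            e = malloc(sizeof *e + len);
            if (e == NULL)
                return -1;
            Entry **slot = &t->bucket[hash_bytes(start, len) % BUCKETS];
            e->next = *slot;
            e->count = 0;
            e->len = len;
            memcpy(e->word, start, len);
            *slot = e;
            t->distinct++;
        }
        e->count++;
    }
    return 0;
}

/* Number of times word occurred, 0 if never seen. */
size_t table_count(const WordTable *t, const char *word) {
    const Entry *e = table_find(t, word, strlen(word));
    return e != NULL ? e->count : 0;
}
```

Unlike the trie, this table has to hash and compare whole words, but its cost per word does not depend on the size of the input alphabet, which is consistent with the trie's advantage shrinking as the alphabet grows.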

Count unique words in a long string
