I have a large sequence file that stores the tf-idf values of my documents. Each line represents one document, and the columns are the tf-idf values of the terms (so each row is a sparse vector). I want to use Hadoop to select the top-k words for each document. The naive solution is to iterate over all the columns of each row in the mapper and pick the top k, but as the file gets larger and larger I don't think this is a good solution. Is there a better way to do this in Hadoop?
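To make the naive solution concrete, here is a minimal Java sketch of what I mean; the row layout (docId term:score term:score ...) and the RowTopK/topK names are just assumptions for illustration, not my actual file format:

    import java.util.PriorityQueue;

    public class RowTopK {
        // Keep the k highest-scoring "term:score" entries of one sparse row.
        // A min-heap bounded at size k makes this a single O(n log k) pass,
        // even when a row has many columns.
        static String[] topK(String line, int k) {
            String[] cols = line.split(" "); // cols[0] is the document id (assumed layout)
            PriorityQueue<String> heap = new PriorityQueue<>(k,
                    (a, b) -> Double.compare(score(a), score(b)));
            for (int i = 1; i < cols.length; i++) {
                heap.offer(cols[i]);
                if (heap.size() > k) heap.poll(); // drop the current weakest entry
            }
            return heap.toArray(new String[0]);
        }

        private static double score(String entry) {
            return Double.parseDouble(entry.substring(entry.indexOf(':') + 1));
        }
    }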
Answer 0 (score: 0)
1. In every map task, calculate a top K (this is the local top K for each map).
2. Spawn a single reducer; the local top-K results from all mappers will flow to this one reducer, and the global top K will be evaluated there (see the sketch after this list).
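A minimal sketch of that two-step pattern with Hadoop's Java MapReduce API. The class and field names (TopK, TopKMapper, TopKReducer, K) are my own illustration, and I assume each input record is a term and its tf-idf score separated by a tab; adapt the parsing to the real layout. Configure the job with job.setNumReduceTasks(1) so all local winners meet in one reducer:

    import java.io.IOException;
    import java.util.PriorityQueue;

    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class TopK {
        static final int K = 10; // assumed k; pass it via the job Configuration in practice

        // Step 1: each map task keeps only its local top K in a bounded
        // min-heap whose root is the weakest entry kept so far.
        public static class TopKMapper
                extends Mapper<LongWritable, Text, NullWritable, Text> {
            private final PriorityQueue<String[]> heap = new PriorityQueue<>(K,
                    (a, b) -> Double.compare(Double.parseDouble(a[1]),
                                             Double.parseDouble(b[1])));

            @Override
            protected void map(LongWritable key, Text value, Context ctx) {
                String[] parts = value.toString().split("\t"); // [term, tfidf], assumed layout
                heap.offer(parts);
                if (heap.size() > K) heap.poll(); // evict the smallest score
            }

            @Override
            protected void cleanup(Context ctx)
                    throws IOException, InterruptedException {
                // Emit only K records per map task, all under one key,
                // so they all flow to the single reducer.
                for (String[] p : heap) {
                    ctx.write(NullWritable.get(), new Text(p[0] + "\t" + p[1]));
                }
            }
        }

        // Step 2: the single reducer merges the local winners into the global top K.
        public static class TopKReducer
                extends Reducer<NullWritable, Text, Text, DoubleWritable> {
            @Override
            protected void reduce(NullWritable key, Iterable<Text> values, Context ctx)
                    throws IOException, InterruptedException {
                PriorityQueue<String[]> heap = new PriorityQueue<>(K,
                        (a, b) -> Double.compare(Double.parseDouble(a[1]),
                                                 Double.parseDouble(b[1])));
                for (Text v : values) {
                    heap.offer(v.toString().split("\t"));
                    if (heap.size() > K) heap.poll();
                }
                while (!heap.isEmpty()) { // drains in ascending score order
                    String[] p = heap.poll();
                    ctx.write(new Text(p[0]), new DoubleWritable(Double.parseDouble(p[1])));
                }
            }
        }
    }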
Think of the problem like this:
1. You have been given the results of X horse races.
2. You need to find the top N fastest horses.