How to efficiently find the top-k elements?

Time: 2015-06-10 16:52:24

Tags: hadoop mapreduce tf-idf

I have a large SequenceFile that stores the tf-idf values of a collection of documents. Each row represents a document, and the columns hold the tf-idf value of each term (so each row is a sparse vector). I want to use Hadoop to select the top-k words for each document. The naive solution is to iterate over all the columns of each row in the mapper and pick the top k, but as the file grows larger I don't think that is a good solution. Is there a better way to do this in Hadoop?
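(For illustration only, and not part of the original question: the naive per-document selection can be sketched in plain Java with a min-heap bounded at size k, so a row with m non-zero terms costs O(m log k) rather than a full sort. The `row` map is an assumed stand-in for one document's sparse vector.)

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

public class RowTopK {
    // Select the k highest tf-idf terms from one document's sparse vector.
    // A min-heap bounded at size k keeps only the k largest values seen so far.
    static List<Map.Entry<String, Double>> topK(Map<String, Double> row, int k) {
        PriorityQueue<Map.Entry<String, Double>> heap = new PriorityQueue<>(
                (a, b) -> Double.compare(a.getValue(), b.getValue()));
        for (Map.Entry<String, Double> term : row.entrySet()) {
            heap.offer(new SimpleEntry<>(term.getKey(), term.getValue()));
            if (heap.size() > k) {
                heap.poll(); // evict the smallest candidate
            }
        }
        return new ArrayList<>(heap); // unordered; sort if ranked output is needed
    }
}
```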

1 answer:

Answer 0 (score: 0)

 1. In every mapper, calculate the top K (this is the local top K for each mapper).
 2. Spawn a single reducer; the top K from all mappers will flow to this one reducer, and the global top K can then be evaluated (a sketch of this pattern follows the analogy below).

Think of the problem as:

 1. You have been given the results of X horse races.
 2. You need to find the top N fastest horses.
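A minimal sketch of this local-then-global top-K pattern using the Hadoop `mapreduce` API. The input layout (SequenceFile with a Text document id and a Text value of space-separated `term:tfidf` pairs), the `topk.k` configuration key, and the class names are all illustrative assumptions, not part of the original answer; the job must also be configured with a single reducer (`job.setNumReduceTasks(1)`).

```java
import java.io.IOException;
import java.util.PriorityQueue;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class TopKJob {

    public static class TopKMapper extends Mapper<Text, Text, NullWritable, Text> {

        private int k;
        // Min-heap keyed on tf-idf: the smallest current candidate sits on top,
        // so it can be evicted whenever the heap grows past k.
        private final PriorityQueue<String[]> heap = new PriorityQueue<>(
                (a, b) -> Double.compare(Double.parseDouble(a[1]), Double.parseDouble(b[1])));

        @Override
        protected void setup(Context context) {
            k = context.getConfiguration().getInt("topk.k", 10); // assumed config key
        }

        @Override
        protected void map(Text docId, Text row, Context context) {
            // Assumed value layout: "term:tfidf term:tfidf ..." per document.
            for (String pair : row.toString().split("\\s+")) {
                String[] termAndWeight = pair.split(":");
                if (termAndWeight.length != 2) {
                    continue; // skip malformed entries
                }
                heap.offer(termAndWeight);
                if (heap.size() > k) {
                    heap.poll(); // keep only the k largest seen by this mapper
                }
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            // Emit only this mapper's local top K; the single NullWritable key
            // sends everything to the one reducer.
            for (String[] termAndWeight : heap) {
                context.write(NullWritable.get(),
                        new Text(termAndWeight[0] + ":" + termAndWeight[1]));
            }
        }
    }

    public static class TopKReducer extends Reducer<NullWritable, Text, Text, DoubleWritable> {

        @Override
        protected void reduce(NullWritable key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            int k = context.getConfiguration().getInt("topk.k", 10);
            PriorityQueue<String[]> heap = new PriorityQueue<>(
                    (a, b) -> Double.compare(Double.parseDouble(a[1]), Double.parseDouble(b[1])));
            // Merge the local top-K lists; only about k records per mapper reach here.
            for (Text value : values) {
                String[] termAndWeight = value.toString().split(":");
                heap.offer(termAndWeight);
                if (heap.size() > k) {
                    heap.poll();
                }
            }
            // Iteration order is heap order, not sorted; sort here if ranked output is needed.
            for (String[] termAndWeight : heap) {
                context.write(new Text(termAndWeight[0]),
                        new DoubleWritable(Double.parseDouble(termAndWeight[1])));
            }
        }
    }
}
```

The key design point is that each mapper emits only its own top K in `cleanup()`, so the single reducer sees at most K records per mapper instead of the whole file.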