I have a large sequence file that stores the tf-idf values of my documents. Each line represents one document, and the columns are the tf-idf values of the terms (so each row is a sparse vector). I want to use Hadoop to select the top-k words for each document. The naive solution is to iterate over all the columns of each row in the mapper and pick the top k, but as the file gets larger and larger I don't think this is a good solution. Is there a better way to do this in Hadoop?
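To make the naive solution concrete, here is a minimal Java sketch of what I mean; the row layout (docId term:score term:score ...) and the RowTopK/topK names are just assumptions for illustration, not my actual file format:

    import java.util.PriorityQueue;

    public class RowTopK {
        // Keep the k highest-scoring "term:score" entries of one sparse row.
        // A min-heap bounded at size k makes this a single O(n log k) pass,
        // even when a row has many columns.
        static String[] topK(String line, int k) {
            String[] cols = line.split(" "); // cols[0] is the document id (assumed layout)
            PriorityQueue<String> heap = new PriorityQueue<>(k,
                    (a, b) -> Double.compare(score(a), score(b)));
            for (int i = 1; i < cols.length; i++) {
                heap.offer(cols[i]);
                if (heap.size() > k) heap.poll(); // drop the current weakest entry
            }
            return heap.toArray(new String[0]);
        }

        private static double score(String entry) {
            return Double.parseDouble(entry.substring(entry.indexOf(':') + 1));
        }
    }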
Answer 0 (score: 0)
1. In every map task, calculate a top K (this is the local top K for each map).
2. Spawn a single reducer; the local top-K results from all mappers will flow to this one reducer, and the global top K will be evaluated there (see the sketch after this list).
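A minimal sketch of that two-step pattern with Hadoop's Java MapReduce API. The class and field names (TopK, TopKMapper, TopKReducer, K) are my own illustration, and I assume each input record is a term and its tf-idf score separated by a tab; adapt the parsing to the real layout. Configure the job with job.setNumReduceTasks(1) so all local winners meet in one reducer:

    import java.io.IOException;
    import java.util.PriorityQueue;

    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class TopK {
        static final int K = 10; // assumed k; pass it via the job Configuration in practice

        // Step 1: each map task keeps only its local top K in a bounded
        // min-heap whose root is the weakest entry kept so far.
        public static class TopKMapper
                extends Mapper<LongWritable, Text, NullWritable, Text> {
            private final PriorityQueue<String[]> heap = new PriorityQueue<>(K,
                    (a, b) -> Double.compare(Double.parseDouble(a[1]),
                                             Double.parseDouble(b[1])));

            @Override
            protected void map(LongWritable key, Text value, Context ctx) {
                String[] parts = value.toString().split("\t"); // [term, tfidf], assumed layout
                heap.offer(parts);
                if (heap.size() > K) heap.poll(); // evict the smallest score
            }

            @Override
            protected void cleanup(Context ctx)
                    throws IOException, InterruptedException {
                // Emit only K records per map task, all under one key,
                // so they all flow to the single reducer.
                for (String[] p : heap) {
                    ctx.write(NullWritable.get(), new Text(p[0] + "\t" + p[1]));
                }
            }
        }

        // Step 2: the single reducer merges the local winners into the global top K.
        public static class TopKReducer
                extends Reducer<NullWritable, Text, Text, DoubleWritable> {
            @Override
            protected void reduce(NullWritable key, Iterable<Text> values, Context ctx)
                    throws IOException, InterruptedException {
                PriorityQueue<String[]> heap = new PriorityQueue<>(K,
                        (a, b) -> Double.compare(Double.parseDouble(a[1]),
                                                 Double.parseDouble(b[1])));
                for (Text v : values) {
                    heap.offer(v.toString().split("\t"));
                    if (heap.size() > K) heap.poll();
                }
                while (!heap.isEmpty()) { // drains in ascending score order
                    String[] p = heap.poll();
                    ctx.write(new Text(p[0]), new DoubleWritable(Double.parseDouble(p[1])));
                }
            }
        }
    }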
Think of the problem like this:
1. You have been given the results of X horse races.
2. You need to find the top N fastest horses.