Question

I have a collection of files, each with 3 fields:

USER, ITEM, SCORE
U1, I1, S1
U1, I2, S2
U2, I1, S3
U1, I4, S4
...........

The output I need is:

 U1   [I1, I2,....I5]    # top 5 items, in descending order of score; additional items are dropped
 U2   [I1]               # top items if less than 5

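Can anyone give pseudocode for the algorithm? I know how to do the aggregation; what confuses me is how to cut off the additional items using map-reduce. Thanks.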

Answer 1

The basic idea is to have your mapper emit key-value pairs of the form (user, (score, item)).

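Hadoop then groups by user and sorts by (score, item). You will probably need to include some custom comparison logic so that it sorts by score, and likely in reverse order as well, since you want the 5 largest values. Note that Hadoop sorts keys rather than values, so ordering the values this way generally means using a composite key (the secondary-sort pattern).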

Your reducer can then simply collect the first five elements it encounters for each key. In pseudocode:

def map(user, item, score):
  emit(key=user, value=(score, item))

def compareValues(value1, value2):
  return -1 * compare(value1.score, value2.score)

def reduce(key, values):
  emit(key, values[0:5])
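
To make the data flow concrete, here is a minimal sketch that simulates the map, shuffle/sort, and reduce steps locally in plain Python. It is not Hadoop code: the names map_record, reduce_user, and run_job are made up for illustration, the scores are assumed to be numeric, and the in-memory sort stands in for what the framework's shuffle and a custom comparator would do.

```python
from itertools import groupby
from operator import itemgetter


def map_record(user, item, score):
    """Mapper: emit one (user, (score, item)) pair per input record."""
    yield user, (score, item)


def reduce_user(user, values, n=5):
    """Reducer: values arrive sorted by score descending; keep the first n items."""
    return user, [item for _score, item in values[:n]]


def run_job(records, n=5):
    # Map phase: turn every record into a key-value pair.
    pairs = [kv for rec in records for kv in map_record(*rec)]
    # Stand-in for shuffle/sort: order by user, then by score descending.
    pairs.sort(key=lambda kv: (kv[0], -kv[1][0]))
    # Reduce phase: one call per user, with that user's values in sorted order.
    return dict(
        reduce_user(user, [value for _key, value in group], n)
        for user, group in groupby(pairs, key=itemgetter(0))
    )


# Hypothetical sample data; U1 has six items, so one of them is cut off.
records = [
    ("U1", "I1", 0.9), ("U1", "I2", 0.4), ("U2", "I1", 0.7),
    ("U1", "I4", 0.8), ("U1", "I3", 0.1), ("U1", "I5", 0.6), ("U1", "I6", 0.5),
]
result = run_job(records)
```

With this sample, U1's lowest-scoring item (I3) is the one dropped, and U2 keeps its single item.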

Alternatively, I noticed you included a Hive tag... this can be done in pure Hive, assuming you are on version 0.11 or later:

select user, collect_set(item) from (
  select user, item, row_number() over (partition by user order by score desc) as r
  from foo
) t where r <= 5 group by user;

Sadly, it looks like Hive isn't smart enough to turn this into the simple algorithm above; it uses two MapReduce jobs.

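Edit: I just noticed that you asked almost the same question earlier. Was there something wrong with dimamah's answer to that question? In fact, he and I independently arrived at the same query with different variable names, which gives me confidence that this is the canonical approach.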

Answer 2

I don't know what 'map-reduce' is, but here is a solution using a heap.

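from collections import defaultdict
from heapq import heappush, heappushpop

top_items = defaultdict(list)  # user -> min-heap of (score, item)
max_items = 5
data_file = 'data_file'

with open(data_file) as f:
    for line in f:
        # Fields may carry surrounding spaces and a trailing newline.
        user, item, score = (field.strip() for field in line.split(','))
        entry = (float(score), item)  # assumes numeric scores and no header row
        if len(top_items[user]) < max_items:
            heappush(top_items[user], entry)
        else:
            # Push first, then pop the smallest, so a low score never evicts a higher one.
            heappushpop(top_items[user], entry)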

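top_items will then hold, for each user, up to five (score, item) tuples for that user's top items. The heap itself is not in sorted order; use sorted(top_items[user], reverse=True) to list the tuples from highest to lowest score.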
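
If the rows for each user fit in memory, a shorter variant is to collect them first and let heapq.nlargest do the selection, which also returns the entries already ordered from highest to lowest score. This is a sketch under that assumption, not a replacement for the bounded heap above; the function name top_n_per_user and the sample rows are made up for illustration.

```python
import heapq
from collections import defaultdict


def top_n_per_user(rows, n=5):
    """Return {user: [item, ...]} with at most n items, highest score first."""
    by_user = defaultdict(list)
    for user, item, score in rows:
        by_user[user].append((score, item))
    # nlargest returns the n largest tuples in descending order.
    return {
        user: [item for _score, item in heapq.nlargest(n, pairs)]
        for user, pairs in by_user.items()
    }


# Hypothetical rows, already parsed into (user, item, numeric score).
rows = [("U1", "I1", 3.0), ("U1", "I2", 1.0), ("U2", "I1", 2.0), ("U1", "I4", 5.0)]
top_two = top_n_per_user(rows, n=2)
```

The trade-off is memory: this version buffers every row per user, whereas the bounded heap never holds more than five entries per user.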

Building a recommendation list with map-reduce
