Question

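I am trying to write a mapper and a reducer in Python for some data. I have written the mapper code, which outputs the store name and the cost of each transaction made at that store.

For example:

Nike $45.99
Adidas $72.99
Puma $56.99
Nike $109.99
Adidas $85.99

Here the key is the store name and the value is the transaction cost. Now I am trying to write the reducer code, which should compare the transaction costs for each store and output the highest transaction per store.

The output I want is:

Nike $109.99
Adidas $85.99
Puma $56.99

My question is: how do I compare the different values given for one key in Python?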

Answer 1

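Well, the MapReduce paradigm is that each mapper outputs key-value pairs in an exact, agreed format.

As for the reducer, the Hadoop framework's shuffle-and-sort step guarantees that each reducer receives all the values for a given key, so two different reducers will never get different entries for the same key.

A single reducer may, however, have several keys to process.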
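To make that guarantee concrete, here is a small local simulation of the shuffle-and-sort step. It is only a sketch: the function name `shuffle_sort` and the in-memory list of pairs are illustrative assumptions, not part of Hadoop's API.

```python
from itertools import groupby
from operator import itemgetter


def shuffle_sort(pairs):
    """Group (key, value) pairs by key, mimicking what a reducer sees."""
    # Sorting by key brings all entries for one store next to each other.
    ordered = sorted(pairs, key=itemgetter(0))
    return {key: [value for _, value in group]
            for key, group in groupby(ordered, key=itemgetter(0))}


# Hypothetical mapper output, taken from the example in the question.
mapper_output = [("Nike", 45.99), ("Adidas", 72.99), ("Puma", 56.99),
                 ("Nike", 109.99), ("Adidas", 85.99)]

grouped = shuffle_sort(mapper_output)
print(grouped)
```

Every value for a store ends up in one list, which is the view a single reducer would have of that key.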

As for your question, let's assume you have 3 different values for the same key, for example:

Nike $109.99
Nike $45.99
Nike $294.99

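Conceptually, the comparison proceeds pairwise. The reducer first looks at 2 values, so for this key the reduce function gets:

$109.99
$45.99

It needs to output the higher of the two using a simple comparison, so the output is $109.99. That becomes the input to the second run of the reduce function, which this time gets:

$109.99
$294.99

Again, using a comparison, you output the highest value, which is $294.99.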
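The pairwise comparison described above is a fold over the values, and it can be sketched with `functools.reduce`. The helper name `keep_larger` is an assumption made for this illustration; the amounts are the ones from the example.

```python
from functools import reduce


def keep_larger(best_so_far, candidate):
    """Compare two amounts and keep the higher one."""
    return candidate if candidate > best_so_far else best_so_far


nike_values = [109.99, 45.99, 294.99]

# First call compares 109.99 with 45.99, second compares the winner with 294.99.
highest = reduce(keep_larger, nike_values)
print(highest)
```

In practice the built-in `max(nike_values)` gives the same result; the explicit fold only mirrors the step-by-step description.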

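As for the code, you need a very simple function, for example:

Edit: I assumed your delimiter is a tab, but you can change the format to whatever you are using.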

#!/usr/bin/env python

import sys

current_word = None
current_max_count = 0
word = None

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()

    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)

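    # convert count (currently a string such as "$45.99") to float;
    # int() would reject both the dollar sign and the decimal part
    try:
        count = float(count.replace('$', ''))
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        continue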

    # this IF-switch only works because Hadoop sorts map output
    # by key (here: word) before it is passed to the reducer
    if current_word == word:
        if count > current_max_count:
            current_max_count = count
    else:
        if current_word:
            # write result to STDOUT
            print('%s\t%s' % (current_word, current_max_count))
        current_max_count = count
        current_word = word

# do not forget to output the last word if needed!
if current_word == word:
    print('%s\t%s' % (current_word, current_max_count))
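The same running-maximum logic can be wrapped in a function so it can be tried without Hadoop. This is a sketch under a few assumptions: the name `max_per_key` is made up for illustration, the input is tab-separated, amounts carry a leading `$` as in the question, and a plain `sorted()` call stands in for Hadoop's sort by key.

```python
def max_per_key(sorted_lines, sep="\t"):
    """Yield (store, highest amount) from lines already sorted by store."""
    current_store = None
    current_max = None
    for line in sorted_lines:
        line = line.strip()
        if not line:
            continue
        store, amount = line.split(sep, 1)
        amount = float(amount.replace("$", ""))
        if store == current_store:
            # Same store as the previous line: keep the larger amount.
            current_max = max(current_max, amount)
        else:
            # New store: emit the finished one, then start over.
            if current_store is not None:
                yield current_store, current_max
            current_store, current_max = store, amount
    # Emit the last store, if there was any input at all.
    if current_store is not None:
        yield current_store, current_max


sample = ["Nike\t$45.99", "Adidas\t$72.99", "Puma\t$56.99",
          "Nike\t$109.99", "Adidas\t$85.99"]

results = list(max_per_key(sorted(sample)))
for store, amount in results:
    print("%s\t$%.2f" % (store, amount))
```

Sorting the sample first matters: the function only compares a line with the one before it, so lines for the same store have to be adjacent.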

Answer 2

def largets_value(_dict):
    # map each key to the largest value in its list
    d = {}
    for k, values in _dict.items():
        d[k] = max(values)
    return d

def dict_from_txt(file, sep):
    d = {}
    # open in text mode so str.replace works, and close the file when done
    with open(file) as fh:
        f = [x.rstrip().replace('$', '').split(sep) for x in fh]
    for i in f:
        if i[0] in d:
            d[i[0]].append(float(i[1]))
        else:
            d[i[0]] = [float(i[1])]
    return d

def dict_from_iterable(iterable, sep):
    d = {}
    f = [x.rstrip().replace('$', '').split(sep) for x in iterable]
    for i in f:
        if i[0] in d:
            d[i[0]].append(float(i[1]))
        else:
            d[i[0]] = [float(i[1])]
    return d

data = ['Nike $45.99',
        'Adidas $72.99',
        'Puma $56.99',
        'Nike $109.99',
        'Adidas $85.99']
print(largets_value(dict_from_iterable(data, ' ')))
# Uncomment the next line and delete the previous one to use your own file
#print(largets_value(dict_from_txt('my_file', ' ')))
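A variant of the dictionary approach keeps only the running maximum per store instead of a list of every amount, which uses less memory on large inputs. This is a sketch, not the answer's own method: the function name `max_by_store` is hypothetical, and it assumes space-separated lines with a `$` prefix as in the sample data.

```python
def max_by_store(lines, sep=" "):
    """Return {store: highest amount}, holding one number per store."""
    best = {}
    for line in lines:
        store, amount = line.rstrip().split(sep, 1)
        value = float(amount.replace("$", ""))
        # Replace the stored amount only when the new one is higher.
        if store not in best or value > best[store]:
            best[store] = value
    return best


sample_lines = ["Nike $45.99",
                "Adidas $72.99",
                "Puma $56.99",
                "Nike $109.99",
                "Adidas $85.99"]

highest_by_store = max_by_store(sample_lines)
print(highest_by_store)
```

Unlike a streaming reducer, a dictionary does not need sorted input, but it does hold one entry per distinct store in memory.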

Answer 3

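Hadoop should sort the mapper's output before passing it to the reducer. Given that, you can use itertools.groupby() to group lines with the same key together and then pick the maximum value from each group: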

#!/usr/bin/env python

import sys
from itertools import groupby

for store, transactions in groupby((line.split() for line in sys.stdin),
                                   key=lambda line: line[0]):
    print(store, max(float(amount[1].replace('$', '')) for amount in transactions))

This of course assumes that your mapper's output consists of two whitespace-separated fields: the store and the transaction value.
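The reliance on sorted input is worth seeing in isolation, because `itertools.groupby()` only merges consecutive items with the same key. The sketch below uses a hypothetical helper, `group_keys`, and a shortened sample to show what happens with and without sorting.

```python
from itertools import groupby


def group_keys(lines):
    """Return the key of each group that groupby() produces, in order."""
    fields = (line.split() for line in lines)
    return [key for key, _ in groupby(fields, key=lambda f: f[0])]


unsorted_lines = ["Nike $45.99", "Adidas $72.99", "Nike $109.99"]

unsorted_keys = group_keys(unsorted_lines)
sorted_keys = group_keys(sorted(unsorted_lines))

print(unsorted_keys)  # Nike shows up as two separate groups
print(sorted_keys)    # one group per store
```

When testing such a reducer locally, the input should therefore be sorted first, as Hadoop would do before the reduce phase.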

Comparing multiple values given for one key in Python
