Question

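I'm trying to implement an inverted search as part of a map reduce job. I've completed the first part, the mapper. Its output looks like the following (the header is for reference only and is not part of the mapper's actual output):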

word     frequency     document
------------------------------
tire        1           car
headlight   1           shop
tire        1           car
gas         1           gasstation
beer        1           gasstation
headlight   1           car
tire        1           shop

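I'm trying to arrive at the following result:

For each word, the documents in which it was found and its frequency in each (for example, tire is found twice in the car document).

So far I've tried using a dictionary to collect the documents each word was found in, but I can't link that to the counts. This is the output I get: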

{'car':[tire,tire,headlight],'shop':[headlight],'gasstation':[gas,beer]}

Expected:

tire           {'car':2,'shop':1}
headlight      {'car':1, 'shop':1}

Answer 1

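What you want is to reduce the list into a dictionary that groups its elements by word and then by document.

Assuming your mapper output is a list of dictionaries like this: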

mapped_data = [
    { 'word': 'tire', 'frequency': 1, 'document': 'car' },
    { 'word': 'headlight', 'frequency': 1, 'document': 'shop' }
]
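If the mapper emits plain text lines rather than dictionaries, they first need to be converted into that shape. The following is a minimal sketch, assuming each line holds exactly three whitespace-separated fields in the order word, frequency, document, as in the question's table. The helper name parse_mapper_output is hypothetical and not part of the original answer.

```python
def parse_mapper_output(lines):
    """Turn 'word frequency document' lines into a list of dicts.

    Assumption: fields are whitespace-separated and contain no spaces.
    """
    records = []
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        word, frequency, document = parts
        records.append({
            'word': word,
            'frequency': int(frequency),
            'document': document,
        })
    return records


sample_lines = [
    "tire        1           car",
    "headlight   1           shop",
]
mapped_data = parse_mapper_output(sample_lines)
```

Converting the frequency to an int here means later steps can add the counts directly.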

Then you can do this:

# On Python 3, reduce lives in functools rather than the builtins.
from functools import reduce

def reducer(accumulated, line):
    # We've never seen this word before, create the dict to store the documents
    if line['word'] not in accumulated:
        accumulated[line['word']] = {}

    # We've never seen this word in this document before, initialize the counter.
    if line['document'] not in accumulated[line['word']]:
        accumulated[line['word']][line['document']] = 0

    # Increment the counter
    accumulated[line['word']][line['document']] += line['frequency']

    return accumulated

reduce(reducer, mapped_data, {})

When run on the full mapper output from the question, this produces the expected result:

{
    'tire': {
        'car': 2,
        'shop': 1
    },
    'headlight': {
        ...
    },
    ...
}
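As an alternative to the explicit "have we seen this key" checks, the same grouping can be sketched with collections.defaultdict and collections.Counter from the standard library. This is not the approach from the answer above, just an equivalent formulation. The function name build_inverted_index is hypothetical, and the records are the seven rows from the question's table.

```python
from collections import Counter, defaultdict


def build_inverted_index(records):
    """Group records by word, then count occurrences per document."""
    # Missing words get an empty Counter; missing documents count from 0.
    index = defaultdict(Counter)
    for record in records:
        index[record['word']][record['document']] += record['frequency']
    # Convert to plain dicts so the result prints like the expected output.
    return {word: dict(counts) for word, counts in index.items()}


# The mapper output from the question as (word, frequency, document) rows.
rows = [
    ('tire', 1, 'car'),
    ('headlight', 1, 'shop'),
    ('tire', 1, 'car'),
    ('gas', 1, 'gasstation'),
    ('beer', 1, 'gasstation'),
    ('headlight', 1, 'car'),
    ('tire', 1, 'shop'),
]
records = [{'word': w, 'frequency': f, 'document': d} for w, f, d in rows]

index = build_inverted_index(records)
print(index['tire'])  # {'car': 2, 'shop': 1}
```

Counter returns 0 for a missing key, which is what lets the += line work without any initialization.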
