Question

Sorry for the confusing title; this is hard to describe...

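What I want to do is take sequences of words as input to a Hadoop job and output lines of the following form:

lowercase-sequence frequency-of-lowercase-sequence sequence frequency-of-sequence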

I think an example explains it best.

Suppose my input data is:

the sun
the sun
the sun
The sun
The sun
The Sun

I want to end up with:

the sun 6 the sun 3
the sun 6 The sun 2
the sun 6 The Sun 1

How can I reduce on both the frequency of the lowercase sequence and the frequency of the original sequence?

Answer 1

In your map function, emit: output key: sequence.toLowerCase(); output value: sequence (as is).

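In the reduce function, for each value:

Map<String, Integer> occurrences = new HashMap<String, Integer>();
int total = 0;                                // frequency of the lowercase key
for (Text value : values) {
    String v = value.toString();
    total++;
    Integer count = occurrences.get(v);       // null the first time v is seen
    occurrences.put(v, count == null ? 1 : count + 1);
}

This is only a sketch; it assumes the reducer receives the values as Text. The null check is needed because occurrences.get(...) returns null the first time a spelling is seen, and adding 1 to that would throw a NullPointerException. The total is kept in its own counter rather than under the key, so that the all-lowercase spelling's own count (3 in the example) is not mixed up with the total (6). As a result you get the total for the lowercase sequence, plus a map from each upper/lower-case variant of that sequence to its count.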
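To check the idea against the example in the question without a cluster, here is a minimal plain-Java sketch that simulates the map, shuffle and reduce steps in memory. It is not Hadoop code: the class name CaseFrequency and the methods mapAndShuffle, reduce and run are hypothetical names chosen for illustration, and a real job would put the same logic into Mapper and Reducer subclasses.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory simulation of the map/reduce logic (no Hadoop dependency).
class CaseFrequency {

    // "Map" + "shuffle": group every original sequence under its lowercase form.
    static Map<String, List<String>> mapAndShuffle(List<String> input) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String sequence : input) {
            grouped.computeIfAbsent(sequence.toLowerCase(), k -> new ArrayList<>())
                   .add(sequence);
        }
        return grouped;
    }

    // "Reduce" for one lowercase key: count the total and each original spelling,
    // then emit one output line per spelling.
    static List<String> reduce(String key, List<String> values) {
        int total = 0;
        // LinkedHashMap keeps spellings in the order they were first seen.
        Map<String, Integer> occurrences = new LinkedHashMap<>();
        for (String value : values) {
            total++;
            occurrences.merge(value, 1, Integer::sum); // no null check needed
        }
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, Integer> e : occurrences.entrySet()) {
            lines.add(key + " " + total + " " + e.getKey() + " " + e.getValue());
        }
        return lines;
    }

    // Run both steps over the whole input.
    static List<String> run(List<String> input) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : mapAndShuffle(input).entrySet()) {
            out.addAll(reduce(e.getKey(), e.getValue()));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
            "the sun", "the sun", "the sun", "The sun", "The sun", "The Sun");
        run(input).forEach(System.out::println);
    }
}
```

The output lines can only be written after all values for a key have been counted, because the total is not known until then. The per-key map therefore has to fit in memory, which should be fine as long as a sequence has only a handful of case variants.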

Reducing a line twice in Hadoop
