Question

I have a mapper that emits each letter in a sentence as the key, with the number 1 as its value. For example, for 'How are you' my mapper outputs:

H 1
o 1
w 1
a 1
r 1
e 1
y 1
o 1
u 1

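My reducer consumes this and uses the 1s to count the occurrences of each letter. For example, it outputs the letter 'o' as a key with 2 as its value, because it occurs twice.

My problem is that I want to compute the frequency of each letter in a sentence. To do that, I need access to the total number of letters in the sentence (the number of mapper outputs). I'm new to MapReduce, so I'm not sure of the best way to do this.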

Answer 1

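Assuming your mapper receives a complete sentence, that you want per-letter frequencies, and that you are using the Java API, you can emit two key-value pairs per letter from the mapper via context.write(...):

Java signature of the mapper: public void map(LongWritable key, Text value, Context context)

key: <lineNo_letter>; value: c_n
key: <lineNo_letter>; value: t_m

where

lineNo = same as key to the mapper (the first parameter to the above function; with the default TextInputFormat this is the line's byte offset, which still identifies the line)
letter = your desired letter
m = total number of letters in the mapper's input line (the 2nd parameter to the above function)
n = number of occurrences of the letter in the mapper's input line (the 2nd parameter to the above function)

c_ and t_ are just prefixes that identify the kind of count: c marks the number of occurrences of the letter, while t marks the total number of letters.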

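The idea we are exploiting here is that a mapper or reducer may write as many key-value pairs as it likes.

The reducer will now receive something like: key: <lineNo_letter>, values: ListOf[c_n, t_m]

Now just iterate over the list, split each value on the separator _, and use the identifier prefix (t or c) to tell the two apart; you then have the required values in the reducer, i.e.: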

Total number of letter in the sentence = m
Total number of occurrence of the letter = n

Edit: adding pseudo-logic

As an example, suppose the input to the mapper function public void map(LongWritable key, Text value, Context context) is

LongWritable key = 1
Text value = howareyou

The mapper's output should be:

-- Output length of the Text Value against each letter
context.write("1_h", "t_9");
context.write("1_o", "t_9");
context.write("1_w", "t_9");
context.write("1_a", "t_9");
context.write("1_r", "t_9");
context.write("1_e", "t_9");
context.write("1_y", "t_9");
context.write("1_u", "t_9");

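Note that the output above is written once per distinct letter in the mapper. That is why the letter o is output only once (even though it occurs twice in the input).

The mapper additionally outputs

-- Output individual letter count in the input text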
context.write("1_h", "c_1");
context.write("1_o", "c_2");
context.write("1_w", "c_1");
context.write("1_a", "c_1");
context.write("1_r", "c_1");
context.write("1_e", "c_1");
context.write("1_y", "c_1");
context.write("1_u", "c_1");

Likewise, you can see that the value for the letter o is c_2, because it occurs twice in the sentence.
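The two groups of mapper output above can be sketched in plain Java, without any Hadoop dependency, to check the logic before wiring it into a real map() method. This is only an illustration under stated assumptions: the class name MapperSketch and the method emit are hypothetical, each returned entry stands in for one context.write(key, value) call, and lowercasing letters and skipping non-letter characters are choices made here, not something the answer specifies.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MapperSketch {
    // Builds the pairs a mapper would emit for one input line.
    // Each entry stands in for one context.write(key, value) call.
    public static List<Map.Entry<String, String>> emit(long lineNo, String line) {
        // Count occurrences per letter, keeping first-seen order.
        Map<Character, Integer> counts = new LinkedHashMap<>();
        int total = 0;
        for (char ch : line.toCharArray()) {
            if (Character.isLetter(ch)) {          // assumption: ignore spaces, digits, punctuation
                counts.merge(Character.toLowerCase(ch), 1, Integer::sum);
                total++;
            }
        }
        List<Map.Entry<String, String>> out = new ArrayList<>();
        for (Map.Entry<Character, Integer> e : counts.entrySet()) {
            String key = lineNo + "_" + e.getKey();
            // One "t_" pair (letters in the line) and one "c_" pair (this letter's count) per distinct letter.
            out.add(new AbstractMap.SimpleEntry<>(key, "t_" + total));
            out.add(new AbstractMap.SimpleEntry<>(key, "c_" + e.getValue()));
        }
        return out;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, String> p : emit(1, "howareyou")) {
            System.out.println(p.getKey() + "\t" + p.getValue());
        }
    }
}
```

In a real job the body of emit would live inside map(), with each pair written as Text objects instead of being collected in a list.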

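The reduce function will now be called 8 times, once per key, and each call will receive one of the following key/value-list pairs: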

key: "1_h" value: ListOf["t_9", "c_1"]
key: "1_o" value: ListOf["t_9", "c_2"]
key: "1_w" value: ListOf["t_9", "c_1"]
key: "1_a" value: ListOf["t_9", "c_1"]
key: "1_r" value: ListOf["t_9", "c_1"]
key: "1_e" value: ListOf["t_9", "c_1"]
key: "1_y" value: ListOf["t_9", "c_1"]
key: "1_u" value: ListOf["t_9", "c_1"]

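Now, in each reduce call, split the key to get the line number and the letter. Iterate over the list of values to extract the total and the letter's occurrence count.

Frequency of the letter h in line 1 = (double) Integer.parseInt("c_1".split("_")[1]) / Integer.parseInt("t_9".split("_")[1]) (the cast avoids integer division, which would yield 0)

This is pseudo-logic for you to implement.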
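The reducer side of this pseudo-logic might look roughly like the following plain-Java sketch. The class name ReducerSketch and the method frequency are hypothetical, and the Iterable<String> parameter stands in for the Iterable<Text> a real reduce() receives. The sketch does not rely on the order of the values, since it identifies each one by its prefix.

```java
import java.util.Arrays;

public class ReducerSketch {
    // Computes count / total from prefixed values such as ["t_9", "c_2"].
    public static double frequency(Iterable<String> values) {
        long count = 0;
        long total = 0;
        for (String v : values) {
            String[] parts = v.split("_", 2);      // prefix, number
            long n = Long.parseLong(parts[1]);
            if (parts[0].equals("c")) {
                count = n;                          // occurrences of this letter
            } else if (parts[0].equals("t")) {
                total = n;                          // letters in the whole line
            }
        }
        if (total == 0) {
            throw new IllegalArgumentException("no t_ value found for this key");
        }
        return (double) count / total;              // cast avoids integer division
    }

    public static void main(String[] args) {
        String key = "1_o";
        String[] keyParts = key.split("_", 2);      // line number, letter
        double f = frequency(Arrays.asList("t_9", "c_2"));
        System.out.println("line " + keyParts[0] + ", letter " + keyParts[1] + ": " + f);
    }
}
```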

Answer 2

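Don't write each letter out as soon as you see it. Count all the characters first, then write that total out along with each character.

Then, depending on how you write the value, your reducer will see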

o, [(1,9), (1,9)]

Sum the ones, extract any one of the 9s, and divide.
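A minimal plain-Java sketch of this idea follows, assuming a single sentence per job and a value encoded as the string "1,total"; the answer leaves the encoding open, so that format, the class name CountWithTotalSketch, and both method names are assumptions made for illustration. The grouping map stands in for Hadoop's shuffle phase.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CountWithTotalSketch {
    // Map side: count the letters first, then emit (letter, "1,total") once per occurrence.
    // Collecting values per key stands in for the shuffle.
    public static Map<Character, List<String>> mapAndGroup(String sentence) {
        int total = 0;
        for (char ch : sentence.toCharArray()) {
            if (Character.isLetter(ch)) {
                total++;
            }
        }
        Map<Character, List<String>> grouped = new LinkedHashMap<>();
        for (char ch : sentence.toCharArray()) {
            if (Character.isLetter(ch)) {
                grouped.computeIfAbsent(Character.toLowerCase(ch), k -> new ArrayList<>())
                       .add("1," + total);
            }
        }
        return grouped;
    }

    // Reduce side: sum the ones, take any one of the totals, divide.
    public static double reduce(List<String> values) {
        long sum = 0;
        long total = 0;
        for (String v : values) {
            String[] parts = v.split(",");
            sum += Long.parseLong(parts[0]);
            total = Long.parseLong(parts[1]);       // every value carries the same total
        }
        return (double) sum / total;
    }

    public static void main(String[] args) {
        Map<Character, List<String>> grouped = mapAndGroup("how are you");
        for (Map.Entry<Character, List<String>> e : grouped.entrySet()) {
            System.out.println(e.getKey() + "\t" + reduce(e.getValue()));
        }
    }
}
```

With several sentences in one job, the key would also need to carry a sentence identifier, as in Answer 1; otherwise totals from different sentences end up under the same letter.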

Answer 3

Figured it out myself: use the global counter MAP_OUTPUT_RECORDS to get the total number of mapper outputs inside the reducer.

Code:

Configuration conf = context.getConfiguration();
Cluster cluster = new Cluster(conf);
Job currentJob = cluster.getJob(context.getJobID());
long totalCharacters = currentJob.getCounters().findCounter(TaskCounter.MAP_OUTPUT_RECORDS).getValue();

Hadoop MapReduce: accessing the number of mapper outputs in the reducer
