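Is it possible to append a column in MapReduce while processing data? Example:
I have an input data set with 3 columns [EMPID, EMPNAME, EMP_DEPT] and I want to process it with MapReduce. In the reduce phase, is it possible to add a new column, for example TIMESTAMP (the system timestamp at the moment the record is processed)? The reducer output should be EMPID, EMPNAME, EMP_DEPT, TIMESTAMP.
Input data: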
EMPID EMPNAME EMP_DEPT
1 David HR
2 Sam IT
Output data:
EMPID EMPNAME EMP_DEPT Timestamp
1 David HR XX:XX:XX:XX
2 Sam IT XX:XX:XX:XX
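Answer 0 (score: 1)
The purpose of your MapReduce job seems to be just adding the timestamp "column" (judging by your input and output examples, the EMPID, EMPNAME and EMP_DEPT fields undergo no other modification, transformation or processing). If that is the case, the only thing you need to do is append the timestamp to each line ("row") read in the mapper, and then let the reducer join all the new "rows". The workflow: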
Each input file is split into many chunks:
(input file) --> splitter --> split1, split2, ..., splitN
The content of each split is:
split1 = {"1 David HR", "2 Sam IT"}
split2 = {...}
Splits are assigned to mappers (one per split), which output (key, value) pairs; in this case, a single common key for all the pairs is enough:
(offset, "1 David HR") --> mapper1 --> ("key", "1 David HR 2015-06-13 12:23:45.242")
(offset, "2 Sam IT") --> mapper1 --> ("key", "2 Sam IT 2015-06-13 12:23:45.243")
...
"..." --> mapper2 --> ...
...
For each distinct key, the reducer receives an array with all the values emitted by the mappers under that key:
("key", ["1 David HR 2015-06-13 12:23:45.242", "2 Sam IT 2015-06-13 12:23:45.243"]) --> reducer --> (output file)
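The workflow above can be sketched in plain Python. This is only a local simulation of the map and reduce steps, not Hadoop API code: the names `mapper`, `reducer` and `COMMON_KEY` are made up for illustration, the timestamp format is copied from the example values above, and the input is assumed to contain no header row. In a real job the framework performs the sort and grouping between the two phases.

```python
from datetime import datetime
from itertools import groupby

COMMON_KEY = "key"  # single shared key, as in the workflow; the name is arbitrary


def mapper(lines, now=datetime.now):
    """Append a processing timestamp to each record and emit (key, value)."""
    for line in lines:
        record = line.strip()
        if not record:
            continue  # skip blank lines
        # Trim microseconds to milliseconds to match the example format.
        stamp = now().strftime("%Y-%m-%d %H:%M:%S.%f")[:-3]
        yield COMMON_KEY, f"{record} {stamp}"


def reducer(pairs):
    """Group the pairs by key and yield the stamped rows of each group."""
    by_key = lambda kv: kv[0]
    # sorted() is stable, so rows keep their arrival order within a key.
    for _key, group in groupby(sorted(pairs, key=by_key), key=by_key):
        for _, value in group:
            yield value


if __name__ == "__main__":
    rows = ["1 David HR", "2 Sam IT"]
    for out in reducer(mapper(rows)):
        print(out)
```

The `now` parameter exists only so the clock can be replaced with a fixed value when checking the output.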
If your goal is to process the original data in some way beyond adding the timestamp, do that processing in the mapper as well.
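As a rough illustration of that last point, the sketch below does some per-record processing in the map step before appending the timestamp. The transformation itself (upper-casing the department code) is purely hypothetical, as are the names `transform` and `map_record`; it also assumes whitespace-separated fields and names without spaces.

```python
from datetime import datetime


def transform(fields):
    """Hypothetical per-record processing: normalize the department code."""
    emp_id, emp_name, emp_dept = fields
    return [emp_id, emp_name, emp_dept.upper()]


def map_record(line, now=datetime.now):
    """Process one input line, then append the timestamp as a fourth column."""
    fields = transform(line.split())
    # Millisecond precision, matching the example values in the workflow.
    stamp = now().strftime("%Y-%m-%d %H:%M:%S.%f")[:-3]
    return " ".join(fields + [stamp])


if __name__ == "__main__":
    print(map_record("2 Sam it"))
```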