Adding a New Column to Data with MapReduce

Time: 2015-06-12 10:54:25

Tags: hadoop mapreduce

Is it possible to append a column in MapReduce while processing the data? Example:

I have an input data set with 3 columns [EMPID, EMPNAME, EMP_DEPT] that I want to process with MapReduce. In the reduce phase, can I add a new column, e.g. TIMESTAMP (the system timestamp at processing time)? The reducer output should be EMPID, EMPNAME, EMP_DEPT, TIMESTAMP.

Input data:

EMPID EMPNAME EMP_DEPT
1       David   HR
2       Sam     IT

Output data:

EMPID EMPNAME EMP_DEPT Timestamp
1       David   HR      XX:XX:XX:XX
2       Sam     IT      XX:XX:XX:XX

1 Answer:

Answer 0 (score: 1)

It seems the only purpose of the MapReduce job is to add the timestamp "column" (judging from your input and output examples, the EMPID, EMPNAME and EMP_DEPT fields undergo no other modification/transformation/processing). If that is the case, all you need to do is append the timestamp to each line ("row") read by the mapper, and then have the reducer gather all the new "rows". The workflow:

Each input file is split into many chunks:
(input file) --> splitter --> split1, split2, ..., splitN

Each split content is:
split1 = {"1 David HR", "2 Sam IT"}
split2 = {...}

Splits are assigned to mappers (one per split), which emit (key, value) pairs; in this case, a single common key for all the pairs is enough (a sketch of such a mapper follows below):
(offset, "1 David HR") --> mapper1 --> ("key", "1 David HR 2015-06-13 12:23:45.242")
(offset, "2 Sam IT") --> mapper1 --> ("key", "2 Sam IT 2015-06-13 12:23:45.243")
...
"..." --> mapper2 --> ...
...
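
A minimal Java sketch of such a mapper, using the org.apache.hadoop.mapreduce API (the class name TimestampMapper, the shared key "key", and the tab-separated output format are assumptions for illustration, not part of the original answer):

import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.Date;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Appends the current system timestamp to each input line and emits it
// under a single shared key, so the reducer later sees all rows together.
public class TimestampMapper extends Mapper<LongWritable, Text, Text, Text> {

    private static final Text COMMON_KEY = new Text("key");
    private final SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS");

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Illustrative layout: original row, a tab, then the timestamp.
        context.write(COMMON_KEY, new Text(line.toString() + "\t" + fmt.format(new Date())));
    }
}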

The reducer receives, for each distinct key, the list of all pairs emitted by the mappers under that key:
("key", ["1 David HR 2015-06-13 12:23:45.242", "2 Sam IT 2015-06-13 12:23:45.243"]) --> reducer --> (output file)

If your goal is to eventually process the original data in some way beyond adding the timestamp, do that processing in the mapper as well.
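
For completeness, a possible driver wiring the two sketch classes into a job (the class names, job name, and use of command-line arguments for the paths are illustrative assumptions):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class AddTimestampJob {
    public static void main(String[] args) throws Exception {
        // args[0] = input path, args[1] = output path (must not exist yet).
        Job job = Job.getInstance(new Configuration(), "add timestamp column");
        job.setJarByClass(AddTimestampJob.class);
        job.setMapperClass(TimestampMapper.class);
        job.setReducerClass(TimestampReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Note that routing every row through one common key sends all rows to a single reduce call, which matches the workflow above but does not scale; since the mapper already adds the timestamp, a map-only job (zero reducers) would achieve the same result.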