Adding a New Column to Data with MapReduce

Time: 2015-06-12 10:54:25

Tags: hadoop mapreduce

Is it possible to append a column in MapReduce while processing the data? Example:

I have an input data set with 3 columns [EMPID, EMPNAME, EMP_DEPT] that I want to process with MapReduce. In the reduce phase, can I add a new column, e.g. TIMESTAMP (the system timestamp at processing time)? The reducer output should be EMPID, EMPNAME, EMP_DEPT, TIMESTAMP.

Input data:

EMPID EMPNAME EMP_DEPT
1       David   HR
2       Sam     IT

Output data:

EMPID EMPNAME EMP_DEPT Timestamp
1       David   HR      XX:XX:XX:XX
2       Sam     IT      XX:XX:XX:XX

1 Answer:

Answer 0 (score: 1)

It seems the only purpose of the MapReduce job is to add the timestamp "column" (judging from your input and output examples, the EMPID, EMPNAME and EMP_DEPT fields undergo no other modification/transformation/processing). If that is the case, all you need to do is append the timestamp to each line ("row") read by the mapper, and then have the reducer gather all the new "rows". The workflow:

Each input file is split into many chunks:
(input file) --> splitter --> split1, split2, ..., splitN

Each split content is:
split1 = {"1 David HR", "2 Sam IT"}
split2 = {...}

Splits are assigned to mappers (one per split), which emit (key, value) pairs; in this case, a single common key for all the pairs is enough (a sketch of such a mapper follows below):
(offset, "1 David HR") --> mapper1 --> ("key", "1 David HR 2015-06-13 12:23:45.242")
(offset, "2 Sam IT") --> mapper1 --> ("key", "2 Sam IT 2015-06-13 12:23:45.243")
...
"..." --> mapper2 --> ...
...
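
A minimal Java sketch of such a mapper, using the org.apache.hadoop.mapreduce API (the class name TimestampMapper, the shared key "key", and the tab-separated output format are assumptions for illustration, not part of the original answer):

import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.Date;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Appends the current system timestamp to each input line and emits it
// under a single shared key, so the reducer later sees all rows together.
public class TimestampMapper extends Mapper<LongWritable, Text, Text, Text> {

    private static final Text COMMON_KEY = new Text("key");
    private final SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS");

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Illustrative layout: original row, a tab, then the timestamp.
        context.write(COMMON_KEY, new Text(line.toString() + "\t" + fmt.format(new Date())));
    }
}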

The reducer receives, for each distinct key, the list of all pairs emitted by the mappers under that key:
("key", ["1 David HR 2015-06-13 12:23:45.242", "2 Sam IT 2015-06-13 12:23:45.243"]) --> reducer --> (output file)

If your goal is to eventually process the original data in some way beyond adding the timestamp, do that processing in the mapper as well.
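
For completeness, a possible driver wiring the two sketch classes into a job (the class names, job name, and use of command-line arguments for the paths are illustrative assumptions):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class AddTimestampJob {
    public static void main(String[] args) throws Exception {
        // args[0] = input path, args[1] = output path (must not exist yet).
        Job job = Job.getInstance(new Configuration(), "add timestamp column");
        job.setJarByClass(AddTimestampJob.class);
        job.setMapperClass(TimestampMapper.class);
        job.setReducerClass(TimestampReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Note that routing every row through one common key sends all rows to a single reduce call, which matches the workflow above but does not scale; since the mapper already adds the timestamp, a map-only job (zero reducers) would achieve the same result.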