I am trying to read the following data as key-value pairs in Hadoop.
name: "Clooney, George", release: "2013", movie: "Gravity",
name: "Pitt, Brad", release: "2004", movie: "Ocean's 12",
name: "Clooney, George", release: "2004", movie: "Ocean's 12",
name: "Pitt, Brad", release: "1999", movie: "Fight Club"
I need the output to look like this:
name: "Clooney, George", movie: "Gravity, Ocean's 12",
name: "Pitt, Brad", movie: "Ocean's 12, Fight Club",
I wrote a Mapper and a Reducer as follows:
public static class MyMapper
        extends Mapper<Text, Text, Text, Text> {
    private Text word = new Text();

    public void map(Text key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString(), ",");
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(key, word);
        }
    }
}
public static class MyReducer
        extends Reducer<Text, Text, Text, Text> {
    private Text result = new Text();

    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        String actors = "";
        for (Text val : values) {
            actors += val.toString();
        }
        result.set(actors);
        context.write(key, result);
    }
}
I also added the following configuration:
Configuration conf = new Configuration();
conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");
I get the following output:
name: "Clooney George" release: "2013" movie: "Gravity" George" release: "2004" movie: "Ocean's 12"
name: "Pitt Brad" release: "2004" movie: "Ocean's 12" Brad" release: "1999" movie: "Fight Club"
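It seems I cannot even get the basic key-value pairs right. How does key-value processing work in Hadoop? Could someone explain it and point out where I am going wrong?
Thanks. TM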
Answer 0 (score: 1)
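Your problem is that KeyValueTextInputFormat does not take the quotes in your input records into account. It only looks for the first occurrence of the separator you defined (the comma), then treats everything before that character as the key and everything after it as the value.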
So for the first record, your mapper receives the following input key and value:
name: "Clooney
George", release: "2013", movie: "Gravity",
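To make the split concrete, here is a minimal plain-Java sketch that imitates the first-separator behavior described above. It does not call any Hadoop code; the class name FirstSeparatorSplit and its split method are hypothetical and only illustrate the rule.

```java
// Hypothetical helper, not part of Hadoop: imitates splitting a line
// at the FIRST occurrence of the separator, ignoring any quotes.
final class FirstSeparatorSplit {

    // Returns {key, value}. If the separator is absent, the whole line
    // becomes the key and the value is empty.
    static String[] split(String line, char separator) {
        int pos = line.indexOf(separator);
        if (pos < 0) {
            return new String[] { line, "" };
        }
        return new String[] { line.substring(0, pos), line.substring(pos + 1) };
    }

    public static void main(String[] args) {
        String line = "name: \"Clooney, George\", release: \"2013\", movie: \"Gravity\",";
        String[] kv = split(line, ',');
        // The comma inside the quoted name is the first comma, so the
        // name itself is cut in two.
        System.out.println("key   = " + kv[0]);
        System.out.println("value = " + kv[1]);
    }
}
```

Because the first comma sits inside the quoted actor name, no choice of a single-character separator can recover the intended fields from this format.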
To fix this, I think you should switch back to TextInputFormat and move the extraction logic into your mapper's map method.
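As a rough sketch of what that extraction logic could look like, the plain-Java example below parses quoted fields with a regular expression and then groups movies by name. It has no Hadoop dependency so it can run on its own. RecordParser, parse and moviesByName are hypothetical names, and the sketch assumes every field value is wrapped in well-formed double quotes. In a real job, the body of parse would be called from map(), which would emit (name, movie), and the joining step would live in reduce().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper illustrating quote-aware extraction; not Hadoop API.
final class RecordParser {

    // Matches one field such as: name: "Clooney, George"
    // Commas inside the quotes stay part of the value.
    private static final Pattern FIELD =
            Pattern.compile("(\\w+):\\s*\"([^\"]*)\"");

    // Map-side step: turn one input line into field name -> field value.
    static Map<String, String> parse(String line) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            fields.put(m.group(1), m.group(2));
        }
        return fields;
    }

    // Stand-in for shuffle plus reduce: group movies by name, then join them.
    static Map<String, String> moviesByName(List<String> lines) {
        Map<String, List<String>> grouped = new LinkedHashMap<>();
        for (String line : lines) {
            Map<String, String> f = parse(line);
            String name = f.get("name");
            String movie = f.get("movie");
            if (name == null || movie == null) {
                continue; // skip records that lack either field
            }
            grouped.computeIfAbsent(name, k -> new ArrayList<>()).add(movie);
        }
        Map<String, String> joined = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : grouped.entrySet()) {
            joined.put(e.getKey(), String.join(", ", e.getValue()));
        }
        return joined;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "name: \"Clooney, George\", release: \"2013\", movie: \"Gravity\",",
                "name: \"Pitt, Brad\", release: \"2004\", movie: \"Ocean's 12\",",
                "name: \"Clooney, George\", release: \"2004\", movie: \"Ocean's 12\",",
                "name: \"Pitt, Brad\", release: \"1999\", movie: \"Fight Club\"");
        for (Map.Entry<String, String> e : moviesByName(lines).entrySet()) {
            System.out.println("name: \"" + e.getKey()
                    + "\", movie: \"" + e.getValue() + "\",");
        }
    }
}
```

Note that in an actual MapReduce job the order in which values reach the reducer is not guaranteed, so the movies may appear in a different order than in this local simulation.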