I am writing MapReduce code for a use case where I have to group the values that correspond to the same key:
INPUT:
A,B
A,C
B,A
B,D
OUTPUT:
A {B,C}
B {A,D}
I wrote the following:
import java.io.IOException;
import java.util.StringTokenizer;

/*
 * All org.apache.hadoop packages can be imported using the jar present in the lib
 * directory of this java project.
 */
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class GroupKeyValues {

    public static class Map extends Mapper<LongWritable, Text, Text, Text> {

        public void map(LongWritable key, Text value, Context con)
                throws IOException, InterruptedException {
            Text myKey = new Text();
            Text myVal = new Text();
            String line = value.toString();
            StringTokenizer st = new StringTokenizer(line);
            while (st.hasMoreTokens()) {
                String thisH = st.nextToken();
                String[] splitData = thisH.split(",");
                myKey.set(splitData[0]);
                myVal.set(splitData[1]);
            }
            con.write(myKey, myVal);
        }
    }

    @SuppressWarnings("deprecation")
    public static void main(String[] args) throws Exception {

        Configuration conf = new Configuration();
        @SuppressWarnings("deprecation")
        Job job = new Job(conf, "GroupKeyValues");
        job.setJarByClass(GroupKeyValues.class);
        job.setMapperClass(Map.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        Path outputPath = new Path(args[1]);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        outputPath.getFileSystem(conf).delete(outputPath);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Answer 0 (score: 1)
You are missing the reducer that merges the values into a single "row" value. For example, you can use ArrayWritable like this:
// Additional imports needed: java.util.ArrayList, java.util.List,
// org.apache.hadoop.io.ArrayWritable, org.apache.hadoop.mapreduce.Reducer
public static class AggregatingReducer extends Reducer<Text, Text, Text, ArrayWritable> {

    private ArrayWritable result = new ArrayWritable(Text.class);

    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        List<Text> list = new ArrayList<>();
        for (Text value : values) {
            // copy the value: Hadoop reuses the same Text instance across iterations
            list.add(new Text(value));
        }
        result.set(list.toArray(new Text[list.size()]));
        context.write(key, result);
    }
}
In the job setup, be sure to add the following:
job.setReducerClass(AggregatingReducer.class);
job.setOutputValueClass(ArrayWritable.class); //instead of Text.class
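One caveat: ArrayWritable does not override toString(), so if the job writes through TextOutputFormat the emitted value may not be printed in a readable form. A common workaround is to subclass it and format the output yourself; a minimal sketch (the class name TextArrayWritable is my own, not part of the answer above):

// Hypothetical helper: makes the array print as a comma-separated list
// when rendered by TextOutputFormat.
public static class TextArrayWritable extends ArrayWritable {

    public TextArrayWritable() {
        super(Text.class);
    }

    @Override
    public String toString() {
        // toStrings() returns the stored elements as plain Strings
        return String.join(",", super.toStrings());
    }
}

The reducer and the job's output value class would then use TextArrayWritable instead of ArrayWritable.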
Alternatively (depending on what you need), you can concatenate the reducer values with a StringBuilder and emit a Text, instead of accumulating them into an ArrayWritable and emitting that.
UPDATE: Here is an example that uses a StringBuilder with a comma delimiter:
public static class AggregatingReducer extends Reducer<Text, Text, Text, Text> {

    private Text result = new Text();

    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        StringBuilder sb = new StringBuilder();
        for (Text value : values) {
            if (sb.length() != 0) {
                sb.append(',');
            }
            sb.append(value);
        }
        result.set(sb.toString());
        context.write(key, result);
    }
}
In the driver, the value type needs to be changed back to Text:
job.setOutputValueClass(Text.class);
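Putting the two driver changes together with the question's existing setup, the relevant part of main() would look roughly like this (a sketch; only the reducer registration and the output value class change relative to the original):

job.setJarByClass(GroupKeyValues.class);
job.setMapperClass(Map.class);
job.setReducerClass(AggregatingReducer.class);  // new: groups values per key
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);            // Text for the StringBuilder variant
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);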
Answer 1 (score: 0)
Have you considered using Apache Spark to solve this problem? The code could look something like this:
import org.apache.spark.sql.functions._
val df = sqlContext.createDataFrame(Seq(("A","B"),("A","C"),("B","A"),("B","D")))
val dfAgg = df.groupBy("_1").agg(collect_list("_2"))
dfAgg.show()
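With the sample input, this should produce one row per key, e.g. A with the collected values [B, C] and B with [A, D]. Note that collect_list does not guarantee the order of elements within each group.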