Getting output for only one key in a map reduce program

Date: 2015-09-19 07:32:56

Tags: mapreduce

I am trying to write a MapReduce program that joins two text files. I only get output for one of the keys. For example, if I have a file S.txt with the data

a4 b3
a3 b4

and another file R.txt with the data
b3 c3
b3 c1
b3 c2
b4 c4

I get the output

a4 c2
a4 c1
a4 c3

If S.txt has a3 b4 and R.txt has b4 c4, the output is a3 c4.

Here is my program:

    import java.io.IOException;
    import java.util.*;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.conf.*;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapreduce.*;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class RSJoin {

        public static class SMap extends Mapper<Object, Text, Text, Text> {
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                String[] words = value.toString().split(" ");
                context.write(new Text(words[0]), new Text("S\t" + words[1]));
            }
        }

        public static class RMap extends Mapper<Object, Text, Text, Text> {
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                String[] words = value.toString().split(" ");
                context.write(new Text(words[1]), new Text("R\t" + words[0]));
            }
        }

        public static class Reduce extends Reducer<Text, Text, Text, Text> {
            public void reduce(Text key, Iterable<Text> values, Context context)
                    throws IOException, InterruptedException {
                for (Text val : values) {
                    String[] parts = val.toString().split("\t");
                    String a = parts[0];
                    if (a.equals("R")) {
                        for (Text val1 : values) {
                            String[] parts1 = val1.toString().split("\t");
                            String b = parts1[0];
                            if (b.equals("S")) {
                                context.write(new Text(parts[1]), new Text(parts1[1]));
                            }
                        }
                    }
                }
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            @SuppressWarnings("deprecation")
            Job job = new Job(conf, "ReduceJoin");
            job.setJarByClass(RSJoin.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            job.setReducerClass(Reduce.class);
            MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, RMap.class);
            MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, SMap.class);
            job.setOutputFormatClass(TextOutputFormat.class);
            FileOutputFormat.setOutputPath(job, new Path(args[2]));
            job.waitForCompletion(true);
        }
    }


1 answer:

Answer 0 (score: 0)

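Your join logic assumes that R values come before S values in the list of values: you only look for an S once you have seen an R. The inner loop over the values Iterable starts where the outer loop left off, so if an S comes first, your inner loop will not find it.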
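To make this concrete, here is a minimal sketch in plain Java with no Hadoop dependency. It assumes the reducer's values behave like a single-pass Iterable whose iterator() call hands back the same iterator every time. SinglePassDemo, singlePass and nestedJoin are hypothetical names used only for illustration, and the tagged strings mimic the "R\t..." and "S\t..." values from the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class SinglePassDemo {

    // Hypothetical stand-in for the reducer's values (an assumption, not Hadoop code):
    // every call to iterator() returns the same shared iterator.
    static <T> Iterable<T> singlePass(List<T> items) {
        Iterator<T> shared = items.iterator();
        return () -> shared;
    }

    // Same nested-loop shape as the reducer in the question, on plain strings.
    static List<String> nestedJoin(Iterable<String> values) {
        List<String> out = new ArrayList<>();
        for (String val : values) {
            String[] parts = val.split("\t");
            if (parts[0].equals("R")) {
                // The inner loop continues from where the outer loop stopped.
                for (String val1 : values) {
                    String[] parts1 = val1.split("\t");
                    if (parts1[0].equals("S")) {
                        out.add(parts[1] + " " + parts1[1]);
                    }
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // R value first: the inner loop still sees the S values.
        System.out.println(nestedJoin(singlePass(
                Arrays.asList("R\ta4", "S\tc3", "S\tc1")))); // prints [a4 c3, a4 c1]
        // S value first: it is consumed by the outer loop before R is seen.
        System.out.println(nestedJoin(singlePass(
                Arrays.asList("S\tc4", "R\ta3"))));          // prints []
    }
}
```

When the R value arrives first, the inner loop picks up the remaining S values. When the S value arrives first, the outer loop has already consumed it, so nothing is joined for that key.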

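If you only have one R value for multiple S values, either do a secondary sort (add "R" and "S" to the key, then add a partitioner and a grouping comparator; this is the right way), or keep a variable that holds the R value once you have found it and a list that holds S values until you find the R value (this does not really scale), and iterate over the values only once.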
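The second option, a single pass with a buffer, could look roughly like the sketch below. It is plain Java rather than a Hadoop Reducer, it assumes at most one R value per key as stated above, and BufferedJoinDemo and joinSinglePass are hypothetical names. In an actual reducer the out.add calls would become context.write calls.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class BufferedJoinDemo {

    // Joins the tagged values of one key in a single iteration.
    // Assumption: at most one "R\t..." value per key.
    static List<String> joinSinglePass(Iterable<String> values) {
        String r = null;                            // the R value, once found
        List<String> pendingS = new ArrayList<>();  // S values seen before the R value
        List<String> out = new ArrayList<>();
        for (String val : values) {
            String[] parts = val.split("\t", 2);
            if (parts[0].equals("R")) {
                r = parts[1];
                // Flush every S value that arrived before the R value.
                for (String s : pendingS) {
                    out.add(r + " " + s);
                }
                pendingS.clear();
            } else if (parts[0].equals("S")) {
                if (r == null) {
                    pendingS.add(parts[1]);         // R not seen yet: buffer
                } else {
                    out.add(r + " " + parts[1]);    // R already known: emit directly
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(joinSinglePass(
                Arrays.asList("S\tc4", "R\ta3")));                   // prints [a3 c4]
        System.out.println(joinSinglePass(
                Arrays.asList("S\tc3", "R\ta4", "S\tc1", "S\tc2"))); // prints [a4 c3, a4 c1, a4 c2]
    }
}
```

The buffer only grows while the R value has not been seen, but in the worst case it holds every S value for a key, which is why this approach does not scale as well as a secondary sort.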