I am trying to write a MapReduce program that joins two text files, but I only get output for one of the keys. For example, if I have a file S.txt with the data
a4 b3
a3 b4
and another file R.txt with the data
b3 c3
b3 c1
b3 c2
b4 c4
I get the output
a4 c2
a4 c1
a4 c3
If S.txt has
a3 b4
and R.txt has
b4 c4
the output is
a3 c4.
This is my program:
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class RSJoin {

    // Emits (first column, "S" + tab + second column) for each input line.
    public static class SMap extends Mapper<Object, Text, Text, Text> {
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            String[] words = value.toString().split(" ");
            context.write(new Text(words[0]), new Text("S\t" + words[1]));
        }
    }

    // Emits (second column, "R" + tab + first column) for each input line.
    public static class RMap extends Mapper<Object, Text, Text, Text> {
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            String[] words = value.toString().split(" ");
            context.write(new Text(words[1]), new Text("R\t" + words[0]));
        }
    }

    public static class Reduce extends Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
            for (Text val : values) {
                String[] parts = val.toString().split("\t");
                String a = parts[0];
                if (a.equals("R")) {
                    for (Text val1 : values) {
                        String[] parts1 = val1.toString().split("\t");
                        String b = parts1[0];
                        if (b.equals("S")) {
                            context.write(new Text(parts[1]), new Text(parts1[1]));
                        }
                    }
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        @SuppressWarnings("deprecation")
        Job job = new Job(conf, "ReduceJoin");
        job.setJarByClass(RSJoin.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setReducerClass(Reduce.class);
        // args[0] is read by RMap and args[1] by SMap.
        MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, RMap.class);
        MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, SMap.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path(args[2]));
        job.waitForCompletion(true);
    }
}
Answer 0 (score: 0)
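Your join logic assumes that the R value comes before the S values in the list of values. You only look for an S once you have seen an R. The inner loop over the values Iterable starts where the outer loop left off, so if an S value comes first, your inner loop never finds it.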
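The effect can be reproduced without a cluster. The following is only a sketch in plain Java: `SinglePassDemo`, `singlePass` and `nestedLoopJoin` are hypothetical names, and `singlePass` is a stand-in that imitates the behaviour described above (every call to `iterator()` hands back the same iterator), not Hadoop's actual class.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for the reducer's values: iterator() always returns
// the same underlying iterator, so a nested for-each resumes instead of restarting.
class SinglePassDemo {

    static <T> Iterable<T> singlePass(List<T> items) {
        final Iterator<T> shared = items.iterator();
        return () -> shared;
    }

    // Same nested-loop shape as the reduce method in the question.
    static List<String> nestedLoopJoin(Iterable<String> values) {
        List<String> out = new ArrayList<>();
        for (String val : values) {
            String[] parts = val.split("\t");
            if (parts[0].equals("R")) {
                // Continues after val; it does not rewind to the first value.
                for (String val1 : values) {
                    String[] parts1 = val1.split("\t");
                    if (parts1[0].equals("S")) {
                        out.add(parts[1] + " " + parts1[1]);
                    }
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // R arrives first: the S values are still ahead of the inner loop.
        System.out.println(nestedLoopJoin(singlePass(
                Arrays.asList("R\ta4", "S\tc3", "S\tc1")))); // prints [a4 c3, a4 c1]
        // S arrives first: the outer loop consumes it before R is seen.
        System.out.println(nestedLoopJoin(singlePass(
                Arrays.asList("S\tc4", "R\ta3"))));          // prints []
    }
}
```

The second call mirrors the missing `a3 c4` row from the question: the pair is lost only because of the order in which the two values reach the reducer.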
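If you have only one R value for several S values, either do a secondary sort (add "R" and "S" to the key, then add a partitioner and a grouping comparator; this is the right way), or keep a variable that holds the R value once you find it and a list that holds the S values seen before the R value (this does not really scale), so that you iterate over the values only once.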
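The second option (one variable for the R value plus a buffer for early S values) might look roughly like this. It is a sketch under assumptions: `BufferedJoin` and `join` are hypothetical names, the values are plain strings rather than `Text`, there is at most one R value per key, and results are collected in a list where the real reducer would call `context.write`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class BufferedJoin {

    // Single pass over the values of one key; assumes at most one R value per key.
    static List<String> join(Iterable<String> values) {
        String r = null;                            // R value, once it has been seen
        List<String> pendingS = new ArrayList<>();  // S values seen before the R value
        List<String> out = new ArrayList<>();
        for (String val : values) {
            String[] parts = val.split("\t");
            if (parts[0].equals("R")) {
                r = parts[1];
                // Flush the S values that arrived before R.
                for (String s : pendingS) {
                    out.add(r + " " + s);
                }
                pendingS.clear();
            } else if (parts[0].equals("S")) {
                if (r != null) {
                    out.add(r + " " + parts[1]);    // R already known: emit directly
                } else {
                    pendingS.add(parts[1]);         // hold until R shows up
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(join(Arrays.asList("S\tc4", "R\ta3")));          // prints [a3 c4]
        System.out.println(join(Arrays.asList("S\tc3", "R\ta4", "S\tc1"))); // prints [a4 c3, a4 c1]
    }
}
```

When porting this into the reducer, buffer `val.toString()` results (as the split above effectively does) rather than the `Text` objects themselves, because Hadoop reuses the same value instance across iterations. The buffer lives in memory, which is why this variant does not scale to keys with very many S values; the secondary sort avoids the buffer by guaranteeing that the R value arrives first.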