I'm new to Hadoop and I'm trying to implement an algorithm that simply counts the occurrences of substrings of length x. It's long but simple.
Here is a practical example of the input: "ABCABCAGD", x=4, m=2
Map
Extract the substrings of length x (let's call them x-strings):
ABCA, BCAB, CABC, ABCA, BCAG, CAGD
For each x-string I extract its "signature", defined as the lexicographically smallest substring of length m:
AB, AB, AB, AB, AG, AG
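To make that definition concrete, here is a minimal plain-Java sketch of the signature computation (the helper name is mine, purely for illustration; my real code works on byte arrays):

// Illustrative only: lexicographically smallest substring of length m
static String signature(String xString, int m) {
    String best = xString.substring(0, m);
    for (int i = 1; i + m <= xString.length(); i++) {
        String candidate = xString.substring(i, i + m);
        if (candidate.compareTo(best) < 0) {
            best = candidate;
        }
    }
    return best;
}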
Now, for each "signature", I generate another string as follows: I concatenate the x-strings that have the same signature and are consecutive.
In the example there are 2 signatures, AB and AG. The x-strings belonging to each signature are all consecutive, so the output of my Map task is:
Key=AB; Value=ABCABCA
Key=AG; Value=BCAGD
(As you can see, for consecutive x-strings I only append the last character: the first value is the result of ABCA + B + C + A.)
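A rough sketch of this map-side grouping, using the signature helper above and java.util collections (again, only for illustration; the real mapper works on byte arrays and writes to the Hadoop context):

import java.util.*;

// Illustrative only: group consecutive x-strings sharing a signature,
// appending just the last character of each following x-string
static Map<String, List<String>> superStrings(String s, int x, int m) {
    Map<String, List<String>> out = new LinkedHashMap<>();
    String currentSig = null;
    StringBuilder current = null;
    for (int i = 0; i + x <= s.length(); i++) {
        String xString = s.substring(i, i + x);
        String sig = signature(xString, m);
        if (sig.equals(currentSig)) {
            current.append(xString.charAt(x - 1)); // same signature: append last char only
        } else {
            if (currentSig != null) {
                out.computeIfAbsent(currentSig, key -> new ArrayList<>()).add(current.toString());
            }
            currentSig = sig;
            current = new StringBuilder(xString);
        }
    }
    if (currentSig != null) {
        out.computeIfAbsent(currentSig, key -> new ArrayList<>()).add(current.toString());
    }
    return out;
}
// superStrings("ABCABCAGD", 4, 2) -> {AB=[ABCABCA], AG=[BCAGD]}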
Combine
Now I extract the x-strings again, this time from the map output, and my combiner output is:
Key=ABCA,Value=1
Key=BCAB,Value=1
Key=CABC,Value=1
Key=ABCA,Value=1
(these belong to the first map output → Key=AB; Value=ABCABCA)
Key=BCAG,Value=1
Key=CAGD,Value=1
(these belong to the second map output → Key=AG; Value=BCAGD)
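In plain (non-Hadoop) terms, the combiner step just slides a window of length x over each value and emits every x-string with a count of 1, roughly like this (purely illustrative; my real code uses the Hadoop writables shown below):

import java.util.function.BiConsumer;

// Illustrative only: emit (x-string, 1) for every window of length x
static void emitXStrings(String superString, int x, BiConsumer<String, Integer> emit) {
    for (int i = 0; i + x <= superString.length(); i++) {
        emit.accept(superString.substring(i, i + x), 1);
    }
}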
Reduce
Now I simply have to count the occurrences of each x-string (yes, that is all the algorithm does).
This should be the Reduce output:
ABCA:2
BCAB:1
CABC:1
BCAG:1
CAGD:1
The problem is that the actual output is:
ABCA:1
ABCA:1
BCAB:1
CABC:1
BCAG:1
CAGD:1
My reducer is currently very similar to WordCount: it just iterates over the values and sums them.
I'm fairly sure the Reduce task (I set up the MR job with setNumReduceTasks(1)) is somehow producing the wrong output, because it does not group all the data together.
What do you think of this structure?
I chose to do the x-string extraction in the Combiner step: is that the right place, or is it part of my problem?
Please note: because of how the algorithm works, the combiner emits more output records than it receives as input... could this be a problem?
The code (simplified, with the non-Hadoop logic stripped out):
public class StringReader extends Mapper<NullWritable, RecordInterface, LongWritable, BytesWritable> {
    public void map(NullWritable key, RecordInterface value, Context context) throws IOException, InterruptedException {
        HadoopRun.util.extractSuperKmersNew(value.getValue().getBytes(), context);
    }
}
public void extractSuperKmersNew(byte[] r1, Mapper<NullWritable, RecordInterface, LongWritable, BytesWritable>.Context context) {
    ....
    context.write(new LongWritable(current_signature), new BytesWritable(super_kmer));
    ....
}
public class Combiner extends Reducer<LongWritable, BytesWritable, LongWritable, BytesWritable> {
    protected void reduce(LongWritable arg0, Iterable<BytesWritable> arg1,
            Reducer<LongWritable, BytesWritable, LongWritable, BytesWritable>.Context context)
            throws IOException, InterruptedException {
        for (BytesWritable val : arg1) {
            extractKmers(val.getBytes(), context);
        }
    }
}
public void extractKmers(byte[] superkmer, Reducer<LongWritable, BytesWritable, LongWritable, BytesWritable>.Context arg2) {
    int end = superkmer.length - k + 1;
    // Extraction of k-strings from the aggregated strings
    for (int i = 0; i < end; i++) {
        long l = byteArrayToLong(superkmer, i);
        try {
            // quickfix to send to the reducer Key = k-string, Value = 1
            byte[] ONE = new byte[1];
            ONE[0] = 1;
            arg2.write(new LongWritable(l), new BytesWritable(ONE));
        } catch (IOException | InterruptedException e) {
        }
    }
}
public class CounterReducer extends Reducer<LongWritable, BytesWritable, Text, IntWritable> {
    protected void reduce(LongWritable kmer, Iterable<BytesWritable> count,
            Reducer<LongWritable, BytesWritable, Text, IntWritable>.Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (BytesWritable val : count) {
            sum += 1;
        }
        context.write(new Text(LongWritableToText(kmer)), new IntWritable(sum));
    }
}
public class HadoopRun extends Configured implements Tool {
    public static Utility util;

    public int run(String[] args) throws Exception {
        /* HADOOP START */
        Configuration conf = this.getConf();
        Job job = Job.getInstance(conf, "Mapping Strings");
        job.setJarByClass(HadoopRun.class);
        job.setMapperClass(StringReader.class);
        job.setCombinerClass(Combiner.class);
        job.setReducerClass(CounterReducer.class);
        job.setMapOutputKeyClass(LongWritable.class);
        job.setMapOutputValueClass(BytesWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setPartitionerClass(KPartitioner.class);
        job.setNumReduceTasks(1);
        job.setInputFormatClass(FASTAShortInputFileFormat.class);
        FASTAShortInputFileFormat.addInputPath(job, new Path(conf.get("file_in")));
        FileOutputFormat.setOutputPath(job, new Path(conf.get("file_out")));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        //...
        //managing input arguments
        //...
        CommandLineParser parser = new BasicParser();
        HelpFormatter formatter = new HelpFormatter();
        try {
            cmd = parser.parse(options, args);
        } catch (ParseException e) {
            formatter.printHelp("usage:", options);
            System.exit(1);
        }
        Integer k = Integer.parseInt(cmd.getOptionValue("k"));
        Integer m = Integer.parseInt(cmd.getOptionValue("m"));
        String file_in_string = cmd.getOptionValue("file_in");
        String file_out_string = cmd.getOptionValue("file_out");
        Configuration conf = new Configuration();
        conf.set("file_in", file_in_string);
        conf.set("file_out", file_out_string);
        util = new Utility(k, m);
        int res = ToolRunner.run(conf, new HadoopRun(), args);
        System.exit(res);
    }
}
Answer 0 (score: 0)
If you want a method that counts the occurrences of every substring of length x:
/**
*
* @param s string to get substrings from
* @param x length of substring you want
* @return hashmap with each substring being the keys and amount of occurrences being the values
*/
public static HashMap<String, Integer> countSub(String s, int x) {
    HashMap<String, Integer> hm = new HashMap<>();
    int to = s.length() - x + 1;
    hm.put(s.substring(0, x), 1);
    for (int i = 1; i < to; i++) {
        x++;
        String next = s.substring(i, x);
        boolean b = false;
        for (String key : hm.keySet()) {
            b = key.equals(next);
            // if the key already exists, increment its value
            if (b) {
                hm.put(key, hm.get(key) + 1);
                break;
            }
        }
        // else make a new key
        if (!b) hm.put(next, 1);
    }
    return hm;
}
The method returns a hash map that looks a bit like the format in your question, where each key is a substring and each value is the number of occurrences of that substring.
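For example, with the input from the question:

HashMap<String, Integer> counts = countSub("ABCABCAGD", 4);
System.out.println(counts); // {ABCA=2, BCAB=1, CABC=1, BCAG=1, CAGD=1} (HashMap iteration order is not guaranteed)

which matches the Reduce output you expect.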