Distributed substring counting

Time: 2016-08-10 23:06:10

Tags: java string algorithm hadoop

I am new to Hadoop and I am trying to implement an algorithm that simply counts the occurrences of substrings of length x. The explanation is long, but the algorithm itself is simple.

Here is a practical example of an input: "ABCABCAGD", x=4, m=2

Map

  1. Extract the substrings of length x (I call them x-strings):

    ABCA, BCAB, CABC, ABCA, BCAG, CAGD
    
  2. For each x-string I extract its "signature", defined as its lexicographically smallest substring of length m:

    AB, AB, AB, AB, AG, AG
    
  3. Now for each "signature" I generate another string as follows:
    I concatenate consecutive x-strings that share the same signature. In the example there are 2 signatures, AB and AG. The x-strings belonging to each signature are consecutive, so the output of my Map task is:

    Key=AB; Value=ABCABCA
    Key=AG; Value=BCAGD
    

    (As you can see, for consecutive x-strings I only append the last character, so the first value is the result of ABCA + B + C + A.)

  4. Now I extract the x-strings again from the Map output (a plain-Java sketch of steps 1-4 follows this list), and my Combiner output is:

      Key=ABCA,Value=1
      Key=BCAB,Value=1
      Key=CABC,Value=1
      Key=ABCA,Value=1
      

      (these belong to the first Map output -> Key=AB; Value=ABCABCA)

      Key=BCAG,Value=1
      Key=CAGD,Value=1
      

      (these belong to the second Map output -> Key=AG; Value=BCAGD)
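
To make steps 1-4 concrete, here is a minimal plain-Java sketch of the same logic outside Hadoop (the class and method names are only illustrative, not the ones used in my job):

    import java.util.AbstractMap;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    // Plain-Java sketch of steps 1-4 (illustrative names; not the actual Hadoop classes below).
    public class SketchOfMapAndCombiner {

        // Step 2: the signature is the lexicographically smallest substring of length m.
        static String signature(String xString, int m) {
            String best = xString.substring(0, m);
            for (int i = 1; i + m <= xString.length(); i++) {
                String candidate = xString.substring(i, i + m);
                if (candidate.compareTo(best) < 0) {
                    best = candidate;
                }
            }
            return best;
        }

        // Steps 1 and 3: extract the x-strings and merge consecutive ones that share a
        // signature, appending only the last character of each new x-string.
        static List<Map.Entry<String, String>> mapStep(String s, int x, int m) {
            List<Map.Entry<String, String>> out = new ArrayList<>();
            String currentSignature = null;
            StringBuilder superString = new StringBuilder();
            for (int i = 0; i + x <= s.length(); i++) {
                String xString = s.substring(i, i + x);
                String sig = signature(xString, m);
                if (sig.equals(currentSignature)) {
                    superString.append(xString.charAt(x - 1));
                } else {
                    if (currentSignature != null) {
                        out.add(new AbstractMap.SimpleEntry<>(currentSignature, superString.toString()));
                    }
                    currentSignature = sig;
                    superString = new StringBuilder(xString);
                }
            }
            if (currentSignature != null) {
                out.add(new AbstractMap.SimpleEntry<>(currentSignature, superString.toString()));
            }
            return out;
        }

        // Step 4: every window of length x inside a merged value is one x-string occurrence.
        static List<String> extractXStrings(String superString, int x) {
            List<String> xStrings = new ArrayList<>();
            for (int i = 0; i + x <= superString.length(); i++) {
                xStrings.add(superString.substring(i, i + x));
            }
            return xStrings;
        }

        public static void main(String[] args) {
            // Reproduces the example: Key=AB; Value=ABCABCA and Key=AG; Value=BCAGD,
            // then ABCA, BCAB, CABC, ABCA and BCAG, CAGD.
            for (Map.Entry<String, String> e : mapStep("ABCABCAGD", 4, 2)) {
                System.out.println("Key=" + e.getKey() + "; Value=" + e.getValue()
                        + " -> " + extractXStrings(e.getValue(), 4));
            }
        }
    }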

Reduce

  1. Now I should just count the occurrences of each x-string (yes, the algorithm really just does that).

    This should be the Reduce output:

        ABCA:2
        BCAB:1
        CABC:1
        BCAG:1
        CAGD:1
        

    The problem is that the output is:

        ABCA:1
        ABCA:1
        BCAB:1
        CABC:1
        BCAG:1
        CAGD:1
        
  2. My reducer is currently very similar to the WordCount one: it just iterates over the values and sums them. I am fairly sure the Reduce task (I set up the MR job with setNumReduceTasks(1)) is somehow producing the wrong output, because it does not put all the data together. A small plain-Java model of what the Reduce step should compute follows this list.
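
This is the minimal plain-Java model just mentioned (the class name ReduceModel is only illustrative): it groups the Combiner's (x-string, 1) records by key and sums them, which is what I expect the shuffle plus my reducer to do.

    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Plain-Java model of the Reduce step: group the (x-string, 1) records by key and sum.
    public class ReduceModel {
        public static void main(String[] args) {
            // The (x-string, 1) records emitted by the Combiner in the example above.
            List<String> xStrings = Arrays.asList("ABCA", "BCAB", "CABC", "ABCA", "BCAG", "CAGD");
            Map<String, Integer> counts = new HashMap<>();
            for (String xString : xStrings) {
                counts.merge(xString, 1, Integer::sum);
            }
            // Expected: ABCA=2, BCAB=1, CABC=1, BCAG=1, CAGD=1
            counts.forEach((k, v) -> System.out.println(k + ":" + v));
        }
    }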

What do you think of this structure?

I chose to do the x-string extraction in the Combiner step; is that the right place, or is that part of my problem?

Please note: because of how the algorithm works, the Combiner outputs more records than it receives as input... could that be a problem?

Code (simplified from the non-Hadoop logic):

        public class StringReader extends Mapper<NullWritable, RecordInterface, LongWritable, BytesWritable> {
            public void map(NullWritable key, RecordInterface value, Context context) throws IOException, InterruptedException {
                    HadoopRun.util.extractSuperKmersNew(value.getValue().getBytes(), context);
            }
        }
        
        
        
        public void extractSuperKmersNew(byte[] r1, Mapper<NullWritable, RecordInterface, LongWritable, BytesWritable>.Context context) {
        
        ....
            context.write(new LongWritable(current_signature),new BytesWritable(super_kmer));
        ....
        }
        
        
        public class Combiner extends Reducer<LongWritable, BytesWritable, LongWritable, BytesWritable> {
        
            protected void reduce(LongWritable arg0, Iterable<BytesWritable> arg1,
                    Reducer<LongWritable, BytesWritable, LongWritable, BytesWritable>.Context context)
                    throws IOException, InterruptedException {
                for (BytesWritable val : arg1) {
                    // each value is a "super string": expand it back into (x-string, 1) records
                    extractKmers(val.getBytes(), context);
                }
            }
        }
        
        
        public void extractKmers(byte[] superkmer, Reducer<LongWritable, BytesWritable, LongWritable, BytesWritable>.Context arg2) {

            int end = superkmer.length - k + 1;

            // Extraction of the k-strings from the aggregated ("super") strings
            for (int i = 0; i < end; i++) {
                long l = byteArrayToLong(superkmer, i);
                try {
                    // quick fix to send to the reducer: Key = k-string, Value = 1
                    byte[] ONE = new byte[] { 1 };
                    arg2.write(new LongWritable(l), new BytesWritable(ONE));
                } catch (IOException | InterruptedException e) {
                    // the exception is swallowed here; it should at least be logged
                }
            }
        }
        
        
        
        public class CounterReducer extends Reducer<LongWritable, BytesWritable, Text, IntWritable> {
        
            protected void reduce(LongWritable kmer, Iterable<BytesWritable> count,
                    Reducer<LongWritable, BytesWritable, Text, IntWritable>.Context context)
                    throws IOException, InterruptedException {
        
                // WordCount-style sum: one "1" per occurrence of this k-string
                int sum = 0;
                for (BytesWritable val : count) {
                    sum += 1;
                }
                context.write(new Text(LongWritableToText(kmer)), new IntWritable(sum));
        
            }
        }
        
        
        public class HadoopRun extends Configured implements Tool {
        
            public static Utility util;
        
            public int run(String[] args) throws Exception {
                /* HADOOP START */
                Configuration conf = this.getConf();
                Job job = Job.getInstance(conf, "Mapping Strings");
                job.setJarByClass(HadoopRun.class);
                job.setMapperClass(StringReader.class);
                job.setCombinerClass(Combiner.class);
                job.setReducerClass(CounterReducer.class);
                job.setMapOutputKeyClass(LongWritable.class);
                job.setMapOutputValueClass(BytesWritable.class);
        
                job.setOutputKeyClass(Text.class);
                job.setOutputValueClass(IntWritable.class);
                job.setOutputFormatClass(TextOutputFormat.class); 
                job.setPartitionerClass(KPartitioner.class); 
        
        
                job.setNumReduceTasks(1);
                job.setInputFormatClass(FASTAShortInputFileFormat.class);
                FASTAShortInputFileFormat.addInputPath(job, new Path(conf.get("file_in")));
                FileOutputFormat.setOutputPath(job, new Path(conf.get("file_out")));
                return job.waitForCompletion(true) ? 0 : 1;
            }
        
        public static void main(String[] args) throws Exception {
        
                //...
                //managing input arguments
                //...
        
                CommandLineParser parser = new BasicParser();
                HelpFormatter formatter = new HelpFormatter();
                try {
                    cmd = parser.parse(options, args);
                } catch (ParseException e) {
                    formatter.printHelp("usage:", options);
                    System.exit(1);
                }
                Integer k = Integer.parseInt(cmd.getOptionValue("k"));
                Integer m = Integer.parseInt(cmd.getOptionValue("m"));
                String file_in_string = cmd.getOptionValue("file_in");
                String file_out_string = cmd.getOptionValue("file_out");
                Configuration conf = new Configuration();
                conf.set("file_in", file_in_string);
                conf.set("file_out", file_out_string);
                util = new Utility(k, m);
                int res = ToolRunner.run(conf, new HadoopRun(), args);
        
                System.exit(res);
            }
        
        }
        

1 answer:

Answer 0: (score: 0)

If you want a method that counts the occurrences of every substring of length x:

    /**
     * 
     * @param s string to get substrings from 
     * @param x length of substring you want
     * @return hashmap with each substring being the keys and amount of occurrences being the values 
     */
    public static HashMap<String, Integer> countSub(String s, int x){
        HashMap<String, Integer> hm = new HashMap<>();
        int to = s.length()-x+1;
        hm.put(s.substring(0,x),1);


        for(int i=1;i<to;i++){
            x++;
            String next = s.substring(i,x);
            boolean b = false;

            for (String key : hm.keySet()) {
                b = key.equals(next);
                //if key already exists increment value
                if(b) {
                    hm.put(key,hm.get(key)+1);
                    break;
                }
            }
            //else make new key
            if(!b) hm.put(next,1);
        }
        return hm;
    }

This method returns a hash map that looks somewhat like the format in your question, where each key is a substring and each value is the number of occurrences of that substring.
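
For example, a quick usage sketch on the sample input from the question (assuming the countSub method above is in scope):

    HashMap<String, Integer> counts = countSub("ABCABCAGD", 4);
    System.out.println(counts);
    // prints something like {CABC=1, BCAG=1, ABCA=2, CAGD=1, BCAB=1} (HashMap iteration order is unspecified)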