Distributed substring counting

Time: 2016-08-10 23:06:10

Tags: java string algorithm hadoop

I am new to Hadoop and I am trying to implement an algorithm that simply counts the occurrences of substrings of length x. The explanation is long, but the algorithm itself is simple.

Here is a practical example of an input: "ABCABCAGD", x=4, m=2

Map

  1. Extract the substrings of length x (I call them x-strings):

    ABCA, BCAB, CABC, ABCA, BCAG, CAGD
    
  2. For each x-string I extract its "signature", defined as its lexicographically smallest substring of length m:

    AB, AB, AB, AB, AG, AG
    
  3. Now for each "signature" I generate another string as follows:
    I concatenate consecutive x-strings that share the same signature. In the example there are 2 signatures, AB and AG. The x-strings belonging to each signature are consecutive, so the output of my Map task is:

    Key=AB; Value=ABCABCA
    Key=AG; Value=BCAGD
    

    (As you can see, for consecutive x-strings I only append the last character, so the first value is the result of ABCA + B + C + A.)

  4. Now I extract the x-strings again from the Map output (a plain-Java sketch of steps 1-4 follows this list), and my Combiner output is:

      Key=ABCA,Value=1
      Key=BCAB,Value=1
      Key=CABC,Value=1
      Key=ABCA,Value=1
      

      (these belong to the first Map output -> Key=AB; Value=ABCABCA)

      Key=BCAG,Value=1
      Key=CAGD,Value=1
      

      (these belong to the second Map output -> Key=AG; Value=BCAGD)
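
To make steps 1-4 concrete, here is a minimal plain-Java sketch of the same logic outside Hadoop (the class and method names are only illustrative, not the ones used in my job):

    import java.util.AbstractMap;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    // Plain-Java sketch of steps 1-4 (illustrative names; not the actual Hadoop classes below).
    public class SketchOfMapAndCombiner {

        // Step 2: the signature is the lexicographically smallest substring of length m.
        static String signature(String xString, int m) {
            String best = xString.substring(0, m);
            for (int i = 1; i + m <= xString.length(); i++) {
                String candidate = xString.substring(i, i + m);
                if (candidate.compareTo(best) < 0) {
                    best = candidate;
                }
            }
            return best;
        }

        // Steps 1 and 3: extract the x-strings and merge consecutive ones that share a
        // signature, appending only the last character of each new x-string.
        static List<Map.Entry<String, String>> mapStep(String s, int x, int m) {
            List<Map.Entry<String, String>> out = new ArrayList<>();
            String currentSignature = null;
            StringBuilder superString = new StringBuilder();
            for (int i = 0; i + x <= s.length(); i++) {
                String xString = s.substring(i, i + x);
                String sig = signature(xString, m);
                if (sig.equals(currentSignature)) {
                    superString.append(xString.charAt(x - 1));
                } else {
                    if (currentSignature != null) {
                        out.add(new AbstractMap.SimpleEntry<>(currentSignature, superString.toString()));
                    }
                    currentSignature = sig;
                    superString = new StringBuilder(xString);
                }
            }
            if (currentSignature != null) {
                out.add(new AbstractMap.SimpleEntry<>(currentSignature, superString.toString()));
            }
            return out;
        }

        // Step 4: every window of length x inside a merged value is one x-string occurrence.
        static List<String> extractXStrings(String superString, int x) {
            List<String> xStrings = new ArrayList<>();
            for (int i = 0; i + x <= superString.length(); i++) {
                xStrings.add(superString.substring(i, i + x));
            }
            return xStrings;
        }

        public static void main(String[] args) {
            // Reproduces the example: Key=AB; Value=ABCABCA and Key=AG; Value=BCAGD,
            // then ABCA, BCAB, CABC, ABCA and BCAG, CAGD.
            for (Map.Entry<String, String> e : mapStep("ABCABCAGD", 4, 2)) {
                System.out.println("Key=" + e.getKey() + "; Value=" + e.getValue()
                        + " -> " + extractXStrings(e.getValue(), 4));
            }
        }
    }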

Reduce

  1. Now I should just count the occurrences of each x-string (yes, the algorithm really just does that).

    This should be the Reduce output:

        ABCA:2
        BCAB:1
        CABC:1
        BCAG:1
        CAGD:1
        

    The problem is that the output is:

        ABCA:1
        ABCA:1
        BCAB:1
        CABC:1
        BCAG:1
        CAGD:1
        
  2. My reducer is currently very similar to the WordCount one: it just iterates over the values and sums them. I am fairly sure the Reduce task (I set up the MR job with setNumReduceTasks(1)) is somehow producing the wrong output, because it does not put all the data together. A small plain-Java model of what the Reduce step should compute follows this list.
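
This is the minimal plain-Java model just mentioned (the class name ReduceModel is only illustrative): it groups the Combiner's (x-string, 1) records by key and sums them, which is what I expect the shuffle plus my reducer to do.

    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Plain-Java model of the Reduce step: group the (x-string, 1) records by key and sum.
    public class ReduceModel {
        public static void main(String[] args) {
            // The (x-string, 1) records emitted by the Combiner in the example above.
            List<String> xStrings = Arrays.asList("ABCA", "BCAB", "CABC", "ABCA", "BCAG", "CAGD");
            Map<String, Integer> counts = new HashMap<>();
            for (String xString : xStrings) {
                counts.merge(xString, 1, Integer::sum);
            }
            // Expected: ABCA=2, BCAB=1, CABC=1, BCAG=1, CAGD=1
            counts.forEach((k, v) -> System.out.println(k + ":" + v));
        }
    }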

What do you think of this structure?

I chose to do the x-string extraction in the Combiner step; is that the right place, or is that part of my problem?

Please note: because of how the algorithm works, the Combiner outputs more records than it receives as input... could that be a problem?

Code (simplified from the non-Hadoop logic):

        public class StringReader extends Mapper<NullWritable, RecordInterface, LongWritable, BytesWritable> {
            public void map(NullWritable key, RecordInterface value, Context context) throws IOException, InterruptedException {
                    HadoopRun.util.extractSuperKmersNew(value.getValue().getBytes(), context);
            }
        }
        
        
        
        public void extractSuperKmersNew(byte[] r1, Mapper<NullWritable, RecordInterface, LongWritable, BytesWritable>.Context context) {
        
        ....
            context.write(new LongWritable(current_signature),new BytesWritable(super_kmer));
        ....
        }
        
        
        public class Combiner extends Reducer<LongWritable, BytesWritable, LongWritable, BytesWritable> {
        
            protected void reduce(LongWritable arg0, Iterable<BytesWritable> arg1,
                    Reducer<LongWritable, BytesWritable, LongWritable, BytesWritable>.Context context)
                    throws IOException, InterruptedException {
                for (BytesWritable val : arg1) {
                    // each value is a "super string": expand it back into (x-string, 1) records
                    extractKmers(val.getBytes(), context);
                }
            }
        }
        
        
        public void extractKmers(byte[] superkmer, Reducer<LongWritable, BytesWritable, LongWritable, BytesWritable>.Context arg2) {

            int end = superkmer.length - k + 1;

            // Extraction of the k-strings from the aggregated ("super") strings
            for (int i = 0; i < end; i++) {
                long l = byteArrayToLong(superkmer, i);
                try {
                    // quick fix to send to the reducer: Key = k-string, Value = 1
                    byte[] ONE = new byte[] { 1 };
                    arg2.write(new LongWritable(l), new BytesWritable(ONE));
                } catch (IOException | InterruptedException e) {
                    // the exception is swallowed here; it should at least be logged
                }
            }
        }
        
        
        
        public class CounterReducer extends Reducer<LongWritable, BytesWritable, Text, IntWritable> {
        
            protected void reduce(LongWritable kmer, Iterable<BytesWritable> count,
                    Reducer<LongWritable, BytesWritable, Text, IntWritable>.Context context)
                    throws IOException, InterruptedException {
        
                // WordCount-style sum: one "1" per occurrence of this k-string
                int sum = 0;
                for (BytesWritable val : count) {
                    sum += 1;
                }
                context.write(new Text(LongWritableToText(kmer)), new IntWritable(sum));
        
            }
        }
        
        
        public class HadoopRun extends Configured implements Tool {
        
            public static Utility util;
        
            public int run(String[] args) throws Exception {
                /* HADOOP START */
                Configuration conf = this.getConf();
                Job job = Job.getInstance(conf, "Mapping Strings");
                job.setJarByClass(HadoopRun.class);
                job.setMapperClass(StringReader.class);
                job.setCombinerClass(Combiner.class);
                job.setReducerClass(CounterReducer.class);
                job.setMapOutputKeyClass(LongWritable.class);
                job.setMapOutputValueClass(BytesWritable.class);
        
                job.setOutputKeyClass(Text.class);
                job.setOutputValueClass(IntWritable.class);
                job.setOutputFormatClass(TextOutputFormat.class); 
                job.setPartitionerClass(KPartitioner.class); 
        
        
                job.setNumReduceTasks(1);
                job.setInputFormatClass(FASTAShortInputFileFormat.class);
                FASTAShortInputFileFormat.addInputPath(job, new Path(conf.get("file_in")));
                FileOutputFormat.setOutputPath(job, new Path(conf.get("file_out")));
                return job.waitForCompletion(true) ? 0 : 1;
            }
        
        public static void main(String[] args) throws Exception {
        
                //...
                //managing input arguments
                //...
        
                CommandLineParser parser = new BasicParser();
                HelpFormatter formatter = new HelpFormatter();
                try {
                    cmd = parser.parse(options, args);
                } catch (ParseException e) {
                    formatter.printHelp("usage:", options);
                    System.exit(1);
                }
                Integer k = Integer.parseInt(cmd.getOptionValue("k"));
                Integer m = Integer.parseInt(cmd.getOptionValue("m"));
                String file_in_string = cmd.getOptionValue("file_in");
                String file_out_string = cmd.getOptionValue("file_out");
                Configuration conf = new Configuration();
                conf.set("file_in", file_in_string);
                conf.set("file_out", file_out_string);
                util = new Utility(k, m);
                int res = ToolRunner.run(conf, new HadoopRun(), args);
        
                System.exit(res);
            }
        
        }
        

1 answer:

Answer 0: (score: 0)

If you want a method that counts the occurrences of every substring of length x:

    /**
     * 
     * @param s string to get substrings from 
     * @param x length of substring you want
     * @return hashmap with each substring being the keys and amount of occurrences being the values 
     */
    public static HashMap<String, Integer> countSub(String s, int x){
        HashMap<String, Integer> hm = new HashMap<>();
        int to = s.length()-x+1;
        hm.put(s.substring(0,x),1);


        for(int i=1;i<to;i++){
            x++;
            String next = s.substring(i,x);
            boolean b = false;

            for (String key : hm.keySet()) {
                b = key.equals(next);
                //if key already exists increment value
                if(b) {
                    hm.put(key,hm.get(key)+1);
                    break;
                }
            }
            //else make new key
            if(!b) hm.put(next,1);
        }
        return hm;
    }

This method returns a hash map that looks somewhat like the format in your question, where each key is a substring and each value is the number of occurrences of that substring.
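
For example, a quick usage sketch on the sample input from the question (assuming the countSub method above is in scope):

    HashMap<String, Integer> counts = countSub("ABCABCAGD", 4);
    System.out.println(counts);
    // prints something like {CABC=1, BCAG=1, ABCA=2, CAGD=1, BCAB=1} (HashMap iteration order is unspecified)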