Question

This is a follow-up to this question: Spark FlatMap function for huge lists

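Summary: I want to write a Spark FlatMap function in Java 8 that generates all possible regular expressions matching a set of DNA sequences. For huge strings this is problematic, because the collection of regular expressions does not fit in memory (a single mapper easily generates gigabytes of data). I understand that I have to resort to something like a lazy sequence, and I assume I have to use a Stream<String> for this. My question now is how to build this stream. I was looking here: Java Streams - Stream.Builder.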

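Once my algorithm starts generating patterns, they could be "pushed" into the Stream with the accept(String) method. But when I tried the code from the link (replacing its contents with a string generator function) and put some log statements in between, I noticed that the random string generation function is executed before build() is called. I don't understand how all those random strings would be stored if they cannot fit in memory.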
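The behaviour described above can be reproduced with a small sketch. It is only an illustration: `generate`, `eager`, `lazy` and the shared `log` list are made-up names standing in for the real pattern generator. The builder variant runs the generator for every element before `build()`, while the iterator-backed variant only runs it when the terminal operation pulls an element.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

class BuilderVsLazyDemo {
    // Records the order of events so the evaluation order can be inspected.
    static final List<String> log = new ArrayList<>();

    // Hypothetical stand-in for the real pattern generator.
    static String generate(int i) {
        log.add("generate " + i);
        return "pattern" + i;
    }

    // Eager: every element is generated and buffered before build() is called.
    static Stream<String> eager(int n) {
        Stream.Builder<String> builder = Stream.builder();
        for (int i = 0; i < n; i++) {
            builder.accept(generate(i));
        }
        log.add("build");
        return builder.build();
    }

    // Lazy: an element is generated only when the stream asks for it.
    static Stream<String> lazy(int n) {
        Iterator<String> it = new Iterator<String>() {
            private int next = 0;
            @Override public boolean hasNext() { return next < n; }
            @Override public String next() { return generate(next++); }
        };
        log.add("build");
        return StreamSupport.stream(
                Spliterators.spliteratorUnknownSize(it, Spliterator.ORDERED), false);
    }

    public static void main(String[] args) {
        eager(2).forEach(s -> log.add("consume " + s));
        System.out.println(log); // generation happens before "build"
        log.clear();
        lazy(2).forEach(s -> log.add("consume " + s));
        System.out.println(log); // generation is interleaved with consumption
    }
}
```

So Stream.Builder is a buffer, not a lazy source: it is only suitable when all elements fit in memory.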

Do I have to build the stream in a different way? Basically, I want the equivalent of the context.write(substring) call in my MapReduce Mapper.map function.

UPDATE1: I cannot use the range function; I am actually using a structure that iterates over a suffix tree.

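UPDATE2: A more complete implementation, as requested. I did not replace the interfaces with the actual implementation, because the implementation is very large and not essential for grasping the idea.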

More complete problem sketch

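My algorithm tries to discover patterns in DNA sequences. It takes sequences from different organisms that correspond to the same gene. Say I have a gene A in maize, and the same gene A in rice and some other species; I then compare their upstream sequences. The patterns I am looking for resemble regular expressions, for example TGA..GA..GA. To explore all possible patterns, I build a generalized suffix tree from the sequences. This tree provides information about the different sequences in which a pattern occurs. To decouple the tree from the search algorithm, I implemented a kind of iterator structure: TreeNavigator. It has the following interface: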

interface TreeNavigator {
    public void jumpTo(char c); // go from pattern p to p+c (c can be a dot from a regex or [AC] for example)
    public void backtrack(); // pop the last character
    public List<Position> getMatches();
    public Pattern trail(); // current pattern p
}

interface SearchSpace {
    // degrees of freedom in regex, min and max length, ...
    public boolean inSearchSpace(Pattern p);
    public Alphabet getPatternAlphabet();
}

interface ScoreCalculator {
    // calculate a score, approximately equal to the number of occurrences of the pattern
    public Score calcConservationScore(TreeNavigator t);
}

// Motif algorithm code which is run in the MapReduce Mapper function:
public class DiscoveryAlgorithm {
    private Context context; // MapReduce context object to write to disk
    private Score minScore;
    // The fields below are assumed to be supplied elsewhere; their implementations are omitted.
    private SearchSpace searchSpace;
    private ScoreCalculator scoreCalculator;
    private TreeNavigator root; // navigator positioned at the root of the suffix tree

    public void runDiscovery() {
        // depth-first traversal of pattern space: A, AA, AAA, ... AAC, ACA, ACC and so forth
        exploreSubTree(root);
    }

    // branch and bound for pattern space: if a pattern occurs too rarely, stop searching
    public boolean survivesBnB(Score s) {
        return s.compareTo(minScore) >= 0;
    }

    public void exploreSubTree(TreeNavigator nav) {
        Pattern current = nav.trail();
        Score currentScore = scoreCalculator.calcConservationScore(nav);

        if (!survivesBnB(currentScore)) {
            return;
        }

        if (searchSpace.inSearchSpace(current)) {
            context.write(current);
        }

        // iterate over all possible extensions: A, C, G, T, [AC], [AG], ... [ACGT]
        // (assumes Alphabet is Iterable<Character>)
        for (Character c : searchSpace.getPatternAlphabet()) {
            nav.jumpTo(c);
            exploreSubTree(nav);
            nav.backtrack();
        }
    }
}

FULL MapReduce SOURCE @ https://github.com/drdwitte/CloudSpeller/ Related research paper: http://www.ncbi.nlm.nih.gov/pubmed/26254488

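UPDATE3: I continued reading about ways to create a Stream. From what I have read so far, I think I have to rewrite my runDiscovery() into a Supplier. This Supplier can then be converted into a Stream through the StreamSupport class.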
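One way this idea could look, as a rough sketch rather than the actual CloudSpeller code: the recursion is replaced by an explicit stack inside an Iterator, and the Iterator is wrapped with StreamSupport. Everything here is a simplifying assumption: patterns are plain Strings, the alphabet is a `char[]`, and the `survives` predicate stands in for the branch-and-bound check. Only the stack of pending patterns is held in memory, never the full result set.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.function.Predicate;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

class LazyPatternStream {

    // Depth-first (pre-order) iterator over all surviving patterns up to maxLength.
    // Invariant: every pattern on the stack has already passed the survives check.
    static Iterator<String> patterns(char[] alphabet, int maxLength, Predicate<String> survives) {
        Deque<String> stack = new ArrayDeque<>();
        // Seed with single-character patterns, pushed in reverse so the first letter pops first.
        for (int i = alphabet.length - 1; i >= 0; i--) {
            String seed = String.valueOf(alphabet[i]);
            if (maxLength >= 1 && survives.test(seed)) stack.push(seed);
        }
        return new Iterator<String>() {
            @Override public boolean hasNext() { return !stack.isEmpty(); }

            @Override public String next() {
                String current = stack.pop(); // throws NoSuchElementException when exhausted
                if (current.length() < maxLength) {
                    // Push surviving extensions; pruned subtrees are never expanded.
                    for (int i = alphabet.length - 1; i >= 0; i--) {
                        String child = current + alphabet[i];
                        if (survives.test(child)) stack.push(child);
                    }
                }
                return current;
            }
        };
    }

    // Wraps an iterator in a sequential, lazily evaluated stream.
    static <T> Stream<T> stream(Iterator<T> it) {
        return StreamSupport.stream(
                Spliterators.spliteratorUnknownSize(it, Spliterator.ORDERED), false);
    }

    public static void main(String[] args) {
        char[] acgt = {'A', 'C', 'G', 'T'};
        // The pattern space is huge, but only the first five patterns are ever generated.
        stream(patterns(acgt, 30, p -> true)).limit(5).forEach(System.out::println);
    }
}
```

The search-space check from exploreSubTree could then be applied as a `filter` on the resulting stream. As far as I know, FlatMapFunction in Spark 2.x returns an Iterator, so an iterator like this could also be handed back directly without going through a Stream; treat that as something to verify against the Spark version in use.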

Answer 1

An alternative to @LukasEder's solution which, I believe, is more efficient:

IntStream.range(0, string.length())
    .mapToObj(start -> IntStream.rangeClosed(start+1, string.length())
            .mapToObj(end -> string.substring(start, end)))
    .flatMap(Function.identity())
    .forEach(System.out::println);
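For reuse, the same pipeline can be wrapped in a method that returns the stream instead of printing it. This is just a repackaging of the snippet above; the class and method names are made up for the example.

```java
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import java.util.stream.Stream;

class SubstringStream {
    // All non-empty substrings of s, ordered by start index and then by end index.
    static Stream<String> substrings(String s) {
        return IntStream.range(0, s.length())
                .mapToObj(start -> IntStream.rangeClosed(start + 1, s.length())
                        .mapToObj(end -> s.substring(start, end)))
                .flatMap(Function.identity());
    }

    public static void main(String[] args) {
        // prints [a, ab, abc, b, bc, c]
        System.out.println(substrings("abc").collect(Collectors.toList()));
    }
}
```

A string of length n has n(n+1)/2 non-empty substrings, so the output grows quadratically with the input length.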

Update: a benchmark was requested, so here it is (Java 8u45, x64, for string lengths 10, 100, 1000):

Benchmark                  (len)  Mode  Cnt      Score     Error  Units
SubstringTest.LukasEder       10  avgt   30      1.947 ±   0.012  us/op
SubstringTest.LukasEder      100  avgt   30    151.660 ±   0.524  us/op
SubstringTest.LukasEder     1000  avgt   30  52405.761 ± 183.921  us/op
SubstringTest.TagirValeev     10  avgt   30      1.712 ±   0.018  us/op
SubstringTest.TagirValeev    100  avgt   30    138.179 ±   5.063  us/op
SubstringTest.TagirValeev   1000  avgt   30  48188.499 ± 107.321  us/op

Well, @LukasEder's solution is only 8-13% slower, which is probably not that much.

Building a Stream that does not fit in memory
