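Comparing RDD objects - Apache Spark

Date: 2017-02-24 16:30:43

Tags: java performance apache-spark rdd bigdata

I am new to Apache Spark and have run into a problem while trying to analyze data that I extract from a file.

I have a large amount of gene information, which I load into an RDD. So far, so good.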

JavaRDD<Gene> inputfile = sc.textFile(logFile).map(
        new Function<String, Gene>() {
            @Override
            public Gene call(String line) throws Exception {
                String[] values = line.split("\t");
                Gene gen = null;

                // We are only interested in genes; values[8] is read below, so require at least 9 columns
                if( values.length > 8 && values[2].equalsIgnoreCase("gene") && !line.contains("#")){
                    String[] infoGene = values[8].split(";");

                    String geneId = StringUtils.substringBetween(infoGene[0], "\"");
                    String geneType = StringUtils.substringBetween(infoGene[2], "\"");
                    String geneName = StringUtils.substringBetween(infoGene[4], "\"");
                    gen = new Gene(geneName,values[3],values[4]);

                    return gen;
                }
                return gen;
            }
        }
    ).filter(new Function<Gene, Boolean>() {
        @Override
        public Boolean call(Gene gene) throws Exception {
            // Drop the null placeholders returned for non-gene lines
            return gene != null;
        }
    });

The Gene class:

public class Gene implements Serializable{
 String firstBp;
 String lastBp;
 String name;

 public Gene(String name, String firstBp, String lastBp) {
    this.name = name;
    this.firstBp = firstBp;
    this.lastBp = lastBp;
 }

 public String getFirstBp() {
    return firstBp;
 }

 public String getLastBp() {
    return lastBp;
 }

 public String getName() {
    return name;
 }

 public String toString(){
    return name + " " + firstBp + " " + lastBp;
 }
}

This is where the problem starts. I need to check whether two genes overlap, and for that I wrote this simple utility function:

 public static Boolean isOverlay(Gene gene1, Gene gene2){
    int gene1First = Integer.parseInt(gene1.getFirstBp());
    int gene1Last = Integer.parseInt(gene1.getLastBp());
    int gene2First = Integer.parseInt(gene2.getFirstBp());
    int gene2Last = Integer.parseInt(gene2.getLastBp());

    if(gene2First >= gene1First && gene2First <= gene1Last) // FirstBp - Gene2 inside
        return true;
    else if (gene2Last >= gene1First && gene2Last <= gene1Last) // LastBP - Gene2 inside
        return true;
    else if (gene1First >= gene2First && gene1First <= gene2Last) // FirstBp - Gene1 inside
        return true;
    else if (gene1Last >= gene2First && gene1Last <= gene2Last) // LastBP - Gene1 inside
        return true;
    else
        return false;
}
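
For closed ranges, the four branches above are equivalent to a single condition: two ranges overlap exactly when each one starts no later than the other ends. A minimal sketch of that check, assuming the coordinates have already been parsed to `int` and that `first <= last` holds for each gene (the class and method names are illustrative and not part of the original code):

```java
// Illustrative helper; the names are hypothetical and not from the original post.
final class IntervalOverlap {
    private IntervalOverlap() {}

    /**
     * Returns true when the closed ranges [first1, last1] and [first2, last2]
     * share at least one position. Assumes first1 <= last1 and first2 <= last2.
     */
    static boolean overlaps(int first1, int last1, int first2, int last2) {
        return first1 <= last2 && first2 <= last1;
    }
}
```

Parsing `firstBp` and `lastBp` once, for example when the `Gene` is constructed, would also avoid calling `Integer.parseInt` four times per comparison.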

What I am doing now, and I think it is wrong, is converting the RDD into a list like this:

 List<Gene> genesList = inputfile.collect();

I then iterate over that list to check for overlaps and save the results to a file, so this part does not make use of Spark.

 List<OverlayPair> overlayPairList= new ArrayList<OverlayPair>();
 List<String> visitedGenes = new ArrayList<String>();

 for (Gene gene1 : genesList){

        for (Gene gene2 : genesList) {
            if (gene1.getName().equalsIgnoreCase(gene2.getName()) || visitedGenes.contains(gene2.getName())) {
                continue;
            }

            if (isOverlay(gene1, gene2))
                overlayPairList.add(new OverlayPair(gene1.getName(), gene2.getName()));

        }
        visitedGenes.add(gene1.getName());
    }

    JavaRDD<OverlayPair> overlayFile = sc.parallelize(overlayPairList);

    //Export the results to the file
    String outputDirectory = "/Users/joaoalmeida/Desktop/Dissertacao/sol/data/mitocondrias/feup-pp/project/data/output/overlays";
    overlayFile.coalesce(1).saveAsTextFile(outputDirectory);
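
A side note on the loop above: `visitedGenes.contains(...)` scans an `ArrayList` linearly, so the loop does more work than the number of pairs alone would suggest. As a sketch, iterating over index pairs `i < j` visits each unordered pair exactly once without a visited list, assuming gene names are unique in the input. `NamedRange` and `PairwiseOverlaps` are hypothetical stand-ins for `Gene` and the driver code:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for Gene with coordinates already parsed to int.
final class NamedRange {
    final String name;
    final int first;
    final int last;

    NamedRange(String name, int first, int last) {
        this.name = name;
        this.first = first;
        this.last = last;
    }
}

final class PairwiseOverlaps {
    private PairwiseOverlaps() {}

    /** Returns "nameA nameB" for every unordered pair of overlapping ranges. */
    static List<String> find(List<NamedRange> ranges) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i < ranges.size(); i++) {
            NamedRange a = ranges.get(i);
            // Starting at i + 1 skips self-pairs and pairs already seen in reverse order.
            for (int j = i + 1; j < ranges.size(); j++) {
                NamedRange b = ranges.get(j);
                if (a.first <= b.last && b.first <= a.last) {
                    result.add(a.name + " " + b.name);
                }
            }
        }
        return result;
    }
}
```

This is still quadratic in the number of genes; it only removes the extra list lookup on every inner iteration.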

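OverlayPair is basically an object that holds two gene names.

Is there a way to do this second part while taking advantage of Spark? The time complexity of those two nested loops is too high for the amount of data I currently have.

1 Answer:

Answer 0 (score: 1)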

Yes, there is. You have to use the function that returns all pairs of elements, and then you can basically apply the function you wrote to each pair.
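
The function name is not legible in this copy of the answer. Assuming it refers to an all-pairs (Cartesian product) operation such as `cartesian` on `JavaRDD`, the approach has three steps: pair every gene with every gene, keep each unordered pair once, and filter with the overlap check. The sketch below shows that pair-then-filter shape on a plain `List` with Java streams so it runs without a Spark context; the comments indicate how each step might map onto RDD operations. `GeneRange` and `AllPairsSketch` are hypothetical names, and deduplicating by name order assumes gene names are unique:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical stand-in for Gene with coordinates already parsed to int.
final class GeneRange {
    final String name;
    final int first;
    final int last;

    GeneRange(String name, int first, int last) {
        this.name = name;
        this.first = first;
        this.last = last;
    }
}

final class AllPairsSketch {
    private AllPairsSketch() {}

    static boolean overlaps(GeneRange a, GeneRange b) {
        return a.first <= b.last && b.first <= a.last;
    }

    /** Returns "nameA nameB" for every unordered pair of overlapping genes. */
    static List<String> overlappingPairs(List<GeneRange> genes) {
        return genes.stream()
                // Step 1: all ordered pairs (a, b).
                // With an RDD this would be (assumption): genes.cartesian(genes)
                .flatMap(a -> genes.stream()
                        .map(b -> new SimpleEntry<GeneRange, GeneRange>(a, b)))
                // Step 2: keep each unordered pair once and drop self-pairs.
                // With an RDD: .filter(t -> t._1().name.compareTo(t._2().name) < 0)
                .filter(p -> p.getKey().name.compareTo(p.getValue().name) < 0)
                // Step 3: apply the overlap check to the remaining pairs.
                // With an RDD: .filter(t -> overlaps(t._1(), t._2()))
                .filter(p -> overlaps(p.getKey(), p.getValue()))
                .map(p -> p.getKey().name + " " + p.getValue().name)
                .collect(Collectors.toList());
    }
}
```

A Cartesian product still produces a quadratic number of pairs. What Spark would add is distributing that work across executors instead of collecting everything to the driver.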