Spark application does not stop when multiple threads share the same Spark context

Date: 2017-01-06 08:02:35

Tags: java apache-spark

I am trying to reproduce an issue I am facing. My problem statement: a folder contains multiple files. I need to run a word count on each file and print the results, and each file should be processed in parallel (with a bounded degree of parallelism, of course). I have written the code below to do this, and it runs fine. The cluster runs MapR's distribution of Spark with spark.scheduler.mode = FIFO.


Q1 - Is there a better way to accomplish this task?

Q2 - I have noticed that the application does not stop even after it has finished counting all of the available files. I cannot figure out how to handle this.

package groupId.artifactId;

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class Executor {

    /**
     * @param args
     */
    public static void main(String[] args) {    
        final int threadPoolSize = 5;       
        SparkConf sparkConf = new SparkConf().setMaster("yarn-client").setAppName("Tracker").set("spark.ui.port","0");
        JavaSparkContext jsc = new JavaSparkContext(sparkConf); 
        ExecutorService executor = Executors.newFixedThreadPool(threadPoolSize);
        List<Future> listOfFuture = new ArrayList<Future>();
        for (int i = 0; i < 20; i++) {
            if (listOfFuture.size() < threadPoolSize) {
                FlexiWordCount flexiWordCount = new FlexiWordCount(jsc, i);
                Future future = executor.submit(flexiWordCount);
                listOfFuture.add(future);               
            } else {
                boolean allFutureDone = false;
                while (!allFutureDone) {
                    allFutureDone = checkForAllFuture(listOfFuture);
                    System.out.println("Threads not completed yet!");
                    try {
                        Thread.sleep(2000);//waiting for 2 sec, before next check
                    } catch (InterruptedException e) {
                        // TODO Auto-generated catch block
                        e.printStackTrace();
                    }
                }
                printFutureResult(listOfFuture);
                System.out.println("printing of future done");
                listOfFuture.clear();
                System.out.println("future list got cleared");
            }

        }
        try {
            executor.awaitTermination(5, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
        }



    private static void printFutureResult(List<Future> listOfFuture) {
        Iterator<Future> iterateFuture = listOfFuture.iterator();
        while (iterateFuture.hasNext()) {
            Future tempFuture = iterateFuture.next();
            try {
                System.out.println("Future result " + tempFuture.get());
            } catch (InterruptedException e) {
                // TODO Auto-generated catch block
                e.printStackTrace();
            } catch (ExecutionException e) {
                // TODO Auto-generated catch block
                e.printStackTrace();
            }
        }
    }
    private static boolean checkForAllFuture(List<Future> listOfFuture) {
        boolean status = true;
        Iterator<Future> iterateFuture = listOfFuture.iterator();
        while (iterateFuture.hasNext()) {
            Future tempFuture = iterateFuture.next();
            if (!tempFuture.isDone()) {
                status = false;
                break;
            }
        }
        return status;

    }
}

package groupId.artifactId;

import java.io.Serializable;
import java.util.Arrays;
import java.util.concurrent.Callable;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;

import scala.Tuple2;

public class FlexiWordCount implements Callable<Object>,Serializable {


    private static final long serialVersionUID = 1L;
    private JavaSparkContext jsc;
    private int fileId;

    public FlexiWordCount(JavaSparkContext jsc, int fileId) {
        super();
        this.jsc = jsc;
        this.fileId = fileId;
    }
    private static class Reduction implements Function2<Integer, Integer, Integer>{
        @Override
        public Integer call(Integer i1, Integer i2) {
            return i1 + i2;
        }
    }

    private static class KVPair implements PairFunction<String, String, Integer>{
        @Override
        public Tuple2<String, Integer> call(String paramT)
                throws Exception {
            return new Tuple2<String, Integer>(paramT, 1);
        }
    }
    private static class Flatter implements FlatMapFunction<String, String>{

        @Override
        public Iterable<String> call(String s) {
            return Arrays.asList(s.split(" "));
        }
    }
    @Override
    public Object call() throws Exception { 
        JavaRDD<String> jrd = jsc.textFile("/root/folder/experiment979/" + fileId +".txt");
        System.out.println("inside call() for fileId = " + fileId);
        JavaRDD<String> words = jrd.flatMap(new Flatter());
        JavaPairRDD<String, Integer> ones = words.mapToPair(new KVPair());      
        JavaPairRDD<String, Integer> counts = ones.reduceByKey(new Reduction());
        return counts.collect();
    }
}

1 Answer:

Answer 0 (score: 0)

Why doesn't the program shut down on its own?

Ans: You have not closed the SparkContext. Try changing the main method to:

public static void main(String[] args) {    
    final int threadPoolSize = 5;       
    SparkConf sparkConf = new SparkConf().setMaster("yarn-client").setAppName("Tracker").set("spark.ui.port","0");
    JavaSparkContext jsc = new JavaSparkContext(sparkConf); 
    ExecutorService executor = Executors.newFixedThreadPool(threadPoolSize);
    List<Future> listOfFuture = new ArrayList<Future>();
    for (int i = 0; i < 20; i++) {
        if (listOfFuture.size() < threadPoolSize) {
            FlexiWordCount flexiWordCount = new FlexiWordCount(jsc, i);
            Future future = executor.submit(flexiWordCount);
            listOfFuture.add(future);               
        } else {
            boolean allFutureDone = false;
            while (!allFutureDone) {
                allFutureDone = checkForAllFuture(listOfFuture);
                System.out.println("Threads not completed yet!");
                try {
                    Thread.sleep(2000);//waiting for 2 sec, before next check
                } catch (InterruptedException e) {
                    // TODO Auto-generated catch block
                    e.printStackTrace();
                }
            }
            printFutureResult(listOfFuture);
            System.out.println("printing of future done");
            listOfFuture.clear();
            System.out.println("future list got cleared");
        }

    }
    try {
        executor.awaitTermination(5, TimeUnit.MINUTES);
    } catch (InterruptedException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    jsc.stop();
    }
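
One detail worth adding, which the original answer does not spell out: the worker threads created by Executors.newFixedThreadPool are non-daemon, and awaitTermination() cannot observe termination unless shutdown() (or shutdownNow()) has been called, so without it the call simply blocks for the full timeout and the pool threads can keep the JVM alive even after jsc.stop(). A minimal sketch of the full teardown sequence:

executor.shutdown();  // stop accepting new tasks so awaitTermination can actually complete
try {
    // wait for the already-submitted word-count jobs to finish
    executor.awaitTermination(5, TimeUnit.MINUTES);
} catch (InterruptedException e) {
    Thread.currentThread().interrupt();
} finally {
    jsc.stop();  // release the SparkContext so the application can exit
}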

Is there a better way?

Answer: Yes. You should pass the directory of files to the SparkContext and call .textFile on that directory; Spark will then parallelize reading the files in the directory across the executors. If you instead create threads yourself and resubmit a job per file against the same Spark context, you add the extra overhead of submitting the application to the YARN queue.
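
For illustration, here is a minimal sketch of that approach. The class name DirectoryWordCount is made up for this example, the directory path comes from the question code, and Java 8 lambdas stand in for the static helper classes used above; note that reading the directory as one RDD yields a single combined count over all files rather than a separate count per file.

package groupId.artifactId;

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class DirectoryWordCount {

    public static void main(String[] args) {
        SparkConf sparkConf = new SparkConf().setMaster("yarn-client").setAppName("DirectoryWordCount");
        JavaSparkContext jsc = new JavaSparkContext(sparkConf);

        // one RDD over every file in the directory; Spark parallelizes the reads across executors
        JavaRDD<String> lines = jsc.textFile("/root/folder/experiment979");

        JavaRDD<String> words = lines.flatMap(line -> Arrays.asList(line.split(" ")));
        JavaPairRDD<String, Integer> counts = words
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey((a, b) -> a + b);

        System.out.println(counts.collect());
        jsc.stop();
    }
}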

I think the fastest approach is to pass the entire directory directly, create the RDD from it, and let Spark launch parallel tasks to process all the files across the different executors. You can also try the RDD's .repartition() method, since it spawns many tasks that run in parallel.
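
For example, continuing the sketch above (the partition count 20 is arbitrary and chosen only for illustration):

JavaRDD<String> lines = jsc.textFile("/root/folder/experiment979")
        .repartition(20); // spread the data over more partitions so more tasks can run in parallel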