Spark job seems not to parallelize well

Date: 2014-11-19 21:44:49

Tags: hadoop bigdata apache-spark google-hadoop

Using Spark 1.1

My job flow is as follows:

  1. Read the list of folders under a given root, and parallelize the list.
  2. For each folder, read the files under it - these are gzip-compressed files.
  3. For each file, extract the content - these are lines, each line representing a single event, with fields separated by tabs (TSV).
  4. Create a single RDD of all the lines.
  5. Convert the TSV to JSON.
  6. (Now the lines represent a certain event type. There are 4 types: session, request, recommendation, user event.)

    1. Filter out only the session events. Sample only 1:100 of them according to some user-id field. Convert them to a pair, with a key that represents some output structure (like: event type/date/the events), then write it to the FS.
    2. Do the same for requests and user events.
    3. (For recommendations, the sampling cannot be done according to user id (it does not exist there), but we know there is a 1:1 relation between requests and recommendations based on a mutual request-id field. So:)

      1. Create a list of the distinct request ids. Join this list with the recommendations, using the request id as the key, thus achieving the filtering we want. Then output the reduced list to the FS.

Now, here is my problem. The code I use to do these things works at small scale. But when I run it on relatively large input, on a cluster of 80 machines, each with 8 cores and 50GB of memory, I can see that many of the machines are not utilized, meaning only one core is occupied (and only at ~20%), and memory sits at only 16GB out of the 40GB committed to the job.

I think my transformations do not get parallelized well somewhere, but I am not sure where or why. Here is most of my code (I omit some auxiliary functions that I think are irrelevant to the problem):

         public static void main(String[] args) {
        
            BasicConfigurator.configure();
        
            conf[0] = new Conf("local[4]");
            conf[1] = new Conf("spark://hadoop-m:7077");
            Conf configuration = conf[1];
        
            if (args.length != 4) {
                log.error("Error in parameters. Syntax: <input path> <output_path> <filter_factor> <locality>\nfilter_factor is what fraction of sessions to process. For example, to process 1/100 of sessions, use 100\nlocality should be set to \"local\" in case running on local environment, and to \"remote\" otherwise.");
                System.exit(-1);
            }
        
            final String inputPath = args[0];
            final String outputPath = args[1];
            final Integer filterFactor;
        
            if (args[3].equals("local")) {
                configuration = conf[0];
            }
        
            log.setLevel(Level.DEBUG);
            Logger.getRootLogger().removeAppender("console");
            final SparkConf conf = new SparkConf().setAppName("phase0").setMaster(configuration.getMaster());
            conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
            conf.set("spark.kryo.registrator", "com.doit.customer.dataconverter.MyRegistrator");
            final JavaSparkContext sc = new JavaSparkContext(conf);
            if (configuration.getMaster().contains("spark:")) {
                sc.addJar("/home/hadoop/hadoop-install/phase0-1.0-SNAPSHOT-jar-with-dependencies.jar");
            }
            try {
                filterFactor = Integer.parseInt(args[2]);
                // read all folders from root
                Path inputPathObj = new Path(inputPath);
                FileSystem fs = FileSystem.get(inputPathObj.toUri(), new Configuration(true));
                FileStatus[] statusArr = fs.globStatus(inputPathObj);
                List<FileStatus> statusList = Arrays.asList(statusArr);
        
                List<String> pathsStr = convertFileStatusToPath(statusList);
        
                JavaRDD<String> paths = sc.parallelize(pathsStr);
        
                // read all files from each folder
                JavaRDD<String> filePaths = paths.mapPartitions(new FlatMapFunction<Iterator<String>, String>() {
                    @Override
                    public Iterable<String> call(Iterator<String> pathsIterator) throws Exception {
                        List<String> filesPath = new ArrayList<String>();
                        if (pathsIterator != null) {
                            while (pathsIterator.hasNext()) {
                                String currFolder = pathsIterator.next();
                                Path currPath = new Path(currFolder);
                                FileSystem fs = FileSystem.get(currPath.toUri(), new Configuration(true));
                                FileStatus[] files = fs.listStatus(currPath);
                                List<FileStatus> filesList = Arrays.asList(files);
                                List<String> filesPathsStr = convertFileStatusToPath(filesList);
                                filesPath.addAll(filesPathsStr);
                            }
                        }
                        return filesPath;
                    }
                });
        
        
                // Transform list of files to list of all files' content in lines
                JavaRDD<String> typedData = filePaths.map(new Function<String, List<String>>() {
                    @Override
                    public List<String> call(String filePath) throws Exception {
                        try {
                            String fileType = null;
                            List<String> linesList = new ArrayList<String>();
                            Configuration conf = new Configuration();
                            CompressionCodecFactory compressionCodecs = new CompressionCodecFactory(conf);
                            Path path = new Path(filePath);
                            fileType = getType(path.getName());
        
                            // filter non-trc files
                            if (!path.getName().startsWith("1")) {
                                return linesList;
                            }
        
                            CompressionCodec codec = compressionCodecs.getCodec(path);
                            FileSystem fs = path.getFileSystem(conf);
                            InputStream in = fs.open(path);
                            if (codec != null) {
                                in = codec.createInputStream(in);
                            } else {
                                throw new IOException("No compression codec found for file " + filePath);
                            }
        
                            BufferedReader r = new BufferedReader(new InputStreamReader(in, "UTF-8"), BUFFER_SIZE);
        
                            // The first line is read but not added to the list,
                            // which is what we want - it skips the header row
                            String line = r.readLine();
        
                            // Read all lines
                            while ((line = r.readLine()) != null) {
                                try {
                                    String sliceKey = getSliceKey(line, fileType);
                                    // Adding event type and output slice key as additional fields
                                    linesList.add(fileType + "\t" + sliceKey + "\t" + line);
                                } catch(ParseException e) {
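                                    // Skip lines whose slice key cannot be parsed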
                                }
                            }
        
                            return linesList;
                        } catch (Exception e) { // Filtering of files whose reading went wrong
                            log.error("Reading of the file " + filePath + " went wrong: " + e.getMessage());
                            return new ArrayList<String>();
                        }
                    }
                    // flatten to one big list with all the lines
                }).flatMap(new FlatMapFunction<List<String>, String>() {
                    @Override
                    public Iterable<String> call(List<String> strings) throws Exception {
                        return strings;
                    }
                });
        
                // convert tsv to json
        
                JavaRDD<ObjectNode> jsons = typedData.mapPartitions(new FlatMapFunction<Iterator<String>, ObjectNode>() {
                    @Override
                    public Iterable<ObjectNode> call(Iterator<String> stringIterator) throws Exception {
                        List<ObjectNode> res = new ArrayList<>();
                        while(stringIterator.hasNext()) {
                            String currLine = stringIterator.next();
                            Iterator<String> i = Splitter.on("\t").split(currLine).iterator();
                            if (i.hasNext()) {
                                String type = i.next();
                                ObjectNode json = convert(currLine, type, filterFactor);
                                if(json != null) {
                                    res.add(json);
                                }
                            }
                        }
                        return res;
                    }
                }).cache();
        
        
                createOutputType(jsons, "Session", outputPath, null);
                createOutputType(jsons, "UserEvent", outputPath, null);
                JavaRDD<ObjectNode> requests = createOutputType(jsons, "Request", outputPath, null);
        
        
                // Now leave only the set of request ids - to inner join with the recommendations
                JavaPairRDD<String,String> requestsIds = requests.mapToPair(new PairFunction<ObjectNode, String, String>() {
                    @Override
                    public Tuple2<String, String> call(ObjectNode jsonNodes) throws Exception {
                        String id = jsonNodes.get("id").asText();
                        return new Tuple2<String, String>(id,id);
                    }
                }).distinct();
        
                createOutputType(jsons,"RecommendationList", outputPath, requestsIds);
        
            } catch (IOException e) {
                log.error(e);
                System.exit(1);
            } catch (NumberFormatException e) {
                log.error("filter factor is not a valid number!!");
                System.exit(-1);
            }
        
            sc.stop();
        
        }
        
        private static JavaRDD<ObjectNode> createOutputType(JavaRDD jsonsList, final String type, String outputPath,JavaPairRDD<String,String> joinKeys) {
        
            outputPath = outputPath + "/" + type;
        
            JavaRDD events = jsonsList.filter(new Function<ObjectNode, Boolean>() {
                @Override
                public Boolean call(ObjectNode jsonNodes) throws Exception {
                    return jsonNodes.get("type").asText().equals(type);
                }
            });
        
        
            // This is in case we need to narrow the list to match some other list of ids... Recommendation List, for example... :)
            if(joinKeys != null) {
                JavaPairRDD<String,ObjectNode> keyedEvents = events.mapToPair(new PairFunction<ObjectNode, String, ObjectNode>() {
                    @Override
                    public Tuple2<String, ObjectNode> call(ObjectNode jsonNodes) throws Exception {
                        return new Tuple2<String, ObjectNode>(jsonNodes.get("requestId").asText(),jsonNodes);
                    }
                });
        
                JavaRDD<ObjectNode> joinedEvents = joinKeys.join(keyedEvents).values().map(new Function<Tuple2<String, ObjectNode>, ObjectNode>() {
                   @Override
                   public ObjectNode call(Tuple2<String, ObjectNode> stringObjectNodeTuple2) throws Exception {
                       return stringObjectNodeTuple2._2;
                   }
                });
                events = joinedEvents;
            }
        
        
            JavaPairRDD<String,Iterable<ObjectNode>> groupedEvents = events.mapToPair(new PairFunction<ObjectNode, String, ObjectNode>() {
                @Override
                public Tuple2<String, ObjectNode> call(ObjectNode jsonNodes) throws Exception {
                    return new Tuple2<String, ObjectNode>(jsonNodes.get("sliceKey").asText(),jsonNodes);
                }
            }).groupByKey();
            // Convert the jsons to strings and add "\n" at the end of each
        
            JavaPairRDD<String, String> groupedStrings = groupedEvents.mapToPair(new PairFunction<Tuple2<String, Iterable<ObjectNode>>, String, String>() {
                @Override
                public Tuple2<String, String> call(Tuple2<String, Iterable<ObjectNode>> content) throws Exception {
                    String string = jsonsToString(content._2);
                    log.error(string);
                    return new Tuple2<>(content._1, string);
                }
            });
            groupedStrings.saveAsHadoopFile(outputPath, String.class, String.class, KeyBasedMultipleTextOutputFormat.class);
            return events;
        }
        
        // Notice the special case of if(joinKeys != null) in which I join the recommendations with request ids.
        

Finally, the command I use to launch the Spark job is:

        spark-submit --class com.doit.customer.dataconverter.Phase0 --driver-cores 8 --total-executor-cores 632 --driver-memory 40g --executor-memory 40G --deploy-mode cluster /home/hadoop/hadoop-install/phase0-1.0-SNAPSHOT-jar-with-dependencies.jar gs://input/2014_07_31* gs://output/2014_07_31 100 remote
        

1 Answer:

Answer 0 (score: 2)

Your initial partitioning is based on the set of folders under your root (sc.parallelize(pathsStr)). There are two steps in your flow that can significantly unbalance your partitions: 1) reading the list of files within each folder, if some folders contain many more files than others; and 2) reading the TSV lines from each file, if some files contain many more lines than others.
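For reference, here is a minimal sketch of where that initial split happens, using the names from your code; the explicit slice count is optional and shown only to make the split visible (without it, parallelize falls back to the default parallelism):

    // The folder list is divided into slices here; every file (and every line)
    // read from a given folder later stays in the slice that folder landed in.
    JavaRDD<String> paths = sc.parallelize(pathsStr, sc.defaultParallelism());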

If your files are roughly the same size, but some folders contain many more of them than others, you can rebalance the partitions after collecting the file names. After setting the initial value of filePaths, try adding this line:

    filePaths = filePaths.repartition(sc.defaultParallelism());

This will shuffle the collected file names into balanced partitions.
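If you want to confirm the effect, a small sanity check (using the standard RDD partitions() API and the `log` logger from your code) can log the partition count around that call:

    // Compare partition counts before and after the repartition call
    int before = filePaths.partitions().size();
    filePaths = filePaths.repartition(sc.defaultParallelism());
    int after = filePaths.partitions().size();
    log.info("filePaths partitions: " + before + " -> " + after);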

If the imbalance is instead due to some files being much larger than others, you can try rebalancing the typedData RDD by calling repartition on it in the same way, although this will be much more expensive, since it shuffles all of your TSV data.
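A sketch of that variant, assuming the rest of the pipeline stays unchanged (note that this shuffles every extracted TSV line):

    // More expensive: redistributes all of the extracted lines across the cluster
    typedData = typedData.repartition(sc.defaultParallelism());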

Alternatively, if rebalancing filePaths still leaves some partition imbalance, caused by a number of somewhat larger files ending up in a few partitions, you may get somewhat better performance by passing a larger number to repartition, for example multiplying by four so that you get four times as many partitions as cores. This increases the communication cost a bit, but can be a win if it yields better-balanced partition sizes in typedData.
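For example (the factor of 4 is only the illustrative multiplier mentioned above, not a tuned value):

    // Aim for roughly four partitions per available core
    filePaths = filePaths.repartition(sc.defaultParallelism() * 4);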