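How do I access a Java Spark Broadcast variable?

Time: 2018-08-28 10:36:28

Tags: java apache-spark

I am trying to broadcast a Dataset so that I can access it from inside a map function. The first print statement returns the first row of the broadcast dataset, as expected. Unfortunately, the second print statement returns nothing; execution simply hangs at that point. Any idea what I am doing wrong?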

    Broadcast<JavaRDD<Row>> broadcastedTrainingData = this.javaSparkContext.broadcast(trainingData.toJavaRDD());

    System.out.println("Data:" + broadcastedTrainingData.value().first());
    JavaRDD<Row> rowRDD = this.javaSparkContext.parallelize(stringAsList).map((Integer row) -> {
        System.out.println("Data (map):" + broadcastedTrainingData.value().first());
        return RowFactory.create(row);
    });

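The following pseudocode highlights what I am trying to achieve. My main goal is to broadcast the training dataset so that I can use it inside a map function.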

    public Dataset<Row> getWSSE(Dataset<Row> trainingData, int clusterRange) {
        StructType structType = new StructType();
        structType = structType.add("ClusterAm", DataTypes.IntegerType, false);
        structType = structType.add("Cost", DataTypes.DoubleType, false);

        List<Integer> stringAsList = new ArrayList<>();
        for (int clusterAm = 2; clusterAm < clusterRange + 2; clusterAm++) {
            stringAsList.add(clusterAm);
        }

        Broadcast<Dataset> broadcastedTrainingData = this.javaSparkContext.broadcast(trainingData);

        System.out.println("Data:" + broadcastedTrainingData.value().first());
        JavaRDD<Row> rowRDD = this.javaSparkContext.parallelize(stringAsList).map((Integer row) -> RowFactory.create(row));

        StructType schema = DataTypes.createStructType(new StructField[]{DataTypes.createStructField("ClusterAm", DataTypes.IntegerType, false)});

        Dataset wsse = sqlContext.createDataFrame(rowRDD, schema).toDF();
        wsse.show();

        ExpressionEncoder<Row> encoder = RowEncoder.apply(structType);

        Dataset result = wsse.map(
                (MapFunction<Row, Row>) row -> RowFactory.create(row.getAs("ClusterAm"), new KMeans().setK(row.getAs("ClusterAm")).setSeed(1L).fit(broadcastedTrainingData.value()).computeCost(broadcastedTrainingData.value())),
                encoder);

        result.show();
        broadcastedTrainingData.destroy();
        return wsse;
    }
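
Note on the pseudocode above: a `Dataset` or RDD is a handle that only the driver can act on, so calling `fit` or `first()` on it from inside a map function that runs on the executors is not supported, whether or not the handle was broadcast. One possible way to build the same table is to loop over the cluster counts on the driver and let each `fit` call distribute its own work. The sketch below is only an illustration under assumptions: the class name `WsseSketch`, the helper `clusterCounts`, and the `spark` parameter are hypothetical, `trainingData` is assumed to already have a `features` vector column, and `computeCost` is the Spark 2.x method used in the question (it was removed in Spark 3.0).

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.ml.clustering.KMeans;
import org.apache.spark.ml.clustering.KMeansModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Hypothetical class name; not part of the original question.
class WsseSketch {

    // Same range as the loop in the question: 2, 3, ..., clusterRange + 1.
    static List<Integer> clusterCounts(int clusterRange) {
        List<Integer> counts = new ArrayList<>();
        for (int clusterAm = 2; clusterAm < clusterRange + 2; clusterAm++) {
            counts.add(clusterAm);
        }
        return counts;
    }

    // Fits one KMeans model per cluster count on the driver. Each fit() call is
    // itself a distributed job, so nothing needs to be broadcast here.
    // Assumes Spark 2.x and a "features" vector column in trainingData.
    static Dataset<Row> getWSSE(SparkSession spark, Dataset<Row> trainingData, int clusterRange) {
        StructType schema = new StructType()
                .add("ClusterAm", DataTypes.IntegerType, false)
                .add("Cost", DataTypes.DoubleType, false);

        List<Row> rows = new ArrayList<>();
        for (int k : clusterCounts(clusterRange)) {
            KMeansModel model = new KMeans().setK(k).setSeed(1L).fit(trainingData);
            rows.add(RowFactory.create(k, model.computeCost(trainingData)));
        }
        // The result is tiny (one row per k), so it is built from a local list.
        return spark.createDataFrame(rows, schema);
    }
}
```

Calling `trainingData.cache()` before the loop may help, since every iteration scans the same data.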

1 Answer:

Answer 0: (score: 0)

        Dataset<Row> trainingData = ...<Your dataset>;

        // Create the broadcast variable. There is no need to write ClassTag code by hand;
        // akka.japi.Util provides a helper (this assumes Akka is on your classpath).

        Broadcast<Dataset<Row>> broadcastedTrainingData = spark.sparkContext()
              .broadcast(trainingData, akka.japi.Util.classTag(Dataset.class));

        // Here is the catch. When you iterate over a Dataset,
        // Spark actually runs the function in distributed mode. So if you try to access
        // your object directly (e.g. trainingData), it would be null,
        // because you did not explicitly ask Spark to send that outside variable to
        // each machine on which the foreach runs in parallel.
        // So you need to use a Broadcast variable (the most common use of Broadcast).

        // The cast picks the Java overload of foreach; a bare lambda is ambiguous.
        someSparkDataSet.foreach((ForeachFunction<Row>) row -> {
            Dataset<Row> receivedBroadcast = broadcastedTrainingData.value();
            ...
            ...
        });
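
A caveat on the snippet above: the broadcast value is still a `Dataset` handle, and running actions on it from inside `foreach` or `map` on an executor is not supported, which would be consistent with the hang described in the question. If the training data is small enough to fit in driver and executor memory, one alternative is to collect it to a plain `List<Row>` and broadcast that list instead. The following is a minimal sketch under that assumption; the class name `BroadcastRowsSketch`, the method `tagWithFirstRow`, and the sample data in `main` are hypothetical, and the dataset is assumed to be non-empty.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Hypothetical class name; not part of the original answer.
class BroadcastRowsSketch {

    // Collects the (assumed small, non-empty) dataset to the driver, broadcasts
    // the resulting local list, and reads it inside a map function on the executors.
    static List<String> tagWithFirstRow(JavaSparkContext jsc, Dataset<Row> trainingData, List<Integer> keys) {
        List<Row> localRows = trainingData.collectAsList();
        Broadcast<List<Row>> broadcastRows = jsc.broadcast(localRows);

        List<String> result = jsc.parallelize(keys)
                // value() returns an ordinary java.util.List here, not a Dataset.
                .map(k -> k + ":" + broadcastRows.value().get(0))
                .collect();

        // Release the broadcast only after the job that uses it has finished.
        broadcastRows.destroy();
        return result;
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .master("local[2]")
                .appName("broadcast-rows-sketch")
                .getOrCreate();
        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

        // Stand-in for the real training data.
        Dataset<Row> trainingData = spark.range(3).toDF("id");
        System.out.println(tagWithFirstRow(jsc, trainingData, Arrays.asList(2, 3, 4)));

        spark.stop();
    }
}
```

This does not help when the executor-side code needs a real `Dataset`, as `KMeans.fit` does; in that case the work has to be driven from the driver.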