How do I create a Spark broadcast variable from a Java String array?

Asked: 2015-09-12 10:02:16

Tags: apache-spark

Hello, I have a Java String array that contains 45 strings, which are basically column names:

String[] fieldNames = {"colname1","colname2",...}; 

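Currently I keep the String array above in a static field in the Spark driver. My job runs slowly, so I am trying to refactor the code. I use this String array when creating a DataFrame: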

DataFrame dfWithColNames = sourceFrame.toDF(fieldNames); 

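I would like to use a broadcast variable for this so that the large String array is not sent to every executor. I believe we can create the broadcast with something like the following: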

String[] brArray = sc.broadcast(fieldNames,String[].class);//gives compilation error 

DataFrame df = sourceFrame.toDF(???); // how do I use the broadcast above? Can I pass brArray as is?

Please guide me; I am new to Spark. Many thanks.

3 answers:

Answer 0 (score: 8)

sc.broadcast returns a Broadcast<String[]>, not a String[]. To access the value, just call value() on the variable. For your example, that would be:

Broadcast<String[]> broadcastedFieldNames = sc.broadcast(fieldNames);
DataFrame df = sourceFrame.toDF(broadcastedFieldNames.value());

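Note that if you are writing this in Java, you may want to wrap the SparkContext in a JavaSparkContext. It makes everything easier, and you can then avoid passing a ClassTag to the broadcast function.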

You can read more about broadcast variables at http://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables
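To make the JavaSparkContext suggestion concrete, here is a minimal sketch rather than the asker's actual job. The class name BroadcastFieldNames, the helper countKnownColumns, the local[1] master and the sample strings are all assumptions made for illustration. It shows that JavaSparkContext.broadcast takes no ClassTag and that value() can be called inside a function that runs on the executors. If you already hold a Scala SparkContext sc, new JavaSparkContext(sc) wraps it.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;

public class BroadcastFieldNames {

    // Hypothetical helper: counts how many candidate strings match one of
    // the broadcast column names.
    static long countKnownColumns(JavaSparkContext jsc,
                                  String[] fieldNames,
                                  List<String> candidates) {
        // JavaSparkContext.broadcast needs no ClassTag argument.
        Broadcast<String[]> broadcastFieldNames = jsc.broadcast(fieldNames);

        // value() is called inside the lambda, so it is evaluated in the tasks.
        return jsc.parallelize(candidates)
                  .filter(name -> Arrays.asList(broadcastFieldNames.value()).contains(name))
                  .count();
    }

    public static void main(String[] args) {
        // local[1] and the app name are placeholders for a local trial run.
        SparkConf conf = new SparkConf()
                .setMaster("local[1]")
                .setAppName("broadcast-sketch");
        try (JavaSparkContext jsc = new JavaSparkContext(conf)) {
            String[] fieldNames = {"colname1", "colname2"};
            long known = countKnownColumns(
                    jsc, fieldNames, Arrays.asList("colname1", "other", "colname2"));
            System.out.println(known);
        }
    }
}
```

One thing to keep in mind: toDF(...) itself is called on the driver, so passing broadcastedFieldNames.value() to it just reads the driver's local copy. A broadcast is mainly useful for values read inside code that runs in tasks, as in the filter above.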

Answer 1 (score: 7)

This is a somewhat old question, but I hope my solution helps someone.

To broadcast any object (a single POJO or a collection) with Spark 2+, you first need a method like the following to create a ClassTag for you:

private static <T> ClassTag<T> classTag(Class<T> clazz) {
   return scala.reflect.ClassManifestFactory.fromClass(clazz);
}

Next, broadcast your object through the SparkContext obtained from the SparkSession:

   sparkSession.sparkContext().broadcast(
            yourObject,
            classTag(YourObject.class)
    )

If it is a collection, such as a java.util.List, use the following:

    sparkSession.sparkContext().broadcast(
            yourObject,
            classTag(List.class)
    )
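If you would rather not pass a raw List.class tag, one alternative is to wrap the session's SparkContext in a JavaSparkContext, whose broadcast method takes only the value. The sketch below assumes Spark 2+ with spark-sql on the classpath. The class name BroadcastFromSession, the helper broadcastNames, the local[1] master and the sample strings are illustrative assumptions, not part of the original answer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;
import org.apache.spark.sql.SparkSession;

public class BroadcastFromSession {

    // Hypothetical helper: broadcasts a list and keeps its element type.
    static Broadcast<List<String>> broadcastNames(SparkSession spark, List<String> names) {
        // Wrap the existing SparkContext; do not close this wrapper, because
        // closing it would stop the SparkContext the session still uses.
        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());
        return jsc.broadcast(names);
    }

    public static void main(String[] args) {
        // local[1] and the app name are placeholders for a local trial run.
        SparkSession spark = SparkSession.builder()
                .master("local[1]")
                .appName("broadcast-list-sketch")
                .getOrCreate();
        try {
            // The broadcast value must be serializable; ArrayList is.
            List<String> names = new ArrayList<>(Arrays.asList("string1", "stringn"));
            Broadcast<List<String>> broadcastNames = broadcastNames(spark, names);
            System.out.println(broadcastNames.value().size());
        } finally {
            spark.stop();
        }
    }
}
```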

Answer 2 (score: 0)

    ArrayList<String> dataToBroadcast = new ArrayList<>();
    dataToBroadcast.add("string1");
    ...
    dataToBroadcast.add("stringn");

    // Create the broadcast variable.
    // There is no need to write the classTag code by hand if Akka is on the
    // classpath: akka.japi.Util provides it.

    Broadcast<ArrayList<String>> strngBrdCast = spark.sparkContext().broadcast(
            dataToBroadcast,
            akka.japi.Util.classTag(ArrayList.class));

    // Here is the catch. When you iterate over a Dataset, Spark actually runs
    // the loop body in distributed mode. If you try to access your object
    // directly (e.g. dataToBroadcast), it may be null there, because you did
    // not explicitly ask Spark to send that outside variable to each machine
    // on which this foreach runs in parallel.
    // So you need to use a broadcast variable (the most common use of Broadcast).

    someSparkDataSetWhere.foreach((row) -> {
        ArrayList<String> stringlist = strngBrdCast.value();
        ...
        ...
    });