Creating a DataFrame in Spark from a collection in Scala

Time: 2018-02-21 17:02:24

Tags: scala apache-spark

I want to create a dummy DataFrame in Spark using Scala, like this:

+-----------+----+
|channel_set|rate|
+-----------+----+
|     [A, D]| 0.0|
|        [C]| 0.0|
|        [D]| 1.0|
|     [B, A]| 0.5|
+-----------+----+

I tried the following code:

val b = Array((Set("A","D"),0.0) , (Set("C"),0.0), (Set("D"),1.0), (Set("B","A"),0.5) )
val dummy_data = sc.parallelize(b).toDF("channel_set", "rate")

but I get this error:

scala> val dummy_data = sc.parallelize(b).toDF("channel_set", "rate")
java.lang.UnsupportedOperationException: No Encoder found for scala.collection.immutable.Set[java.lang.String]
- field (class: "scala.collection.immutable.Set", name: "_1")
- root class: "scala.Tuple2"

Please help.

2 answers:

Answer 0: (score: 2)

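DataFrames/Datasets need a static schema, i.e. a fixed data type for each column, and the error shows that Spark found no encoder for Set. Replacing each Set with an Array, and building the rows from a Seq or List, should work for you: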

val df = Seq(
  (Array("A","D"),0.0),
  (Array("C"),0.0),
  (Array("D"),1.0),
  (Array("B","A"),0.5)
).toDF("channel_set", "rate")

df.show(false)

You should get the following DataFrame:

+-----------+----+
|channel_set|rate|
+-----------+----+
|[A, D]     |0.0 |
|[C]        |0.0 |
|[D]        |1.0 |
|[B, A]     |0.5 |
+-----------+----+
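
If the data already exists as Sets, as in the question, one option is to convert each Set to a Seq before calling toDF. The following is only a minimal sketch, assuming a local SparkSession (in spark-shell, `spark` and its implicits are already available); the names `rows` and `fromSets` are illustrative. A Set has no guaranteed order, so the elements are sorted here to make the result deterministic, which means the last row comes out as [A, B] rather than [B, A].

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration; spark-shell already provides `spark`.
val spark = SparkSession.builder().master("local[*]").appName("set-to-seq").getOrCreate()
import spark.implicits._

// The data from the question, still expressed with Set.
val rows = Seq(
  (Set("A", "D"), 0.0),
  (Set("C"), 0.0),
  (Set("D"), 1.0),
  (Set("B", "A"), 0.5)
)

// Convert each Set to a sorted Seq on the driver, then build the DataFrame.
// Spark encodes Seq[String] as an array<string> column.
val fromSets = rows
  .map { case (channels, rate) => (channels.toSeq.sorted, rate) }
  .toDF("channel_set", "rate")

fromSets.printSchema()
fromSets.show(false)
```

The conversion happens on a plain Scala collection before Spark is involved, so no encoder for Set is ever requested.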

Answer 1: (score: 2)

If you look at the error message, it says that the Set type is not supported by Spark's SQL/DataFrame API:

java.lang.UnsupportedOperationException:
  No Encoder found for scala.collection.immutable.Set[java.lang.String]

See the list of data types supported by Spark SQL/DataFrame. That said, you can still use a Set inside a UDF if you need one.
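
To illustrate using a Set inside a UDF, here is a minimal sketch, assuming a local SparkSession and the array-typed channel_set column from the question. The UDF name `overlapsTarget`, the `target` set, and the `has_target` column are hypothetical and only serve the example. The array column reaches the UDF as a Seq[String], and the Set exists only inside the function body, so Spark never has to encode it.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

// Assumption: a local session for illustration; spark-shell already provides `spark`.
val spark = SparkSession.builder().master("local[*]").appName("set-in-udf").getOrCreate()
import spark.implicits._

val df = Seq(
  (Seq("A", "D"), 0.0),
  (Seq("C"), 0.0),
  (Seq("D"), 1.0),
  (Seq("B", "A"), 0.5)
).toDF("channel_set", "rate")

// Hypothetical example: flag rows whose channels intersect a target Set.
val target = Set("A", "C")
val overlapsTarget = udf { (channels: Seq[String]) =>
  channels.toSet.intersect(target).nonEmpty
}

// The new column is a plain boolean, which Spark supports directly.
val flagged = df.withColumn("has_target", overlapsTarget($"channel_set"))
flagged.show(false)
```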

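When creating a DataFrame, Spark treats Seq, List, and Array in a similar way. If you run printSchema and show on the following three DataFrames, you will find that they are identical.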

sc.parallelize(Array(
    (Array("A","D"),0.0) , (Array("C"),0.0), (Array("D"),1.0), (Array("B","A"),0.5)
  )).toDF("channel_set", "rate")

sc.parallelize(List(
    (List("A","D"),0.0) , (List("C"),0.0), (List("D"),1.0), (List("B","A"),0.5)
  )).toDF("channel_set", "rate")

sc.parallelize(Seq(
    (Seq("A","D"),0.0) , (Seq("C"),0.0), (Seq("D"),1.0), (Seq("B","A"),0.5)
  )).toDF("channel_set", "rate")

// res.printSchema
// root
//  |-- channel_set: array (nullable = true)
//  |    |-- element: string (containsNull = true)
//  |-- rate: double (nullable = false)

// res.show
// +-----------+----+
// |channel_set|rate|
// +-----------+----+
// |     [A, D]| 0.0|
// |        [C]| 0.0|
// |        [D]| 1.0|
// |     [B, A]| 0.5|
// +-----------+----+
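
The comparison can also be made programmatically instead of by reading the printed output. A minimal sketch, assuming a local SparkSession; the names `fromArray`, `fromList`, and `fromSeq` are illustrative and do not appear in the answer, and the rows are shortened to two per DataFrame.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration; spark-shell already provides `spark`.
val spark = SparkSession.builder().master("local[*]").appName("schema-compare").getOrCreate()
import spark.implicits._

val fromArray = Seq((Array("A", "D"), 0.0), (Array("C"), 0.0)).toDF("channel_set", "rate")
val fromList  = Seq((List("A", "D"), 0.0), (List("C"), 0.0)).toDF("channel_set", "rate")
val fromSeq   = Seq((Seq("A", "D"), 0.0), (Seq("C"), 0.0)).toDF("channel_set", "rate")

// StructType is a case class, so == compares field names, types, and nullability.
println(fromArray.schema == fromList.schema)
println(fromList.schema == fromSeq.schema)
```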