I want to create a dummy DataFrame in Spark using Scala, like the following:
+-----------+----+
|channel_set|rate|
+-----------+----+
| [A, D]| 0.0|
| [C]| 0.0|
| [D]| 1.0|
| [B, A]| 0.5|
+-----------+----+
I tried the following code:
val b = Array((Set("A","D"),0.0) , (Set("C"),0.0), (Set("D"),1.0), (Set("B","A"),0.5) )
val dummy_data = sc.parallelize(b).toDF("channel_set", "rate")
but I get this error:
scala> val dummy_data = sc.parallelize(b).toDF("channel_set", "rate")
java.lang.UnsupportedOperationException: No Encoder found for scala.collection.immutable.Set[java.lang.String]
- field (class: "scala.collection.immutable.Set", name: "_1")
- root class: "scala.Tuple2"
Please help.
Answer 0 (score: 2)
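The error comes from the Set elements rather than from the outer Array: the message says that no Encoder was found for scala.collection.immutable.Set[String]. A DataFrame/Dataset needs a static schema, i.e. fixed data types that Spark knows how to encode. Holding the channel names in an Array (a Seq or List works as well) and building the rows from a Seq should work for you.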
val df = Seq(
(Array("A","D"),0.0),
(Array("C"),0.0),
(Array("D"),1.0),
(Array("B","A"),0.5)
).toDF("channel_set", "rate")
df.show(false)
You should get the following DataFrame:
+-----------+----+
|channel_set|rate|
+-----------+----+
|[A, D] |0.0 |
|[C] |0.0 |
|[D] |1.0 |
|[B, A] |0.5 |
+-----------+----+
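If the data already exists as Set values, as in the question, one option is to convert each Set to a Seq before calling toDF. The following is a minimal sketch, assuming a local SparkSession; the names rows and dummyData are chosen only for this illustration.

```scala
import org.apache.spark.sql.SparkSession

// A local session for illustration (an assumption, not part of the question).
val spark = SparkSession.builder().master("local[*]").appName("set-to-seq-sketch").getOrCreate()
import spark.implicits._

// The data from the question, still using Set for the first field.
val b = Array((Set("A", "D"), 0.0), (Set("C"), 0.0), (Set("D"), 1.0), (Set("B", "A"), 0.5))

// Convert each Set to a Seq so Spark can map it to an array<string> column.
// A Set is unordered, so the element order inside each array is not guaranteed.
val rows = b.toSeq.map { case (channels, rate) => (channels.toSeq, rate) }
val dummyData = rows.toDF("channel_set", "rate")

dummyData.printSchema()
dummyData.show(false)
```

Because a Set carries no ordering, call sorted on the resulting Seq if a stable element order matters to you.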
Answer 1 (score: 2)
As the error message shows, the Set type is not supported by Spark's SQL/DataFrame API, at least in the Spark version that produced this error:
java.lang.UnsupportedOperationException:
No Encoder found for scala.collection.immutable.Set[java.lang.String]
The Spark SQL/DataFrame documentation lists the supported data types. That said, you can still use a Set inside a UDF if you need one.
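The remark about using a Set inside a UDF can be sketched as follows. This is only an illustration under stated assumptions: the reference set allowed, the UDF name withinAllowed, and the output column within_allowed are hypothetical, and the UDF is written against Seq[String] because that is how an array column is normally handed to a Scala UDF.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

// A local session for illustration (an assumption, not part of the answer).
val spark = SparkSession.builder().master("local[*]").appName("set-in-udf-sketch").getOrCreate()
import spark.implicits._

// The column itself is stored as an array (Seq), which Spark can encode.
val df = Seq(
  (Seq("A", "D"), 0.0), (Seq("C"), 0.0), (Seq("D"), 1.0), (Seq("B", "A"), 0.5)
).toDF("channel_set", "rate")

// Hypothetical reference set, chosen only for this illustration.
val allowed = Set("A", "D")

// The array column reaches the UDF as a Seq; it is turned into a Set inside
// the function, so Set operations such as subsetOf are available there.
val withinAllowed = udf((channels: Seq[String]) => channels.toSet.subsetOf(allowed))

val flagged = df.withColumn("within_allowed", withinAllowed(col("channel_set")))
flagged.show(false)
```

The Set never appears in the DataFrame schema here; it exists only inside the function body, and the UDF returns a Boolean, which is a supported type.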
When creating a DataFrame, Spark handles Seq, List, and Array in a similar way. If you run printSchema and show on the following three DataFrames, you will find that they are identical.
sc.parallelize(Array(
(Array("A","D"),0.0) , (Array("C"),0.0), (Array("D"),1.0), (Array("B","A"),0.5)
)).toDF("channel_set", "rate")
sc.parallelize(List(
(List("A","D"),0.0) , (List("C"),0.0), (List("D"),1.0), (List("B","A"),0.5)
)).toDF("channel_set", "rate")
sc.parallelize(Seq(
(Seq("A","D"),0.0) , (Seq("C"),0.0), (Seq("D"),1.0), (Seq("B","A"),0.5)
)).toDF("channel_set", "rate")
// res.printSchema
// root
// |-- channel_set: array (nullable = true)
// | |-- element: string (containsNull = true)
// |-- rate: double (nullable = false)
// res.show
// +-----------+----+
// |channel_set|rate|
// +-----------+----+
// | [A, D]| 0.0|
// | [C]| 0.0|
// | [D]| 1.0|
// | [B, A]| 0.5|
// +-----------+----+
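Instead of comparing the printSchema output by eye, the schemas can also be compared programmatically. The following is a minimal sketch, assuming a local SparkSession; the names fromArray, fromList, fromSeq, and sameSchema are chosen only for this illustration.

```scala
import org.apache.spark.sql.SparkSession

// A local session for illustration (an assumption, not part of the answer).
val spark = SparkSession.builder().master("local[*]").appName("schema-compare-sketch").getOrCreate()
import spark.implicits._
val sc = spark.sparkContext

// The same rows built from Array, List, and Seq, as in the answer above.
val fromArray = sc.parallelize(Array(
  (Array("A", "D"), 0.0), (Array("C"), 0.0), (Array("D"), 1.0), (Array("B", "A"), 0.5)
)).toDF("channel_set", "rate")

val fromList = sc.parallelize(List(
  (List("A", "D"), 0.0), (List("C"), 0.0), (List("D"), 1.0), (List("B", "A"), 0.5)
)).toDF("channel_set", "rate")

val fromSeq = sc.parallelize(Seq(
  (Seq("A", "D"), 0.0), (Seq("C"), 0.0), (Seq("D"), 1.0), (Seq("B", "A"), 0.5)
)).toDF("channel_set", "rate")

// StructType is compared structurally: field names, data types, and nullability.
val sameSchema = fromArray.schema == fromList.schema && fromList.schema == fromSeq.schema
println(s"schemas identical: $sameSchema")
```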