When I try to do the following in Spark:
val replicas = 10
val dsReplicated = ds flatMap (a => 0 until replicas map ((a, _)))
I get the following exception:
java.lang.UnsupportedOperationException: No Encoder found for org.apache.spark.sql.Row
- field (class: "org.apache.spark.sql.Row", name: "_1")
- root class: "scala.Tuple2"
at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:625)
at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$10.apply(ScalaReflection.scala:619)
at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$10.apply(ScalaReflection.scala:607)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.immutable.List.flatMap(List.scala:344)
at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:607)
at org.apache.spark.sql.catalyst.ScalaReflection$.serializerFor(ScalaReflection.scala:438)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:71)
at org.apache.spark.sql.Encoders$.product(Encoders.scala:275)
at org.apache.spark.sql.LowPrioritySQLImplicits$class.newProductEncoder(SQLImplicits.scala:233)
at org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:33)
... 48 elided
I can achieve this with a Spark DataFrame using the explode function. I would like to achieve something similar using the Dataset API.
For reference, here is the code that replicates rows using the DataFrame API:
val dfReplicated = df.
withColumn("__temporarily__", typedLit((0 until replicas).toArray)).
withColumn("idx", explode($"__temporarily__")).
drop($"__temporarily__")
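One way to make the original flatMap compile is to supply an explicit Encoder for the (Row, Int) tuple, for example by combining RowEncoder with a primitive encoder. This is a sketch only, not tested against a running Spark session; note that RowEncoder lives in an internal catalyst package, so its location may vary across Spark versions:

```scala
// Sketch: assumes an existing Dataset[Row] named `ds` from a live SparkSession.
import org.apache.spark.sql.{Encoder, Encoders, Row}
import org.apache.spark.sql.catalyst.encoders.RowEncoder

val replicas = 10

// Build an encoder for (Row, Int) from the Dataset's own schema,
// since Spark cannot derive an Encoder for Row on its own.
implicit val tupleEnc: Encoder[(Row, Int)] =
  Encoders.tuple(RowEncoder(ds.schema), Encoders.scalaInt)

val dsReplicated = ds.flatMap(a => (0 until replicas).map(i => (a, i)))
```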
Answer (score: 1)
Here is one way to do it. The exception above is thrown because the element type contains org.apache.spark.sql.Row, for which Spark cannot derive an Encoder; working with a Dataset of case classes avoids the problem:
case class Zip(zipcode: String)
case class Person(id: Int, name: String, zipcode: List[Zip])
data: org.apache.spark.sql.Dataset[Person]
data.show()
+---+----+--------------+
| id|name| zipcode|
+---+----+--------------+
| 1| AAA|[[MVP], [RB2]]|
| 2| BBB|[[KFG], [YYU]]|
| 3| CCC|[[JJJ], [7IH]]|
+---+----+--------------+
data.printSchema
root
|-- id: integer (nullable = false)
|-- name: string (nullable = true)
|-- zipcode: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- zipcode: string (nullable = true)
val df = data.withColumn("ArrayCol",explode($"zipcode"))
df.select($"id",$"name",$"ArrayCol.zipcode").show()
Output:
+---+----+-------+
| id|name|zipcode|
+---+----+-------+
| 1| AAA| MVP|
| 1| AAA| RB2|
| 2| BBB| KFG|
| 2| BBB| YYU|
| 3| CCC| JJJ|
| 3| CCC| 7IH|
+---+----+-------+
Using the Dataset API:
val resultDS = data.flatMap(x => x.zipcode.map(y => (x.id,x.name,y.zipcode)))
resultDS.show(false)
//resultDS:org.apache.spark.sql.Dataset[(Int, String, String)] =
// [_1: integer, _2: string ... 1 more fields]
//+---+---+---+
//|_1 |_2 |_3 |
//+---+---+---+
//|1 |AAA|MVP|
//|1 |AAA|RB2|
//|2 |BBB|KFG|
//|2 |BBB|YYU|
//|3 |CCC|JJJ|
//|3 |CCC|7IH|
//+---+---+---+
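The same flatMap pattern also answers the original row-replication question: once the data lives in a Dataset of a case class rather than Row, the tuple encoder is derived automatically. A sketch, assuming the Person case class defined above and an implicit SparkSession in scope:

```scala
// Sketch: replicate each Person `replicas` times, tagging each copy with an index.
// No explicit Encoder is needed: Spark derives one for tuples of
// case classes and primitives via the implicit product encoder.
import org.apache.spark.sql.Dataset

val replicas = 10
val replicated: Dataset[(Person, Int)] =
  data.flatMap(p => (0 until replicas).map(i => (p, i)))
```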