Replicating rows in a Spark Dataset N times

Posted: 2019-02-27 17:55:34

Tags: scala apache-spark

When I try to do the following in Spark:

val replicas = 10
val dsReplicated = ds flatMap (a => 0 until replicas map ((a, _)))

I get the following exception:

java.lang.UnsupportedOperationException: No Encoder found for org.apache.spark.sql.Row
- field (class: "org.apache.spark.sql.Row", name: "_1")
- root class: "scala.Tuple2"
  at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:625)
  at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$10.apply(ScalaReflection.scala:619)
  at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$10.apply(ScalaReflection.scala:607)
  at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
  at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
  at scala.collection.immutable.List.flatMap(List.scala:344)
  at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:607)
  at org.apache.spark.sql.catalyst.ScalaReflection$.serializerFor(ScalaReflection.scala:438)
  at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:71)
  at org.apache.spark.sql.Encoders$.product(Encoders.scala:275)
  at org.apache.spark.sql.LowPrioritySQLImplicits$class.newProductEncoder(SQLImplicits.scala:233)
  at org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:33)
  ... 48 elided
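
The trace shows that Spark is trying to encode a Tuple2 whose first field is a raw org.apache.spark.sql.Row, which suggests ds here is an untyped Dataset[Row] (a DataFrame): Spark can derive encoders for case classes and tuples of supported types, but not for tuples containing Row. A minimal sketch of one workaround, assuming a SparkSession named spark and a hypothetical case class Record matching the frame's columns:

import spark.implicits._   // assumes a SparkSession named spark is in scope

case class Record(id: Int, name: String)   // hypothetical; match it to ds's actual columns

val replicas = 10
// Converting to a typed Dataset first gives Spark a product encoder,
// so flatMap can emit (Record, Int) tuples without the Row encoder error.
val dsReplicated = ds.as[Record].flatMap(a => (0 until replicas).map(i => (a, i)))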

I can achieve this with a Spark DataFrame using the explode function. I would like to achieve something similar using a Dataset.

For reference, here is the code that replicates rows using the DataFrame API:

import org.apache.spark.sql.functions.{explode, typedLit}

val dfReplicated = df.
      withColumn("__temporarily__", typedLit((0 until replicas).toArray)).
      withColumn("idx", explode($"__temporarily__")).
      drop($"__temporarily__")

1 answer:

Answer 0 (score: 1):

Here is one way to do it:

case class Zip(zipcode: String)
case class Person(id: Int, name: String, zipcode: List[Zip])

// Build the sample Dataset[Person]; values match the output below
val data: org.apache.spark.sql.Dataset[Person] = Seq(
  Person(1, "AAA", List(Zip("MVP"), Zip("RB2"))),
  Person(2, "BBB", List(Zip("KFG"), Zip("YYU"))),
  Person(3, "CCC", List(Zip("JJJ"), Zip("7IH")))
).toDS()
data.show()

+---+----+--------------+
| id|name|       zipcode|
+---+----+--------------+
|  1| AAA|[[MVP], [RB2]]|
|  2| BBB|[[KFG], [YYU]]|
|  3| CCC|[[JJJ], [7IH]]|
+---+----+--------------+  

data.printSchema

root
 |-- id: integer (nullable = false)
 |-- name: string (nullable = true)
 |-- zipcode: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- zipcode: string (nullable = true)

import org.apache.spark.sql.functions.explode

// Explode the zipcode array: one output row per array element
val df = data.withColumn("ArrayCol", explode($"zipcode"))
df.select($"id", $"name", $"ArrayCol.zipcode").show()

Output:

+---+----+-------+
| id|name|zipcode|
+---+----+-------+
|  1| AAA|    MVP|
|  1| AAA|    RB2|
|  2| BBB|    KFG|
|  2| BBB|    YYU|
|  3| CCC|    JJJ|
|  3| CCC|    7IH|
+---+----+-------+

Using a Dataset:

// flatMap emits one (id, name, zipcode) tuple per element of each person's zipcode list
val resultDS = data.flatMap(x => x.zipcode.map(y => (x.id, x.name, y.zipcode)))
resultDS.show(false)

//resultDS:org.apache.spark.sql.Dataset[(Int, String, String)] = 
//  [_1: integer, _2: string ... 1 more fields] 

//+---+---+---+
//|_1 |_2 |_3 |
//+---+---+---+
//|1  |AAA|MVP|
//|1  |AAA|RB2|
//|2  |BBB|KFG|
//|2  |BBB|YYU|
//|3  |CCC|JJJ|
//|3  |CCC|7IH|
//+---+---+---+
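
The tuple column names (_1, _2, _3) can be replaced with meaningful ones, either by renaming or by staying typed; a short sketch, where FlatPerson is a hypothetical helper type:

case class FlatPerson(id: Int, name: String, zipcode: String)

// Option 1: rename the tuple columns on the way to a DataFrame.
val named = resultDS.toDF("id", "name", "zipcode")

// Option 2: stay typed by mapping each tuple into a case class.
val typedResult = resultDS.map { case (id, name, zip) => FlatPerson(id, name, zip) }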