How to transform a DataFrame so that the structure becomes the column names and their values?

Date: 2017-06-27 16:36:41

Tags: scala apache-spark apache-spark-sql

I'm playing with Spark in Scala. I have this structure:

case class MovieRatings(movieName: String, rating: Double)
case class MovieCritics(name: String, movieRatings: List[MovieRatings])

The first class holds a movie and the rating given by some critic. It might look like this:

MovieRatings("Logan", 1.5)

The second class takes a critic's name and the list of movies he has rated. With this I end up with a list of MovieCritics, where each element of the list has a name and a list of MovieRatings. So far so good. Now I want to transform this List into a Spark DataFrame to display the data in a more user-friendly way, like this:

Critic | Logan | Zoolander | John Wick | ...
Manuel    1.5       3            2.5
John      2         3.5          3
...

The first column holds the movie critic, and the following columns represent the movies and the respective rating given by that critic. My question is how to transform

List(MovieCritics(name: String, movieRatings: List[MovieRatings]))

into that view.

2 answers:

Answer 0 (score: 1)

How to transform List(MovieCritics(name: String, movieRatings: List[MovieRatings]))

It's as simple as calling toDS on the List. That is only available when the implicits of a SparkSession are in scope, as follows:

val sparkSession = SparkSession.builder.getOrCreate()
import sparkSession.implicits._

If you use scala.collection.immutable.Iterable[MovieCritics] or a similar collection data structure, you have to "map" it using toSeq or toArray before toDS, to "escape" from the Iterable. The implicits are not available for Iterables.
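To illustrate, here is a minimal sketch of that conversion using the question's own case class. The `println` at the end stands in for the `toDS` call, which would additionally need a SparkSession and its implicits in scope:

```scala
// The same case class as in the question.
case class MovieRatings(movieName: String, rating: Double)

// A plain Iterable has no toDS even with the implicits imported,
// so convert it to a Seq (or Array) first.
val ratingsIterable: Iterable[MovieRatings] =
  Iterable(MovieRatings("Logan", 1.5), MovieRatings("Zoolander", 3.0))
val ratingsSeq: Seq[MovieRatings] = ratingsIterable.toSeq

// ratingsSeq.toDS now works once sparkSession.implicits._ is imported.
println(ratingsSeq.map(_.movieName).mkString(","))
```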

Given the list as critics, you have to do the following:

critics.toDS

Now I want to transform this List into a spark dataFrame to display the data in a more user friendly way.

That's the most interesting part of your question (and it took me a couple of hours to finally understand it and write a solution). I'd appreciate comments that would make it prettier.

case class MovieRatings(movieName: String, rating: Double)
case class MovieCritics(name: String, movieRatings: Seq[MovieRatings])
val movies_critics = Seq(
  MovieCritics("Manuel", Seq(MovieRatings("Logan", 1.5), MovieRatings("Zoolander", 3), MovieRatings("John Wick", 2.5))),
  MovieCritics("John", Seq(MovieRatings("Logan", 2), MovieRatings("Zoolander", 3.5), MovieRatings("John Wick", 3))))

With the input dataset set up, here comes the solution.

val ratings = movies_critics.toDF
scala> ratings.show(false)
+------+-----------------------------------------------+
|name  |movieRatings                                   |
+------+-----------------------------------------------+
|Manuel|[[Logan,1.5], [Zoolander,3.0], [John Wick,2.5]]|
|John  |[[Logan,2.0], [Zoolander,3.5], [John Wick,3.0]]|
+------+-----------------------------------------------+

val ratingsCount = ratings.
  withColumn("size", size($"movieRatings")).
  select(max("size")).
  as[Int].
  head

val names_ratings = (0 until ratingsCount).
  foldLeft(ratings) { case (ds, counter) => ds.
    withColumn(s"name_$counter", $"movieRatings"(counter)("movieName")).
    withColumn(s"rating_$counter", $"movieRatings"(counter)("rating")) }
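The foldLeft above threads the DataFrame through the loop as an accumulator, appending one pair of columns per rating index. The same accumulation pattern can be seen in plain Scala (a toy example with a String accumulator, not part of the answer's code):

```scala
// foldLeft starts from an initial accumulator ("cols") and folds each
// index into it -- here by appending a name_i/rating_i pair, just as the
// answer appends a pair of columns per index.
val result = (0 until 3).foldLeft("cols") { case (acc, i) =>
  s"$acc,name_$i,rating_$i"
}
println(result)  // cols,name_0,rating_0,name_1,rating_1,name_2,rating_2
```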

val movieColumns = names_ratings.
  columns.
  drop(1).
  filter(name => name.startsWith("name")).
  map(col)
val movieNames = names_ratings.select(movieColumns: _*).head.toSeq.map(_.toString)
val ratingNames = movieNames.indices.map(idx => s"rating_$idx")
val cols = movieNames.zip(ratingNames).map { case (movie, rn) =>
  col(rn) as movie
}
val solution = names_ratings.select(($"name" +: cols): _*)
scala> solution.show
+------+-----+---------+---------+
|  name|Logan|Zoolander|John Wick|
+------+-----+---------+---------+
|Manuel|  1.5|      3.0|      2.5|
|  John|  2.0|      3.5|      3.0|
+------+-----+---------+---------+
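As an aside (not part of the answer above), the same reshaping can also be sketched with explode and pivot, which avoids counting columns with foldLeft entirely. This is an untested sketch assuming the same ratings DataFrame and sparkSession.implicits._ in scope:

```scala
import org.apache.spark.sql.functions._

// One row per (critic, movie, rating), then movie names become columns.
val pivoted = ratings.
  select($"name", explode($"movieRatings") as "r").
  select($"name", $"r.movieName" as "movie", $"r.rating" as "rating").
  groupBy("name").
  pivot("movie").
  agg(first("rating"))
```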

Answer 1 (score: 1)

If you have the movie data as

val movieCritics = List(
  MovieCritics("Manual", List(MovieRatings("Logan", 1.5), MovieRatings("Zoolander", 3), MovieRatings("John Wick", 2.5))),
  MovieCritics("John", List(MovieRatings("Logan", 2), MovieRatings("Zoolander", 3.5), MovieRatings("John Wick", 3)))
)

you can simply call toDF to create a dataframe:
import sqlContext.implicits._
val df = movieCritics.toDF

which should give

+------+-----------------------------------------------+
|name  |movieRatings                                   |
+------+-----------------------------------------------+
|Manual|[[Logan,1.5], [Zoolander,3.0], [John Wick,2.5]]|
|John  |[[Logan,2.0], [Zoolander,3.5], [John Wick,3.0]]|
+------+-----------------------------------------------+

Now a simple select on the dataframe can get the output you need:

import org.apache.spark.sql.functions._
df.select(
  col("name"),
  col("movieRatings")(0)("rating").as("Logan"),
  col("movieRatings")(1)("rating").as("Zoolander"),
  col("movieRatings")(2)("rating").as("John Wick")
).show(false)

This results in the final dataframe

+------+-----+---------+---------+
|name  |Logan|Zoolander|John Wick|
+------+-----+---------+---------+
|Manual|1.5  |3.0      |2.5      |
|John  |2.0  |3.5      |3.0      |
+------+-----+---------+---------+