我有以下结构
json.select($"comments").printSchema
root
|-- comments: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- comment: struct (nullable = true)
| | | |-- date: string (nullable = true)
| | | |-- score: string (nullable = true)
| | | |-- shouts: array (nullable = true)
| | | | |-- element: string (containsNull = true)
| | | |-- tags: array (nullable = true)
| | | | |-- element: string (containsNull = true)
| | | |-- text: string (nullable = true)
| | | |-- username: string (nullable = true)
| | |-- subcomments: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- date: string (nullable = true)
| | | | |-- score: string (nullable = true)
| | | | |-- shouts: array (nullable = true)
| | | | | |-- element: string (containsNull = true)
| | | | |-- tags: array (nullable = true)
| | | | | |-- element: string (containsNull = true)
| | | | |-- text: string (nullable = true)
| | | | |-- username: string (nullable = true)
我想获得评论的数组/列表[用户名,分数,文本]。通常,在pyspark我会做这样的事情
comments = json
.select("comments")
.flatMap(lambda element:
map(lambda comment:
Row(username = comment.username,
score = comment.score,
text = comment.text),
element[0])
.toDF()
但是,当我在scala中尝试相同的方法时
json.select($"comments").rdd.map{row: Row => row(0)}.take(3)
我有一些奇怪的输出
Array[Any] =
Array(
WrappedArray([[stirng,string,WrappedArray(),WrappedArray(),,string] ...], ...)
有没有办法在scala中执行该任务就像使用python一样简单?
另外,如何像数组/列表一样迭代WrappedArray,我有这样的错误
rror: scala.collection.mutable.WrappedArray.type does not take parameters
答案 0 :(得分:2)
如何使用静态类型Dataset
呢?
case class Comment(
date: String, score: String,
shouts: Seq[String], tags: Seq[String],
text: String, username: String
)
df
.select(explode($"comments.comment").alias("comment"))
.select("comment.*")
.as[Comment]
.map(c => (c.username, c.score, c.date))
如果您不依赖于REPL,可以进一步简化:
df
.select("comments.comment")
.as[Seq[Comment]]
.flatMap(_.map(c => (c.username, c.score, c.text)))
如果你真的想要处理Rows
使用类型化的getter:
df.rdd.flatMap(
_.getAs[SR]("comments")
.map(_.getAs[Row]("comment"))
.map {
// You could also _.getAs[String]("score") or getString(0)
case Row(_, score: String, _, _, text: String, username: String) =>
(username, score, text)
}
)