I have code that transforms DataFrame rows, but I'm having trouble getting the output into an array.
Input: file.txt
+-------------------------------+--------------------+-------+
|id |var |score |
+-------------------------------+--------------------+-------+
|12345 |A |8 |
|12345 |B |9 |
|12345 |C |7 |
|12345 |D |6 |
+-------------------------------+--------------------+-------+
Output:
{"id":"12345","props":[{"var":"A","score":"8"},{"var":"B","score":"9"},{"var":"C","score":"7"},{"var":"D","score":"6"}]}
I tried using collect_list, without success. My code is in Scala:
import org.apache.spark.sql.functions.{col, collect_list, udf}

val sc = new SparkContext(conf)
val sqlContext = new HiveContext(sc)

val df = sqlContext.read.json("file.txt")
val dfCol = df.select(df("id"), df("var"), df("score"))
dfCol.show(false)

// `var` is a Scala keyword, so the parameter name needs backticks
val merge = udf { (`var`: String, score: Double) =>
  `var` + "," + score
}

val grouped = dfCol
  .groupBy(col("id"))
  .agg(collect_list(merge(col("var"), col("score"))).alias("props"))
grouped.show(false)
My question is: how do I turn these data rows into the JSON array output shown above?
Thanks.
Answer 0 (score: 0)
Oh, my question has already been answered:
import scala.collection.mutable.ListBuffer
import org.apache.spark.sql.functions.{col, collect_list, concat_ws, udf}

// `var` is a Scala keyword, so the field name needs backticks;
// define the case classes at the top level so Spark can reflect on them
case class Props(`var`: String, score: Double)
case class PropsArray(id: String, props: Seq[Props])

val sc = new SparkContext(conf)
val sqlContext = new HiveContext(sc)
import sqlContext.implicits._ // needed for .toDF() on the mapped RDD

val df = sqlContext.read.json("file.txt")
val dfCol = df.select(df("id"), df("var"), df("score"))

val merge = udf { (`var`: String, score: Double) =>
  `var` + "," + score
}

// collect_list on Spark 1.6 only handles simple column types, so first
// collapse each group into one pipe-delimited string...
val grouped = dfCol
  .groupBy(col("id"))
  .agg(concat_ws("|", collect_list(merge(col("var"), col("score")))).alias("props"))

// ...then split the string back apart and rebuild the nested structure
val merging = grouped.map { x =>
  val list: ListBuffer[Props] = ListBuffer()
  val data = x.getAs[String]("props").split("\\|")
  data.foreach { entry =>
    val arr = entry.split(",")
    try {
      list += Props(arr(0), arr(1).toDouble) // append preserves the input order
    } catch {
      case t: Throwable => t.getMessage // skip malformed entries
    }
  }
  PropsArray(x.getAs[String]("id"), list.toSeq)
}.toDF()
You can run
merging.show(false)
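To get the actual JSON lines shown in the expected output, the resulting DataFrame still has to be serialized. A small usage sketch (on Spark 1.6, toJSON returns an RDD[String]; the "output" path is just illustrative):

// Print each grouped row as a JSON line, roughly:
// {"id":"12345","props":[{"var":"A","score":8.0},...]}
// (score serializes as a number here, since Props declares it as Double)
merging.toJSON.collect().foreach(println)

// or write the JSON lines to a directory
merging.write.json("output")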
and you must add this dependency to your pom.xml:
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <version>1.6.0</version>
  <exclusions>
    <exclusion>
      <artifactId>kryo</artifactId>
      <groupId>com.esotericsoftware.kryo</groupId>
    </exclusion>
  </exclusions>
</dependency>
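As an aside, the whole string round-trip is only needed because collect_list on Spark 1.6 cannot aggregate struct columns. On Spark 2.0+ the same result takes a few lines; a minimal sketch, assuming a SparkSession named spark:

import org.apache.spark.sql.functions.{col, collect_list, struct}

// Group the (var, score) pairs directly into an array of structs
val grouped2 = spark.read.json("file.txt")
  .groupBy(col("id"))
  .agg(collect_list(struct(col("var"), col("score"))).alias("props"))

// Each row already serializes to the target shape:
// {"id":"12345","props":[{"var":"A","score":"8"},...]}
grouped2.toJSON.show(false)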
Thanks.