How to convert DataFrame rows into a JSON array output with Spark

Asked: 2016-11-11 10:10:13

Tags: scala apache-spark apache-spark-sql spark-dataframe rdd

I have code that transforms DataFrame rows, but I am having trouble producing the output as an array.

Input: file.txt

+-------------------------------+--------------------+-------+
|id                             |var                 |score  |
+-------------------------------+--------------------+-------+
|12345                          |A                   |8      |
|12345                          |B                   |9      |
|12345                          |C                   |7      |
|12345                          |D                   |6      |
+-------------------------------+--------------------+-------+

Output:

{"id":"12345","props":[{"var":"A","score":"8"},{"var":"B","score":"9"},{"var":"C","score":"7"},{"var":"D","score":"6"}]}

I tried using collect_list without success. My code is in Scala:

import org.apache.spark.sql.functions._

val sc = new SparkContext(conf)
val sqlContext = new HiveContext(sc)

val df = sqlContext.read.json("file.txt")
val dfCol = df.select(
    df("id"),
    df("var"),
    df("score"))
dfCol.show(false)

// `var` is a reserved word in Scala, so it must be escaped with backticks
val merge = udf { (`var`: String, score: Double) =>
  `var` + "," + score
}

val grouped = dfCol.groupBy(col("id"))
  .agg(collect_list(merge(col("var"), col("score"))).alias("props"))
grouped.show(false)

My question is: how can these rows be transformed into the JSON array output shown above?

Thanks.

1 Answer:

Answer 0: (score: 0)

Oh, my question has already been answered:

import scala.collection.mutable.ListBuffer
import org.apache.spark.sql.functions._

// `var` is a reserved word in Scala, so it must be escaped with backticks
case class Props(`var`: String, score: Double)
case class PropsArray(id: String, props: Seq[Props])

val sc = new SparkContext(conf)
val sqlContext = new HiveContext(sc)
import sqlContext.implicits._ // needed for .toDF() on an RDD of case classes

val df = sqlContext.read.json("file.txt")
val dfCol = df.select(
  df("id"),
  df("var"),
  df("score"))

// Merge the two columns into a single "var,score" string
val merge = udf { (`var`: String, score: Double) =>
  `var` + "," + score
}

// Collect all "var,score" pairs per id into one pipe-delimited string
val grouped = dfCol.groupBy(col("id"))
  .agg(concat_ws("|", collect_list(merge(col("var"), col("score")))).alias("props"))

// Split the pipe-delimited string back into typed Props objects
val merging = grouped.map { x =>
  val list: ListBuffer[Props] = ListBuffer()
  val data = x.getAs[String]("props").split("\\|")

  data.foreach { pair =>
    val arr = pair.split(",")
    try {
      list += Props(arr(0), arr(1).toDouble) // append keeps the collected order
    } catch {
      case t: Throwable => t.getMessage // malformed pairs are silently skipped
    }
  }

  PropsArray(x.getAs[String]("id"), list.toSeq)
}.toDF()

You can then run

merging.show(false)
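To get the JSON documents themselves rather than a tabular display, one option is DataFrame.toJSON, which in Spark 1.6 returns an RDD[String] with one JSON document per row; the output path below is just an illustration:

    // Each row becomes one JSON string, roughly of the form
    // {"id":"12345","props":[{"var":"A","score":8.0}, ...]}
    merging.toJSON.collect().foreach(println)

    // or persist the result as JSON files (path is illustrative)
    merging.write.json("output/props")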

and you must add this dependency to your pom.xml:

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>1.6.0</version>
    <exclusions>
        <exclusion>
            <artifactId>kryo</artifactId>
            <groupId>com.esotericsoftware.kryo</groupId>
        </exclusion>
    </exclusions>
</dependency>
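As a side note: on Spark 2.0 or later (newer than the 1.6.0 used above), collect_list accepts struct columns directly, so the concatenate-and-split round trip can be avoided entirely. A minimal sketch, reusing the df from the question:

    import org.apache.spark.sql.functions.{col, collect_list, struct}

    // Group the var/score pairs into an array of structs per id;
    // toJSON then renders each row as the desired nested document.
    val nested = df.groupBy(col("id"))
      .agg(collect_list(struct(col("var"), col("score"))).alias("props"))

    nested.toJSON.collect().foreach(println)
    // {"id":"12345","props":[{"var":"A","score":8},{"var":"B","score":9},...]}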

Thanks.