Converting multiple columns into a single column in a DataFrame

Time: 2019-11-25 11:42:31

Tags: scala apache-spark apache-spark-sql

I have a situation where I have to transform data spread across different columns so that it is displayed in a single column.

Here is the available data:

+-----------------------+----------+-----------------------+------+
|BaseTime               |SGNL_NAME |SGNL_TIME              |SGNL_V|
+-----------------------+----------+-----------------------+------+
|2019-11-21 18:19:15.817|Acc       |2019-11-21 18:18:16.645|0.0   |
|2019-11-21 18:19:15.817|Acc       |2019-11-21 18:18:16.645|0.0   |
|2019-11-21 18:19:15.817|Acc       |2019-11-21 18:18:16.645|0.0   |
|2019-11-21 18:19:15.817|Acc       |2019-11-21 18:18:17.645|0.0   |
|2019-11-21 18:19:15.817|Acc       |2019-11-21 18:18:17.645|0.0   |
+-----------------------+----------+-----------------------+------+

The expected output is shown below: a new column is created in which the combination of NAME, TIME and V forms the array elements.

"SGNL": [
        {
            "SGNL_NAME ": "Acc       ",
            "SGNL_TIME ": 1574128316834,
            "SGNL_V": 0.0
        }
       ]


+-----------------------+-----------------------------------------------------------------+
|BaseTime               |SGNL                                                             |
+-----------------------+-----------------------------------------------------------------+
|2019-11-21 18:19:15.817|[{"SGNL_NAME": "Acc" ,"SGNL_TIME": 1574128316834,"SGNL_V": 0.0}]|
|2019-11-21 18:19:15.817|[{"SGNL_NAME": "Acc" ,"SGNL_TIME": 1574128316834,"SGNL_V": 0.0}]|
|2019-11-21 18:19:15.817|[{"SGNL_NAME": "Acc" ,"SGNL_TIME": 1574128316834,"SGNL_V": 0.0}]|
|2019-11-21 18:19:15.817|[{"SGNL_NAME": "Acc" ,"SGNL_TIME": 1574128316834,"SGNL_V": 0.0}]|
|2019-11-21 18:19:15.817|[{"SGNL_NAME": "Acc" ,"SGNL_TIME": 1574128316834,"SGNL_V": 0.0}]|
+-----------------------+-----------------------------------------------------------------+

The input schema looks like this:

root
 |-- BaseTime: timestamp (nullable = true)
 |-- SGNL_NAME: string (nullable = true)
 |-- SGNL_TIME: timestamp (nullable = true)
 |-- SGNL_V: string (nullable = true)

I am trying to write a UDF to merge the rows. Is there any other solution available?
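For reference, a minimal sketch of the kind of UDF I am attempting (hypothetical code; the column names and types follow the schema above, and the epoch milliseconds come from java.sql.Timestamp.getTime):

import org.apache.spark.sql.functions.udf

// Hypothetical sketch: hand-build the JSON array string from the three columns.
val toSgnlJson = udf((name: String, time: java.sql.Timestamp, v: String) =>
  s"""[{"SGNL_NAME":"$name","SGNL_TIME":${time.getTime},"SGNL_V":$v}]""")

val merged = df
  .withColumn("SGNL", toSgnlJson($"SGNL_NAME", $"SGNL_TIME", $"SGNL_V"))
  .drop("SGNL_NAME", "SGNL_TIME", "SGNL_V")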

3 Answers:

Answer 0 (score: 1)

You can use to_json to convert multiple columns into a JSON string, as shown below:

val df = sc.parallelize(Seq(
  (32.0, 31.0, 14.0), (3.6, 2.8, 0.0), (4.5, 5.0, -1.2)
)).toDF


scala> df.show(10)
+----+----+----+                                                                
|  _1|  _2|  _3|
+----+----+----+
|32.0|31.0|14.0|
| 3.6| 2.8| 0.0|
| 4.5| 5.0|-1.2|
+----+----+----+

scala> df.select(to_json(struct($"_1", $"_2", $"_3"))).show(10)
+------------------------------------------------------------------------------------------------+
|structstojson(named_struct(NamePlaceholder(), _1, NamePlaceholder(), _2, NamePlaceholder(), _3))|
+------------------------------------------------------------------------------------------------+
|                                                                            {"_1":32.0,"_2":3...|
|                                                                            {"_1":3.6,"_2":2....|
|                                                                            {"_1":4.5,"_2":5....|
+------------------------------------------------------------------------------------------------+
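
The auto-generated column name above is unwieldy; aliasing the JSON column (named "final" here, which the from_json() snippet below assumes) keeps it readable:

import org.apache.spark.sql.functions.{struct, to_json}

// Keep the original columns and add the serialized JSON as "final".
val new_df = df.withColumn("final", to_json(struct($"_1", $"_2", $"_3")))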

To parse the JSON back into typed columns, from_json() takes the string column plus a schema. Note that the decimal precision must fit the data: createDecimalType(2, 1) holds at most 9.9, so values like 32.0 need createDecimalType(3, 1):

import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{DataTypes, StructField, StructType}

val decType = DataTypes.createDecimalType(3, 1)  // (2, 1) would overflow on 32.0 and yield null
val schema = StructType(Seq("_1", "_2", "_3").map(StructField(_, decType, true)))

new_df.withColumn("final_array", from_json($"final", schema)).show(10)

Hope this is useful.

Answer 1 (score: 1)

scala> df.show(false)
+-----------------------+---------+-----------------------+------+
|BaseTime               |SGNL_NAME|SGNL_TIME              |SGNL_V|
+-----------------------+---------+-----------------------+------+
|2019-11-21 18:19:15.817|Acc      |2019-11-21 18:18:16.645|0.0   |
|2019-11-21 18:19:15.817|Acc      |2019-11-21 18:18:16.645|0.0   |
|2019-11-21 18:19:15.817|Acc      |2019-11-21 18:18:16.645|0.0   |
|2019-11-21 18:19:15.817|Acc      |2019-11-21 18:18:17.645|0.0   |
|2019-11-21 18:19:15.817|Acc      |2019-11-21 18:18:17.645|0.0   |
+-----------------------+---------+-----------------------+------+


Serialize each column to JSON and strip the surrounding braces, leaving bare "key":"value" fragments:

scala> val df1 = df
         .withColumn("SGNL_NAME", regexp_replace(regexp_replace(to_json(struct("SGNL_NAME")), "\\{", ""), "\\}", ""))
         .withColumn("SGNL_TIME", regexp_replace(regexp_replace(to_json(struct("SGNL_TIME")), "\\{", ""), "\\}", ""))
         .withColumn("SGNL_V", regexp_replace(regexp_replace(to_json(struct("SGNL_V")), "\\{", ""), "\\}", ""))


scala> df1.show(false)
+-----------------------+-----------------+-------------------------------------+--------------+
|BaseTime               |SGNL_NAME        |SGNL_TIME                            |SGNL_V        |
+-----------------------+-----------------+-------------------------------------+--------------+
|2019-11-21 18:19:15.817|"SGNL_NAME":"Acc"|"SGNL_TIME":"2019-11-21 18:18:16.645"|"SGNL_V":"0.0"|
|2019-11-21 18:19:15.817|"SGNL_NAME":"Acc"|"SGNL_TIME":"2019-11-21 18:18:16.645"|"SGNL_V":"0.0"|
|2019-11-21 18:19:15.817|"SGNL_NAME":"Acc"|"SGNL_TIME":"2019-11-21 18:18:16.645"|"SGNL_V":"0.0"|
|2019-11-21 18:19:15.817|"SGNL_NAME":"Acc"|"SGNL_TIME":"2019-11-21 18:18:17.645"|"SGNL_V":"0.0"|
|2019-11-21 18:19:15.817|"SGNL_NAME":"Acc"|"SGNL_TIME":"2019-11-21 18:18:17.645"|"SGNL_V":"0.0"|
+-----------------------+-----------------+-------------------------------------+--------------+


scala> val df2 = df1.withColumn("SGNL", struct("SGNL_NAME", "SGNL_TIME", "SGNL_V"))
                     .drop("SGNL_NAME","SGNL_TIME","SGNL_V")

scala> df2.show(false)
+-----------------------+--------------------------------------------------------------------------+
|BaseTime               |SGNL                                                                      |
+-----------------------+--------------------------------------------------------------------------+
|2019-11-21 18:19:15.817|["SGNL_NAME":"Acc", "SGNL_TIME":"2019-11-21 18:18:16.645", "SGNL_V":"0.0"]|
|2019-11-21 18:19:15.817|["SGNL_NAME":"Acc", "SGNL_TIME":"2019-11-21 18:18:16.645", "SGNL_V":"0.0"]|
|2019-11-21 18:19:15.817|["SGNL_NAME":"Acc", "SGNL_TIME":"2019-11-21 18:18:16.645", "SGNL_V":"0.0"]|
|2019-11-21 18:19:15.817|["SGNL_NAME":"Acc", "SGNL_TIME":"2019-11-21 18:18:17.645", "SGNL_V":"0.0"]|
|2019-11-21 18:19:15.817|["SGNL_NAME":"Acc", "SGNL_TIME":"2019-11-21 18:18:17.645", "SGNL_V":"0.0"]|
+-----------------------+--------------------------------------------------------------------------+


scala> df2.printSchema
root
 |-- BaseTime: string (nullable = true)
 |-- SGNL: struct (nullable = false)
 |    |-- SGNL_NAME: string (nullable = true)
 |    |-- SGNL_TIME: string (nullable = true)
 |    |-- SGNL_V: string (nullable = true)

Answer 2 (score: 1)

An alternative to a UDF is to use the functions in the org.apache.spark.sql.functions package, such as to_json(), struct(), and array(). Here is a complete working example:

import org.apache.spark.sql.functions.{array, struct, to_json}

val df = sc.parallelize(Seq(
  ("2019-11-21 18:19:15.817", "Acc", "2019-11-21 18:18:16.645", 0.0)
)).toDF("BaseTime", "SGNL_NAME", "SGNL_TIME", "SGNL_V")

val result = df.withColumn("SGNL", to_json(
  array(
    struct("SGNL_NAME", "SGNL_TIME", "SGNL_V")
  )
)).drop("SGNL_NAME","SGNL_TIME","SGNL_V")

result.show(false) gives the expected result:

+-----------------------+------------------------------------------------------------------------+
|BaseTime               |SGNL                                                                    |
+-----------------------+------------------------------------------------------------------------+
|2019-11-21 18:19:15.817|[{"SGNL_NAME":"Acc","SGNL_TIME":"2019-11-21 18:18:16.645","SGNL_V":0.0}]|
+-----------------------+------------------------------------------------------------------------+
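
Two refinements, sketched under assumptions the answers do not state, may still be needed to match the expected output exactly: the expected JSON shows SGNL_TIME as epoch milliseconds rather than a timestamp string, and rows sharing a BaseTime may need to be collected into a single array. Both are possible without a UDF:

import org.apache.spark.sql.functions._

// Assumes SGNL_TIME is a genuine timestamp column, as in the question's schema.
// Casting to double yields fractional seconds, so * 1000 preserves the milliseconds.
val withMillis = df.withColumn("SGNL_TIME",
  (col("SGNL_TIME").cast("double") * 1000).cast("long"))

// Collect all signals sharing a BaseTime into one JSON array (one output row per BaseTime).
val grouped = withMillis
  .groupBy("BaseTime")
  .agg(to_json(collect_list(struct("SGNL_NAME", "SGNL_TIME", "SGNL_V"))).as("SGNL"))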