I have a situation where I need to combine data from several columns into a single column.
The available data is shown below.
+-----------------------+----------+-----------------------+------+
|BaseTime |SGNL_NAME |SGNL_TIME |SGNL_V|
+-----------------------+----------+-----------------------+------+
|2019-11-21 18:19:15.817|Acc |2019-11-21 18:18:16.645|0.0 |
|2019-11-21 18:19:15.817|Acc |2019-11-21 18:18:16.645|0.0 |
|2019-11-21 18:19:15.817|Acc |2019-11-21 18:18:16.645|0.0 |
|2019-11-21 18:19:15.817|Acc |2019-11-21 18:18:17.645|0.0 |
|2019-11-21 18:19:15.817|Acc |2019-11-21 18:18:17.645|0.0 |
+-----------------------+----------+-----------------------+------+
The expected output is shown below: a new column is created whose value is an array, with each element combining NAME, TIME, and V.
"SGNL": [
  {
    "SGNL_NAME": "Acc",
    "SGNL_TIME": 1574128316834,
    "SGNL_V": 0.0
  }
]
+-----------------------+-----------------------------------------------------------------+
|BaseTime |SGNL |
+-----------------------+-----------------------------------------------------------------+
|2019-11-21 18:19:15.817|[{"SGNL_NAME": "Acc" ,"SGNL_TIME": 1574128316834,"SGNL_V": 0.0}]|
|2019-11-21 18:19:15.817|[{"SGNL_NAME": "Acc" ,"SGNL_TIME": 1574128316834,"SGNL_V": 0.0}]|
|2019-11-21 18:19:15.817|[{"SGNL_NAME": "Acc" ,"SGNL_TIME": 1574128316834,"SGNL_V": 0.0}]|
|2019-11-21 18:19:15.817|[{"SGNL_NAME": "Acc" ,"SGNL_TIME": 1574128316834,"SGNL_V": 0.0}]|
|2019-11-21 18:19:15.817|[{"SGNL_NAME": "Acc" ,"SGNL_TIME": 1574128316834,"SGNL_V": 0.0}]|
+-----------------------+-----------------------------------------------------------------+
The input schema is as follows:
root
|-- BaseTime: timestamp (nullable = true)
|-- SGNL_NAME: string (nullable = true)
|-- SGNL_TIME: timestamp (nullable = true)
|-- SGNL_V: string (nullable = true)
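I am trying to write a UDF to merge the rows. Is there any other solution available?
Answer 0 (score: 1)
You can use to_json to convert multiple columns to JSON, as shown below: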
scala> val df = sc.parallelize(Seq(
     |   (32.0, 31.0, 14.0), (3.6, 2.8, 0.0), (4.5, 5.0, -1.2)
     | )).toDF
scala> df.show(10)
+----+----+----+
| _1| _2| _3|
+----+----+----+
|32.0|31.0|14.0|
| 3.6| 2.8| 0.0|
| 4.5| 5.0|-1.2|
+----+----+----+
scala> df.select(to_json(struct($"_1", $"_2", $"_3"))).show(10)
+------------------------------------------------------------------------------------------------+
|structstojson(named_struct(NamePlaceholder(), _1, NamePlaceholder(), _2, NamePlaceholder(), _3))|
+------------------------------------------------------------------------------------------------+
| {"_1":32.0,"_2":3...|
| {"_1":3.6,"_2":2....|
| {"_1":4.5,"_2":5....|
+------------------------------------------------------------------------------------------------+
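import org.apache.spark.sql.types._
// Assumed definition: new_df and its "final" column were not defined above, so the JSON column is aliased here.
val new_df = df.select(to_json(struct($"_1", $"_2", $"_3")).as("final"))
val decimalType = DataTypes.createDecimalType(3, 1) // precision 3, scale 1, so that 32.0 fits
val schema = StructType(Seq(StructField("_1", decimalType, true), StructField("_2", decimalType, true), StructField("_3", decimalType, true)))
new_df.withColumn("final_array", from_json($"final", schema)).show(10)
Hope this helps.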
Answer 1 (score: 1)
scala> df.show(false)
+----------------------+---------+----------------------+------+
|BaseTime |SGNL_NAME|SGNL_TIME |SGNL_V|
+----------------------+---------+----------------------+------+
|2019-11-2118:19:15.817|Acc |2019-11-2118:18:16.645|0.0 |
|2019-11-2118:19:15.817|Acc |2019-11-2118:18:16.645|0.0 |
|2019-11-2118:19:15.817|Acc |2019-11-2118:18:16.645|0.0 |
|2019-11-2118:19:15.817|Acc |2019-11-2118:18:17.645|0.0 |
|2019-11-2118:19:15.817|Acc |2019-11-2118:18:17.645|0.0 |
+----------------------+---------+----------------------+------+
scala> val df1 = df.withColumn("SGNL_NAME", regexp_replace(regexp_replace(to_json(struct("SGNL_NAME")), "\\{", ""),"\\}", ""))
.withColumn("SGNL_TIME", regexp_replace(regexp_replace(to_json(struct("SGNL_TIME")), "\\{", ""),"\\}", ""))
.withColumn("SGNL_V", regexp_replace(regexp_replace(to_json(struct("SGNL_V")), "\\{", ""),"\\}", ""))
scala> df1.show(false)
+----------------------+-----------------+------------------------------------+--------------+
|BaseTime |SGNL_NAME |SGNL_TIME |SGNL_V |
+----------------------+-----------------+------------------------------------+--------------+
|2019-11-2118:19:15.817|"SGNL_NAME":"Acc"|"SGNL_TIME":"2019-11-2118:18:16.645"|"SGNL_V":"0.0"|
|2019-11-2118:19:15.817|"SGNL_NAME":"Acc"|"SGNL_TIME":"2019-11-2118:18:16.645"|"SGNL_V":"0.0"|
|2019-11-2118:19:15.817|"SGNL_NAME":"Acc"|"SGNL_TIME":"2019-11-2118:18:16.645"|"SGNL_V":"0.0"|
|2019-11-2118:19:15.817|"SGNL_NAME":"Acc"|"SGNL_TIME":"2019-11-2118:18:17.645"|"SGNL_V":"0.0"|
|2019-11-2118:19:15.817|"SGNL_NAME":"Acc"|"SGNL_TIME":"2019-11-2118:18:17.645"|"SGNL_V":"0.0"|
+----------------------+-----------------+------------------------------------+--------------+
scala> val df2 = df1.withColumn("SGNL", struct("SGNL_NAME", "SGNL_TIME", "SGNL_V"))
.drop("SGNL_NAME","SGNL_TIME","SGNL_V")
scala> df2.show(false)
+----------------------+-------------------------------------------------------------------------+
|BaseTime |SGNL |
+----------------------+-------------------------------------------------------------------------+
|2019-11-2118:19:15.817|["SGNL_NAME":"Acc", "SGNL_TIME":"2019-11-2118:18:16.645", "SGNL_V":"0.0"]|
|2019-11-2118:19:15.817|["SGNL_NAME":"Acc", "SGNL_TIME":"2019-11-2118:18:16.645", "SGNL_V":"0.0"]|
|2019-11-2118:19:15.817|["SGNL_NAME":"Acc", "SGNL_TIME":"2019-11-2118:18:16.645", "SGNL_V":"0.0"]|
|2019-11-2118:19:15.817|["SGNL_NAME":"Acc", "SGNL_TIME":"2019-11-2118:18:17.645", "SGNL_V":"0.0"]|
|2019-11-2118:19:15.817|["SGNL_NAME":"Acc", "SGNL_TIME":"2019-11-2118:18:17.645", "SGNL_V":"0.0"]|
+----------------------+-------------------------------------------------------------------------+
scala> df2.printSchema
root
|-- BaseTime: string (nullable = true)
|-- SGNL: struct (nullable = false)
| |-- SGNL_NAME: string (nullable = true)
| |-- SGNL_TIME: string (nullable = true)
| |-- SGNL_V: string (nullable = true)
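The struct above still yields one SGNL value per input row. If "merging rows" in the question means collapsing all rows that share a BaseTime into a single array (an assumption, since the expected output in the question shows one row per input row), a built-in aggregation can do this without a UDF. The following is a minimal sketch: the object name MergeSignals and the helper mergeByBaseTime are hypothetical, and it assumes a Spark version in which to_json accepts an array of structs.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{collect_list, struct, to_json}

// Hypothetical helper object, not part of the original answers.
object MergeSignals {
  // Groups rows by BaseTime and collects the three signal columns into
  // one JSON array per group. The element order inside the array is not
  // guaranteed by collect_list.
  def mergeByBaseTime(df: DataFrame): DataFrame =
    df.groupBy("BaseTime")
      .agg(
        to_json(collect_list(struct("SGNL_NAME", "SGNL_TIME", "SGNL_V"))).as("SGNL")
      )

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]") // local session for illustration only
      .appName("merge-signals")
      .getOrCreate()
    import spark.implicits._

    // Sample rows modelled on the question's data, kept as strings.
    val df = Seq(
      ("2019-11-21 18:19:15.817", "Acc", "2019-11-21 18:18:16.645", "0.0"),
      ("2019-11-21 18:19:15.817", "Acc", "2019-11-21 18:18:17.645", "0.0")
    ).toDF("BaseTime", "SGNL_NAME", "SGNL_TIME", "SGNL_V")

    mergeByBaseTime(df).show(false)
    spark.stop()
  }
}
```

With the sample rows, the result has a single row for the shared BaseTime whose SGNL column holds a JSON array with two objects. If a stable element order matters, sorting the collected array (for example by SGNL_TIME) before serializing is one option.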
Answer 2 (score: 1)
An alternative to a UDF is to use functions from the org.apache.spark.sql.functions package, such as to_json(), struct(), and array(). Here is a complete working example:
val df = sc.parallelize(Seq(
("2019-11-21 18:19:15.817", "Acc", "2019-11-21 18:18:16.645", 0.0)
)).toDF("BaseTime", "SGNL_NAME", "SGNL_TIME", "SGNL_V")
val result = df.withColumn("SGNL", to_json(
array(
struct("SGNL_NAME", "SGNL_TIME", "SGNL_V")
)
)).drop("SGNL_NAME","SGNL_TIME","SGNL_V")
result.show(false)
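This gives the expected structure (note that SGNL_TIME stays a formatted string here rather than epoch milliseconds):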
+-----------------------+------------------------------------------------------------------------+
|BaseTime |SGNL |
+-----------------------+------------------------------------------------------------------------+
|2019-11-21 18:19:15.817|[{"SGNL_NAME":"Acc","SGNL_TIME":"2019-11-21 18:18:16.645","SGNL_V":0.0}]|
+-----------------------+------------------------------------------------------------------------+
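The question's expected output shows SGNL_TIME as an epoch-milliseconds number and SGNL_V as a numeric value, while the input schema has a timestamp and a string. A sketch of one way to get there with built-in functions only is shown below. The object name SgnlToJson and the helper withSgnlJson are hypothetical, and the UTC session time zone is an assumption made so that the string-to-timestamp cast in the sample data is deterministic.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{array, col, round, struct, to_json}

// Hypothetical helper object, not part of the original answers.
object SgnlToJson {
  // Expects SGNL_TIME as a timestamp and SGNL_V as a string, as in the
  // question's input schema.
  def withSgnlJson(df: DataFrame): DataFrame =
    df
      // A timestamp cast to double is seconds since the epoch with a
      // fractional part; scale to milliseconds and round before casting
      // to long to avoid floating-point truncation.
      .withColumn("SGNL_TIME", round(col("SGNL_TIME").cast("double") * 1000).cast("long"))
      // Make the value numeric so it is serialized without quotes.
      .withColumn("SGNL_V", col("SGNL_V").cast("double"))
      .withColumn("SGNL", to_json(array(struct("SGNL_NAME", "SGNL_TIME", "SGNL_V"))))
      .drop("SGNL_NAME", "SGNL_TIME", "SGNL_V")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]") // local session for illustration only
      .appName("sgnl-to-json")
      .config("spark.sql.session.timeZone", "UTC") // assumption: interpret the sample strings as UTC
      .getOrCreate()
    import spark.implicits._

    // One sample row, cast to match the question's input schema.
    val df = Seq(
      ("2019-11-21 18:19:15.817", "Acc", "2019-11-21 18:18:16.645", "0.0")
    ).toDF("BaseTime", "SGNL_NAME", "SGNL_TIME", "SGNL_V")
      .withColumn("BaseTime", col("BaseTime").cast("timestamp"))
      .withColumn("SGNL_TIME", col("SGNL_TIME").cast("timestamp"))

    withSgnlJson(df).show(false)
    spark.stop()
  }
}
```

The resulting SGNL column is still a JSON string. The millisecond value depends on the time zone used to interpret the timestamps, so it will differ from the sample number in the question unless the zones match.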