How do I create new columns from other column values in a Spark DataFrame?

Date: 2017-07-26 10:30:27

Tags: apache-spark spark-dataframe apache-spark-dataset

The input Spark DataFrame, dataframe1:

+-----+---------------+------------------------------------------------------------------------------------------------------------------+
|table|  err_timestamp|                 err_message                                                                                      |
+-----+---------------+------------------------------------------------------------------------------------------------------------------+
|   t1|7/26/2017 13:56|[error = RI_VIOLATION, field = user_id, value = 'null']                                                           |
|   t2|7/26/2017 13:58|[error = NULL_CHECK, field = geo_id, value = 'null'] [error = DATATYPE_CHECK, field = emp_id, value = 'FIWOERE8'] |
+-----+---------------+------------------------------------------------------------------------------------------------------------------+

The desired output, dataframe2, pivots each error entry in err_message into its own row with separate columns, as shown below.

+-----+--------------+---------+--------------+-----------+
|table|      err_date|err_field|      err_type|  err_value|
+-----+--------------+---------+--------------+-----------+
|   t1|7/26/2017 0:00|  user_id|  RI_VIOLATION|       null|
|   t2|7/26/2017 0:00|   geo_id|    NULL_CHECK|       null|
|   t2|7/26/2017 0:00|   emp_id|DATATYPE_CHECK|   FIWOERE8|
+-----+--------------+---------+--------------+-----------+

2 Answers:

Answer 0 (score: 1):

Here is a solution that does what you need; in some cases you could still reduce the number of steps.

import org.apache.spark.sql.functions._
import spark.implicits._

// create dummy data matching the input DataFrame
val df = spark.sparkContext.parallelize(Seq(
  ("t1", "7/26/2017 13:56", "[error = RI_VIOLATION, field = user_id, value = null]"),
  ("t2", "7/26/2017 13:58", "[error = NULL_CHECK, field = geo_id, value = null] [error = DATATYPE_CHECK, field = emp_id, value = FIWOERE8]")
)).toDF("table", "err_timestamp", "err_message")

// UDF that splits the err_message string into an array of per-error strings
val splitValue = udf((value: String) => {
  "\\[(.*?)\\]".r.findAllMatchIn(value)
    .map(x => x.toString.replaceAll("\\[", "").replaceAll("\\]", "")) // strip the surrounding brackets
    .toSeq
})
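
To see what the UDF produces, here is a minimal plain-Scala sketch (runnable in any Scala REPL, no Spark needed) that applies the same regex to the second sample row; using group(1) is an equivalent way to drop the brackets without the two replaceAll calls:

val msg = "[error = NULL_CHECK, field = geo_id, value = null] " +
  "[error = DATATYPE_CHECK, field = emp_id, value = FIWOERE8]"

val parts = "\\[(.*?)\\]".r.findAllMatchIn(msg)
  .map(_.group(1)) // the capture group is the text between the brackets
  .toSeq
// parts: Seq(error = NULL_CHECK, field = geo_id, value = null,
//            error = DATATYPE_CHECK, field = emp_id, value = FIWOERE8)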

// explode the array so each error string becomes its own row
val df1 = df.withColumn("err_message", explode(splitValue($"err_message")))

df1.show(false)
+-----+---------------+--------------------------------------------------------+
|table|err_timestamp  |err_message                                             |
+-----+---------------+--------------------------------------------------------+
|t1   |7/26/2017 13:56|error = RI_VIOLATION, field = user_id, value = null     |
|t2   |7/26/2017 13:58|error = NULL_CHECK, field = geo_id, value = null        |
|t2   |7/26/2017 13:58|error = DATATYPE_CHECK, field = emp_id, value = FIWOERE8|
+-----+---------------+--------------------------------------------------------+

// split each error string on "," into its three "key = value" parts
val splitExpr = split($"err_message", ",")

// create three new columns by taking the value side of each "key = value" pair
df1.withColumn("err_field", split(splitExpr(1), "=")(1))
  .withColumn("err_type", split(splitExpr(0), "=")(1))
  .withColumn("err_value", split(splitExpr(2), "=")(1))
  .drop("err_message")
  .show(false)

Output:

+-----+---------------+---------+---------------+---------+
|table|err_timestamp  |err_field|err_type       |err_value|
+-----+---------------+---------+---------------+---------+
|t1   |7/26/2017 13:56| user_id | RI_VIOLATION  | null    |
|t2   |7/26/2017 13:58| geo_id  | NULL_CHECK    | null    |
|t2   |7/26/2017 13:58| emp_id  | DATATYPE_CHECK| FIWOERE8|
+-----+---------------+---------+---------------+---------+
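
Note that the extracted values still carry the surrounding spaces left over from splitting on "=". If you need them cleaned up, a small variant (an untested sketch reusing the same splitExpr) wraps each value in Spark's built-in trim:

// same three columns, with surrounding whitespace trimmed (sketch)
df1.withColumn("err_field", trim(split(splitExpr(1), "=")(1)))
  .withColumn("err_type", trim(split(splitExpr(0), "=")(1)))
  .withColumn("err_value", trim(split(splitExpr(2), "=")(1)))
  .drop("err_message")
  .show(false)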

Hope this helps!

Answer 1 (score: -1):
