将输入Spark Dataframe格式转换为dataframe1
+-----+---------------+------------------------------------------------------------------------------------------------------------------+
|table| err_timestamp| err_message |
+-----+---------------+------------------------------------------------------------------------------------------------------------------+
| t1|7/26/2017 13:56|[error = RI_VIOLATION, field = user_id, value = 'null'] |
| t2|7/26/2017 13:58|[error = NULL_CHECK, field = geo_id, value = 'null'] [error = DATATYPE_CHECK, field = emp_id, value = 'FIWOERE8'] |
+-----+---------------+------------------------------------------------------------------------------------------------------------------+
输出dataframe2作为整个行和列的转置,如下所示。
+-----+--------------+---------+--------------+-----------+
|table| err_date|err_field| err_type| err_value|
+-----+--------------+---------+--------------+-----------+
| t1|7/26/2017 0:00| user_id| RI_VOILATION| null|
| t2|7/26/2017 0:00| geo_id| NULL_CHECK| null|
| t2|7/26/2017 0:00| emp_id|DATATYPE_CHECK|FDSADFSDA68|
+-----+--------------+---------+--------------+-----------+
答案 0 :(得分:1)
以下是您需要的解决方案,在某些情况下您仍然可以最小化步骤。
import spark.implicits._
//create dummy data
val df = spark.sparkContext.parallelize(Seq(
("t1", "7/26/2017 13:56", "[error = RI_VIOLATION, field = user_id, value = null]"),
("t2", "7/26/2017 13:58", "[error = NULL_CHECK, field = geo_id, value = null] [error = DATATYPE_CHECK, field = emp_id, value = FIWOERE8]")
)).toDF("table", "err_timestamp", "err_message")
//create a udf to split string and create a array of string
val splitValue = udf ((value : String ) => {
"\\[(.*?)\\]".r.findAllMatchIn(value)
.map(x => x.toString().replaceAll("\\[", "").replaceAll("\\]", ""))
.toSeq
})
//update a column with explode to arrays of string
val df1 = df.withColumn("err_message", explode(splitValue($"err_message")))
df1.show(false)
+-----+---------------+--------------------------------------------------------+
|table|err_timestamp |err_message |
+-----+---------------+--------------------------------------------------------+
|t1 |7/26/2017 13:56|error = RI_VIOLATION, field = user_id, value = null |
|t2 |7/26/2017 13:58|error = NULL_CHECK, field = geo_id, value = null |
|t2 |7/26/2017 13:58|error = DATATYPE_CHECK, field = emp_id, value = FIWOERE8|
+-----+---------------+--------------------------------------------------------+
val splitExpr = split($"err_message", ",")
//create a three new columns with splitting in key value
df1.withColumn("err_field", split(splitExpr(1), "=")(1))
.withColumn("err_type", split(splitExpr(0), "=")(1))
.withColumn("err_value", split(splitExpr(2), "=")(1))
.drop("err_message")
.show(false)
输出:
+-----+---------------+---------+---------------+---------+
|table|err_timestamp |err_field|err_type |err_value|
+-----+---------------+---------+---------------+---------+
|t1 |7/26/2017 13:56| user_id | RI_VIOLATION | null |
|t2 |7/26/2017 13:58| geo_id | NULL_CHECK | null |
|t2 |7/26/2017 13:58| emp_id | DATATYPE_CHECK| FIWOERE8|
+-----+---------------+---------+---------------+---------+
希望这有帮助!
答案 1 :(得分:-1)
{{1}}