Spark DataFrame: select values from multiple columns based on a condition

Date: 2019-11-24 17:09:39

Tags: apache-spark apache-spark-sql

Data schema

root
|-- id: string (nullable = true)
|-- col1: string (nullable = true)
|-- col2: string (nullable = true)


|id|col1         |col2               |
|1 |["x","y","z"]|[123,"null","null"]|

From the data above, I want to filter for rows where x exists in col1 and pull the corresponding value for x from col2. (The values in col1 and col2 are position-aligned: if x is at index 2 in col1, its corresponding value is at index 2 in col2.)

Expected result (col1 and col2 need to remain array type):

|id |col1 |col2 |
|1  |["x"]|[123]|

If x does not exist in col1, the result should look like:

|id|col1    |col2    |
|1 |["null"]|["null"]|

What I have tried:

val df1 = df.withColumn("result",when($"col1".contains("x"),"X").otherwise("null"))

2 Answers:

Answer 0 (score: 1)

The trick is to convert your data from plain string columns into a more useful data structure. Once col1 and col2 are rebuilt as arrays (or as a map, which your desired output suggests), you can use Spark's built-in functions instead of the messy UDFs suggested by @baitmbarek.

To start, use trim and split to convert col1 and col2 into arrays:

scala> val df = Seq(
     |       ("1", """["x","y","z"]""","""[123,"null","null"]"""),
     |         ("2", """["a","y","z"]""","""[123,"null","null"]""")
     |     ).toDF("id","col1","col2")
df: org.apache.spark.sql.DataFrame = [id: string, col1: string ... 1 more field]

scala> val df_array = df.withColumn("col1", split(trim($"col1", "[\"]"), "\"?,\"?"))
     |                     .withColumn("col2", split(trim($"col2", "[\"]"), "\"?,\"?"))
df_array: org.apache.spark.sql.DataFrame = [id: string, col1: array<string> ... 1 more field]

scala> df_array.show(false)
+---+---------+-----------------+
|id |col1     |col2             |
+---+---------+-----------------+
|1  |[x, y, z]|[123, null, null]|
|2  |[a, y, z]|[123, null, null]|
+---+---------+-----------------+


scala> df_array.printSchema
root
 |-- id: string (nullable = true)
 |-- col1: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- col2: array (nullable = true)
 |    |-- element: string (containsNull = true)

From here, you should be able to use array_position to find the index of 'x' in col1 (if any) and retrieve the matching value from col2 (a sketch of that route follows this answer). However, converting the two arrays into a map first should make it clearer what your code is doing:

scala> val df_map = df_array.select(
                        $"id", 
                        map_from_entries(arrays_zip($"col1", $"col2")).as("col_map")
                        )
df_map: org.apache.spark.sql.DataFrame = [id: string, col_map: map<string,string>]

scala> df_map.show(false)
+---+--------------------------------+
|id |col_map                         |
+---+--------------------------------+
|1  |[x -> 123, y -> null, z -> null]|
|2  |[a -> 123, y -> null, z -> null]|
+---+--------------------------------+

scala> val df_final = df_map.select(
                                $"id",
                                when(isnull(element_at($"col_map", "x")), 
                                    array(lit("null")))
                                .otherwise(
                                    array(lit("x")))
                                .as("col1"),  
                                when(isnull(element_at($"col_map", "x")), 
                                    array(lit("null")))
                                .otherwise(
                                    array(element_at($"col_map", "x")))
                                .as("col2")
                                )
df_final: org.apache.spark.sql.DataFrame = [id: string, col1: array<string> ... 1 more field]

scala> df_final.show
+---+------+------+
| id|  col1|  col2|
+---+------+------+
|  1|   [x]| [123]|
|  2|[null]|[null]|
+---+------+------+

scala> df_final.printSchema
root
 |-- id: string (nullable = true)
 |-- col1: array (nullable = false)
 |    |-- element: string (containsNull = false)
 |-- col2: array (nullable = false)
 |    |-- element: string (containsNull = true)
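
For completeness, here's a minimal sketch of the array_position route mentioned above, assuming Spark 2.4+ and reusing the df_array DataFrame built earlier (the idx and df_pos names are mine):

import org.apache.spark.sql.functions._

// array_position is 1-based and returns 0 when "x" is absent.
val idx = array_position($"col1", "x")

// element_at expects an integer index, so cast the Long from array_position.
val df_pos = df_array.select(
  $"id",
  when(idx > 0, array(lit("x"))).otherwise(array(lit("null"))).as("col1"),
  when(idx > 0, array(element_at($"col2", idx.cast("int"))))
    .otherwise(array(lit("null"))).as("col2")
)

This should produce the same result as df_final without building the intermediate map.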

Answer 1 (score: 0)

I'm not proud of my code, but you can give it a try:

import sparkSession.implicits._
import org.apache.spark.sql.functions._

val df = Seq(
  ("1", """["x","y","z"]""", """[123,"null","null"]"""),
  ("2", """["a","y","z"]""", """[123,"null","null"]""")
).toDF("id", "col1", "col2")

//step 2 : we define an UDF to find x's index and then, when it exists, the value in col2 at same index
val retrievePosX = udf{(col2: Seq[String], col1: Seq[String]) => col1.zipWithIndex.find(_._1 == "\"x\"")
      .map{case (_, xpos) =>
        Seq(col2(xpos))
      }.getOrElse(Seq("\"null\""))}

//step 3 : when x is present in col1, col1 is reduced to ["x"]; when x is missing, col1 is set to ["null"] below by copying col2. Could be way simpler, but not sure what you intend to do, so creating a udf for this could make sense (or not)
val keepXinCol1 = udf{col: Seq[String] =>
      col.find(_ == "\"x\"").map(Seq(_)).getOrElse(Seq.empty)}

//step 1 : col1 should become an array
df.withColumn("col1", split(trim($"col1","[]"), ","))
      .withColumn("col2", retrievePosX(split(trim($"col2","[]"), ","), $"col1"))
      .withColumn("col1", when($"col2" === array(lit("\"null\"")), $"col2").otherwise(keepXinCol1($"col1")))
      .show

Output:

+---+--------+--------+
| id|    col1|    col2|
+---+--------+--------+
|  1|   ["x"]|   [123]|
|  2|["null"]|["null"]|
+---+--------+--------+
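
As a side note (my own sketch, not part of the answer above): the two UDFs and the final when could be collapsed into a single UDF that returns both columns together as a struct, which keeps col1 and col2 in sync by construction. The findX name is hypothetical:

import sparkSession.implicits._
import org.apache.spark.sql.functions._

// Hypothetical single-UDF variant: returns (new col1, new col2) as one struct.
// Tokens keep their embedded quotes, as in the answer above.
val findX = udf { (col1: Seq[String], col2: Seq[String]) =>
  col1.indexOf("\"x\"") match {
    case -1 => (Seq("\"null\""), Seq("\"null\""))
    case i  => (Seq("\"x\""), Seq(col2(i)))
  }
}

df.withColumn("col1", split(trim($"col1", "[]"), ","))
  .withColumn("col2", split(trim($"col2", "[]"), ","))
  .withColumn("r", findX($"col1", $"col2"))
  .select($"id", $"r._1".as("col1"), $"r._2".as("col2"))
  .show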