Data schema:
root
|-- id: string (nullable = true)
|-- col1: string (nullable = true)
|-- col2: string (nullable = true)
|id|col1 |col2 |
|1 |["x","y","z"]|[123,"null","null"]|
From the data above, I want to filter for the position where "x" exists in col1, together with the corresponding value for "x" from col2. (The values of col1 and col2 are aligned by position: if "x" is at index 2 in col1, its value in col2 is also at index 2.)
Expected result (col1 and col2 should keep the array type):
|id |col1 |col2 |
|1 |["x"]|[123]|
If "x" does not exist in col1, I need a result like:
|id| col1    |col2    |
|1 |["null"] |["null"]|
What I have tried:
val df1 = df.withColumn("result",when($"col1".contains("x"),"X").otherwise("null"))
Answer 0 (score: 1)
The trick is to convert your data from dumb string columns into a more useful data structure. Once col1 and col2 are rebuilt as arrays (or as a map, as your desired output suggests), you can use Spark's built-in functions rather than the messy UDFs suggested by @baitmbarek.
To start, use trim and split to convert col1 and col2 into arrays (the two-argument trim(column, trimString) variant, available since Spark 2.3, strips the given characters from both ends):
scala> val df = Seq(
| ("1", """["x","y","z"]""","""[123,"null","null"]"""),
| ("2", """["a","y","z"]""","""[123,"null","null"]""")
| ).toDF("id","col1","col2")
df: org.apache.spark.sql.DataFrame = [id: string, col1: string ... 1 more field]
scala> val df_array = df.withColumn("col1", split(trim($"col1", "[\"]"), "\"?,\"?")).
     |                    withColumn("col2", split(trim($"col2", "[\"]"), "\"?,\"?"))
df_array: org.apache.spark.sql.DataFrame = [id: string, col1: array<string> ... 1 more field]
scala> df_array.show(false)
+---+---------+-----------------+
|id |col1 |col2 |
+---+---------+-----------------+
|1 |[x, y, z]|[123, null, null]|
|2 |[a, y, z]|[123, null, null]|
+---+---------+-----------------+
scala> df_array.printSchema
root
|-- id: string (nullable = true)
|-- col1: array (nullable = true)
| |-- element: string (containsNull = true)
|-- col2: array (nullable = true)
| |-- element: string (containsNull = true)
From here, you should be able to use array_position to find the index of 'x' in col1 (if any) and retrieve the matching data from col2 (a sketch of that route follows at the end of this answer). However, converting the two arrays into a map first should make it clearer what your code is doing:
scala> val df_map = df_array.select(
     |   $"id",
     |   map_from_entries(arrays_zip($"col1", $"col2")).as("col_map")
     | )
df_map: org.apache.spark.sql.DataFrame = [id: string, col_map: map<string,string>]
scala> df_map.show(false)
+---+--------------------------------+
|id |col_map |
+---+--------------------------------+
|1 |[x -> 123, y -> null, z -> null]|
|2 |[a -> 123, y -> null, z -> null]|
+---+--------------------------------+
Since element_at on a map returns null when the key is absent, a single isnull check handles both the "x found" and "x missing" cases:
scala> val df_final = df_map.select(
     |   $"id",
     |   when(isnull(element_at($"col_map", "x")), array(lit("null")))
     |     .otherwise(array(lit("x")))
     |     .as("col1"),
     |   when(isnull(element_at($"col_map", "x")), array(lit("null")))
     |     .otherwise(array(element_at($"col_map", "x")))
     |     .as("col2")
     | )
df_final: org.apache.spark.sql.DataFrame = [id: string, col1: array<string> ... 1 more field]
scala> df_final.show
+---+------+------+
| id| col1| col2|
+---+------+------+
| 1| [x]| [123]|
| 2|[null]|[null]|
+---+------+------+
scala> df_final.printSchema
root
|-- id: string (nullable = true)
|-- col1: array (nullable = false)
| |-- element: string (containsNull = false)
|-- col2: array (nullable = false)
| |-- element: string (containsNull = true)
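For completeness, here is a minimal sketch of the array_position route mentioned above. This is my own untested addition, assuming the same Spark 2.4+ shell session (array_position returns a 1-based long index, 0 when the value is not found, and element_at on an array wants an int index, hence the cast):

// sketch only: look up "x" by position instead of going through a map
val xpos = array_position($"col1", "x")
val df_alt = df_array.select(
  $"id",
  when(xpos > 0, array(lit("x"))).otherwise(array(lit("null"))).as("col1"),
  when(xpos > 0, array(element_at($"col2", xpos.cast("int"))))
    .otherwise(array(lit("null"))).as("col2")
)

It should produce the same shape of result as df_final above.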
Answer 1 (score: 0)
I'm not proud of my code, but you can give it a try:
import sparkSession.implicits._
import org.apache.spark.sql.functions._
val df = Seq(
("1", """["x","y","z"]""","""[123,"null","null"]"""),
("2", """["a","y","z"]""","""[123,"null","null"]""")
).toDF("id","col1","col2")
//step 2 : we define an UDF to find x's index and then, when it exists, the value in col2 at same index
val retrievePosX = udf{(col2: Seq[String], col1: Seq[String]) => col1.zipWithIndex.find(_._1 == "\"x\"")
.map{case (_, xpos) =>
Seq(col2(xpos))
}.getOrElse(Seq("\"null\""))}
//step 3 : when x is missing from col1, col1 is set to ["null"]. Could be way simpler but not sure what you intend to do, so creating an udf for this could make sense (or not)
val keepXinCol1 = udf{col: Seq[String] =>
col.find(_ == "\"x\"").map(Seq(_)).getOrElse(Seq.empty)}
//step 1 : col1 should become an array
df.withColumn("col1", split(trim($"col1","[]"), ","))
.withColumn("col2", retrievePosX(split(trim($"col2","[]"), ","), $"col1"))
.withColumn("col1", when($"col2" === array(lit("\"null\"")), $"col2").otherwise(keepXinCol1($"col1")))
.show
Output:
+---+--------+--------+
| id| col1| col2|
+---+--------+--------+
| 1| ["x"]| [123]|
| 2|["null"]|["null"]|
+---+--------+--------+
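Note that this version keeps the quote characters from the original strings, so col1 holds the literal element "x" rather than x as in the first answer. If you want the bare values, one possible cleanup (my own assumption, using the Spark 2.4+ transform higher-order function through expr; result is a hypothetical name for the DataFrame built above, since the original chain ends in .show) would be:

// hypothetical cleanup: strip the surviving double quotes from every array element
val cleaned = result
  .withColumn("col1", expr("""transform(col1, e -> regexp_replace(e, '"', ''))"""))
  .withColumn("col2", expr("""transform(col2, e -> regexp_replace(e, '"', ''))"""))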