I have the pyspark dataframe below, and I need to create a new column (new_col) containing the items that appear in both columns X and Y, excluding any items that appear in column Z.
df
id  X             Y                  Z     new_col
1   [12,23,1,24]  [13,412,12,23,24]  [12]  [23,24]
2   [1,2,3]       [2,4,5,6]          []    [2]
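For reference, the example dataframe (without the desired new_col output) can be reproduced with something like the sketch below, assuming an active SparkSession named spark:

df = spark.createDataFrame(
    [(1, [12, 23, 1, 24], [13, 412, 12, 23, 24], [12]),
     (2, [1, 2, 3], [2, 4, 5, 6], [])],
    ["id", "X", "Y", "Z"])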
Answer 0 (score: 3)
If your schema is as follows:
df.printSchema()
#root
# |-- id: long (nullable = true)
# |-- X: array (nullable = true)
# | |-- element: long (containsNull = true)
# |-- Y: array (nullable = true)
# | |-- element: long (containsNull = true)
# |-- Z: array (nullable = true)
# | |-- element: long (containsNull = true)
and your pyspark version is 2.4+, you can use array_intersect and array_except:
from pyspark.sql.functions import array_except, array_intersect
df=df.withColumn("new_col", array_except(array_intersect("X", "Y"), "Z"))
df.show()
#+---+---------------+---------------------+----+--------+
#|id |X |Y |Z |new_col |
#+---+---------------+---------------------+----+--------+
#|1 |[12, 23, 1, 24]|[13, 412, 12, 23, 24]|[12]|[23, 24]|
#|2 |[1, 2, 3] |[2, 4, 5, 6] |[] |[2] |
#+---+---------------+---------------------+----+--------+
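Since array_intersect and array_except are also available as Spark SQL functions in 2.4+, the same logic can be written as a single SQL expression via expr:

from pyspark.sql.functions import expr
df = df.withColumn("new_col", expr("array_except(array_intersect(X, Y), Z)"))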
Answer 1 (score: 0)
You can use withColumn + a udf:
import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType, ArrayType

def intersection_function(list1, list2):
    # keep only the values of list1 that also appear in list2
    intersection_list = [value for value in list1 if value in list2]
    return intersection_list

# register the python function as a udf returning an array of integers
udf_intersection = F.udf(intersection_function, ArrayType(IntegerType()))
newdf = df.withColumn("new_col", udf_intersection(df["X"], df["Y"]))
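Note that this udf only computes the intersection of the two array columns; it does not exclude the Z items the question asks for. A minimal sketch of one way to extend it, assuming the arrays hold longs (per the schema above), could be:

from pyspark.sql.types import LongType

def intersect_except(x, y, z):
    # items present in both x and y, but not in z
    return [v for v in x if v in y and v not in z]

udf_intersect_except = F.udf(intersect_except, ArrayType(LongType()))
newdf = df.withColumn("new_col", udf_intersect_except(df["X"], df["Y"], df["Z"]))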