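I want to match the intersecting values of two DataFrame columns for each unique_ID, store the intersecting values in new_column-1, and also get the count of the intersecting values in new_column_3. My DataFrame is given below. I am running this code in PySpark (Databricks), and I do not know how to write the intersection logic in PySpark. Thank you for any prompt reply or support.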
     Pos_id  Emp_id  skill_list_p  skill_list_e
0 0       1     100  [a]           [a, f, d]
  3       1     101  [a]           [a, b, e]
  6       1     102  [a]           [b, d, c]
1 0       2     100  [d, b]        [a, f, d]
  3       2     101  [d, b]        [a, b, e]
  6       2     102  [d, b]        [b, d, c]
3 0       3     100  [c, d, a]     [a, f, d]
  3       3     101  [c, d, a]     [a, b, e]
  6       3     102  [c, d, a]     [b, d, c]
6 0       4     100  [a, b]        [a, f, d]
  3       4     101  [a, b]        [a, b, e]
  6       4     102  [a, b]        [b, d, c]
The expected output is attached:
     Pos_id  Emp_id  skill_list_p   skill_list_e   Matched  Matched_skills_list  Matched_Skills
0 0       1     100  ['a']          ['a' 'f' 'd']        1  {'a'}                a
0 3       1     101  ['a']          ['a' 'b' 'e']        1  {'a'}                a
0 6       1     102  ['a']          ['b' 'd' 'c']        0  set()
1 0       2     100  ['d' 'b']      ['a' 'f' 'd']        1  {'d'}                d
1 3       2     101  ['d' 'b']      ['a' 'b' 'e']        1  {'b'}                b
1 6       2     102  ['d' 'b']      ['b' 'd' 'c']        2  {'d', 'b'}           d,b
3 0       3     100  ['c' 'd' 'a']  ['a' 'f' 'd']        2  {'a', 'd'}           a,d
3 3       3     101  ['c' 'd' 'a']  ['a' 'b' 'e']        1  {'a'}                a
3 6       3     102  ['c' 'd' 'a']  ['b' 'd' 'c']        2  {'c', 'd'}           c,d
6 0       4     100  ['a' 'b']      ['a' 'f' 'd']        1  {'a'}                a
6 3       4     101  ['a' 'b']      ['a' 'b' 'e']        2  {'a', 'b'}           a,b
6 6       4     102  ['a' 'b']      ['b' 'd' 'c']        1  {'b'}                b
Answer 0 (score: 0)
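It may help to think about how you would do this in SQL. DataFrames are designed to behave like tables. The goal described is to create a new column that is the result of a transformation applied to two existing columns.
In SQL, this would look like: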
select emp_id, transformation(skill_list_p, skill_list_e) as common_skills from ...
Given this approach, I suggest looking at the User Defined Functions (UDFs) available in Apache Spark™.
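As a rough illustration of what the transformation placeholder in the query above might compute for a single row, here is a plain-Python sketch. The function name common_skills is hypothetical, and two choices in it are assumptions rather than anything stated in the question: a missing list is treated as having no skills in common, and the result keeps the order of skill_list_p.

```python
def common_skills(skill_list_p, skill_list_e):
    """Return the skills present in both lists, in the order of skill_list_p."""
    # Assumption: a missing (None) list means "no skills in common".
    if skill_list_p is None or skill_list_e is None:
        return []
    wanted = set(skill_list_e)  # set membership checks are O(1) on average
    seen = set()                # skills already emitted, to avoid duplicates
    result = []
    for skill in skill_list_p:
        if skill in wanted and skill not in seen:
            seen.add(skill)
            result.append(skill)
    return result


# A few rows taken from the question: (Pos_id, Emp_id, skill_list_p, skill_list_e)
rows = [
    (1, 102, ["a"], ["b", "d", "c"]),
    (2, 102, ["d", "b"], ["b", "d", "c"]),
    (3, 100, ["c", "d", "a"], ["a", "f", "d"]),
]
for pos_id, emp_id, skills_p, skills_e in rows:
    matched = common_skills(skills_p, skills_e)
    # Print the ids, the match count, and the matches as a comma-separated string.
    print(pos_id, emp_id, len(matched), ",".join(matched))
```

Wrapped with pyspark.sql.functions.udf and a return type of ArrayType(StringType()), a function like this could play the role of transformation. The match count is then just the length of the returned list.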
Answer 1 (score: 0)
The simplest way is to use udf from pyspark.sql.functions.
Here is an example.
from pyspark.sql import functions as F
from pyspark.sql import types as T

# Declare a udf that uses Python's set.intersection() to find the intersection of two arrays.
array_intersect = F.udf(lambda r1, r2: list(set(r1).intersection(set(r2))),
                        T.ArrayType(T.StringType()))

# Use the udf declared above to generate a new column holding the intersection of
# skill_list_p and skill_list_e.
df = df.withColumn('matched_skill_list',
                   array_intersect(F.col('skill_list_p'), F.col('skill_list_e')))

# Calculate the size of the intersection.
df = df.withColumn('matched', F.size(F.col('matched_skill_list')))

# Show the result. df.show() prints the table itself and returns None,
# so it does not need to be wrapped in print().
df.show()
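On Spark 2.4 or later, the same columns can probably be produced without a Python UDF, using the built-in array_intersect, size and concat_ws functions from pyspark.sql.functions. The sketch below assumes df has the array<string> columns skill_list_p and skill_list_e from the question. The helper name add_match_columns is hypothetical, and the output column names are taken from the expected output in the question.

```python
def add_match_columns(df):
    """Add Matched_skills_list, Matched and Matched_Skills columns to df.

    Assumes df is a PySpark DataFrame with array<string> columns
    skill_list_p and skill_list_e, and that Spark is 2.4 or later.
    """
    # Imported inside the function so this sketch can be loaded on a machine
    # where pyspark is not installed; a module-level import works equally well.
    from pyspark.sql import functions as F

    # Built-in array intersection (no Python UDF involved).
    matched = F.array_intersect(F.col("skill_list_p"), F.col("skill_list_e"))

    return (
        df.withColumn("Matched_skills_list", matched)
        # Number of elements in the intersection.
        .withColumn("Matched", F.size(F.col("Matched_skills_list")))
        # Comma-separated string of the matched skills.
        .withColumn("Matched_Skills", F.concat_ws(",", F.col("Matched_skills_list")))
    )
```

Calling add_match_columns(df).show() should then display the three extra columns. Built-in functions are evaluated inside the JVM, so they usually avoid the serialization overhead that a Python UDF adds for every row.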