I have two Spark dataframes:
df1 = sc.parallelize([
    ['a', '1', 'value1'],
    ['b', '1', 'value2'],
    ['c', '2', 'value3'],
    ['d', '4', 'value4'],
    ['e', '2', 'value5'],
    ['f', '4', 'value6']
]).toDF(('id1', 'id2', 'v1'))

df2 = sc.parallelize([
    ['a', '1', 1],
    ['b', '1', 1],
    ['y', '2', 4],
    ['z', '2', 4]
]).toDF(('id1', 'id2', 'v2'))
Each dataframe has id1 and id2 fields (and there may be many more ID fields). First, I need to match df1 with df2 by id1. Then I need to match all the records left unmatched in both dataframes by id2, and so on.
My approach is:
def joinA(df1, df2, field):
    from pyspark.sql.functions import lit
    L = 'L_'
    R = 'R_'
    Lfield = L + field
    Rfield = R + field
    # take the field names
    df1n = df1.schema.names
    df2n = df2.schema.names
    newL = [L + fld for fld in df1n]
    newR = [R + fld for fld in df2n]
    # drop duplicates by the input field
    df1 = df1.toDF(*newL).dropDuplicates([Lfield])
    df2 = df2.toDF(*newR).dropDuplicates([Rfield])
    # full outer join on the input field
    df_full = df1.join(df2, df1[Lfield] == df2[Rfield], how='outer').cache()
    # unmatched records from df1
    df_left = df_full.where(df2[Rfield].isNull()).select(newL).toDF(*df1n)
    # unmatched records from df2
    df_right = df_full.where(df1[Lfield].isNull()).select(newR).toDF(*df2n)
    # matched records, tagged with the match level
    df_inner = df_full.where(
        df1[Lfield].isNotNull() & df2[Rfield].isNotNull()
    ).withColumn('matchlevel', lit(field))
    return df_left, df_inner, df_right

first_l, first_i, first_r = joinA(df1, df2, 'id1')
second_l, second_i, second_r = joinA(first_l, first_r, 'id2')
result = first_i.union(second_i)
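For intuition, the cascade above can be sketched in plain Python. This is an illustration of the idea only, not the Spark job: row dicts stand in for dataframes, `cascade_match` is a made-up helper name, and the dropDuplicates behavior is only approximated (the right side keeps the first row per key; duplicate left rows can all match).

```python
def cascade_match(rows1, rows2, fields):
    """Match each row of rows1 to a row of rows2 on fields[0]; retry the
    unmatched rows on fields[1], and so on. Returns (matched triples
    tagged with the field that matched, unmatched left, unmatched right)."""
    matched, left, right = [], list(rows1), list(rows2)
    for field in fields:
        # index the right side by the current field; first occurrence wins,
        # roughly mirroring dropDuplicates in the PySpark version
        index = {}
        for r in right:
            index.setdefault(r[field], r)
        next_left, used = [], set()
        for row in left:
            r = index.get(row[field])
            if r is not None:
                matched.append((row, r, field))  # field == match level
                used.add(id(r))
            else:
                next_left.append(row)
        left = next_left
        right = [r for r in right if id(r) not in used]
    return matched, left, right

# the question's sample data as plain dicts
df1_rows = [
    {'id1': 'a', 'id2': '1', 'v1': 'value1'},
    {'id1': 'b', 'id2': '1', 'v1': 'value2'},
    {'id1': 'c', 'id2': '2', 'v1': 'value3'},
    {'id1': 'd', 'id2': '4', 'v1': 'value4'},
    {'id1': 'e', 'id2': '2', 'v1': 'value5'},
    {'id1': 'f', 'id2': '4', 'v1': 'value6'},
]
df2_rows = [
    {'id1': 'a', 'id2': '1', 'v2': 1},
    {'id1': 'b', 'id2': '1', 'v2': 1},
    {'id1': 'y', 'id2': '2', 'v2': 4},
    {'id1': 'z', 'id2': '2', 'v2': 4},
]

matched, left_over, right_over = cascade_match(df1_rows, df2_rows, ['id1', 'id2'])
print([(m[0]['id1'], m[2]) for m in matched])
# [('a', 'id1'), ('b', 'id1'), ('c', 'id2'), ('e', 'id2')]
```

a and b match on id1; c and e fall through to an id2 match; d and f stay unmatched.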
Is there a way to make this simpler? Or some standard tool for this job?
Thanks,
Mark
Answer 0 (score: 0)
I have another approach... but I'm not sure it's any better than your solution:
from pyspark.sql import functions as F

id_cols = [col for col in df1.columns if col != 'v1']

df1 = df1.withColumn("get_v2", F.lit(None))
df1 = df1.withColumn("match_level", F.lit(None))

for col in id_cols:
    new_df1 = df1.join(
        df2.select(col, "v2"),
        on=(df1[col] == df2[col]) & df1['get_v2'].isNull(),
        how='left'
    )
    new_df1 = new_df1.withColumn(
        "get_v2",
        F.coalesce(df1.get_v2, df2.v2)
    ).drop(df2[col]).drop(df2.v2)
    new_df1 = new_df1.withColumn(
        "match_level",
        F.when(F.col("get_v2").isNotNull(),
               F.coalesce(F.col("match_level"), F.lit(col)))
    )
    df1 = new_df1

df1.show()
df1.show()
+---+---+---+------+------+-----------+
|id1|id2|id3| v1|get_v2|match_level|
+---+---+---+------+------+-----------+
| f| 4| 1|value6| 3| id3|
| d| 4| 1|value4| 3| id3|
| c| 2| 1|value3| 4| id2|
| c| 2| 1|value3| 4| id2|
| e| 2| 1|value5| 4| id2|
| e| 2| 1|value5| 4| id2|
| b| 1| 1|value2| 1| id1|
| a| 1| 1|value1| 1| id1|
+---+---+---+------+------+-----------+
This results in N joins, where N is the number of ID columns you have.
EDIT: added the match_level!
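The semantics of that loop, stripped of Spark, is "take v2 from the first ID column that finds a match". A plain-Python sketch of that first-match-wins idea (illustration only; `pick_v2` is a hypothetical helper name, and it uses the question's sample data rather than the extended data with id3 shown in the output above):

```python
def pick_v2(rows1, rows2, id_cols):
    """For each left row, take v2 from the first id column that finds a
    match in rows2, and record that column as the match level."""
    out = []
    for row in rows1:
        get_v2 = match_level = None
        for col in id_cols:
            if get_v2 is not None:
                break  # a higher-priority column already matched
            for r2 in rows2:
                if row[col] == r2[col]:
                    get_v2, match_level = r2['v2'], col
                    break  # first matching right row wins
        out.append({**row, 'get_v2': get_v2, 'match_level': match_level})
    return out

df1_rows = [
    {'id1': 'a', 'id2': '1', 'v1': 'value1'},
    {'id1': 'b', 'id2': '1', 'v1': 'value2'},
    {'id1': 'c', 'id2': '2', 'v1': 'value3'},
    {'id1': 'd', 'id2': '4', 'v1': 'value4'},
    {'id1': 'e', 'id2': '2', 'v1': 'value5'},
    {'id1': 'f', 'id2': '4', 'v1': 'value6'},
]
df2_rows = [
    {'id1': 'a', 'id2': '1', 'v2': 1},
    {'id1': 'b', 'id2': '1', 'v2': 1},
    {'id1': 'y', 'id2': '2', 'v2': 4},
    {'id1': 'z', 'id2': '2', 'v2': 4},
]

result = pick_v2(df1_rows, df2_rows, ['id1', 'id2'])
print([(r['id1'], r['get_v2'], r['match_level']) for r in result])
# [('a', 1, 'id1'), ('b', 1, 'id1'), ('c', 4, 'id2'),
#  ('d', None, None), ('e', 4, 'id2'), ('f', None, None)]
```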