Pyspark - Filter a dataframe based on row values of another dataframe

Time: 2020-05-27 23:31:50

Tags: pyspark apache-spark-sql pyspark-dataframes

I have a master dataframe and a secondary dataframe. I want to go through the secondary dataframe row by row, filter the master dataframe based on the values in each row, run a function on the filtered master dataframe, and save the output.

The output could be saved in a separate dataframe, or in a new column of the secondary dataframe.

import pandas as pd
import pyspark.sql.functions as F

# Master DF
df = pd.DataFrame({"Name": ["Mike", "Bob", "Steve", "Jim", "Dan"], "Age": [22, 44, 66, 22, 66], "Job": ["Doc", "Cashier", "Fireman", "Doc", "Fireman"]})

# Secondary DF
df1 = pd.DataFrame({"Age": [22, 66], "Job": ["Doc", "Fireman"]})

df = spark.createDataFrame(df)
+-----+---+-------+
| Name|Age|    Job|
+-----+---+-------+
| Mike| 22|    Doc|
|  Bob| 44|Cashier|
|Steve| 66|Fireman|
|  Jim| 22|    Doc|
|  Dan| 66|Fireman|
+-----+---+-------+

df1 = spark.createDataFrame(df1)
+---+-------+
|Age|    Job|
+---+-------+
| 22|    Doc|
| 66|Fireman|
+---+-------+
# Filter by values in first row of secondary DF
df_filt = df.filter(
    (F.col("Age") == 22) &                                
    (F.col('Job') == 'Doc')                          
)

# Run the filtered DF through my function
def my_func(df_filt):
    # Collect the Name column into a Python list on the driver, then join with '-'
    my_list = df_filt.select('Name').rdd.flatMap(lambda x: x).collect()
    return '-'.join(my_list)

# Output of function
my_func(df_filt)
'Mike-Jim'

# Filter by values in second row of secondary DF
df_filt = df.filter(
    (F.col("Age") == 66) &                                
    (F.col('Job') == 'Fireman')                          
)

# Output of function
my_func(df_filt)
'Steve-Dan'

# Desired output at the end of the iterations
new_df1 = pd.DataFrame({"Age": [22, 66], "Job": ["Doc", "Fireman"], "Returned_value": ['Mike-Jim', 'Steve-Dan']})
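
Written as an explicit loop, the workflow above might look like the sketch below. It collects df1 to the driver and launches one Spark job per row, so it only suits a small secondary DF:

# Sketch of the manual iteration over the secondary DF's rows
results = []
for r in df1.collect():
    df_filt = df.filter((F.col("Age") == r["Age"]) & (F.col("Job") == r["Job"]))
    results.append(my_func(df_filt))
# results == ['Mike-Jim', 'Steve-Dan']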

Basically, I want to take my Master DF, filter it in some way, run an algorithm on the filtered dataset and get the output for that filter, then move on to the next set of filter values and do the same.

What is the best way to approach this?

1 Answer:

Answer 0 (score: 1):

Try using join, groupBy, and concat_ws/array_join with collect_list, along the lines of the sketch below.
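
A sketch of that approach, using the df and df1 built in the question (note that collect_list does not guarantee the order of names within a group):

import pyspark.sql.functions as F

new_df1 = (
    df.join(df1, on=["Age", "Job"], how="inner")  # keep rows matching some row of df1
      .groupBy("Age", "Job")                      # one group per row of df1
      .agg(F.concat_ws("-", F.collect_list("Name")).alias("Returned_value"))
)
new_df1.show()
# +---+-------+--------------+
# |Age|    Job|Returned_value|
# +---+-------+--------------+
# | 22|    Doc|      Mike-Jim|
# | 66|Fireman|     Steve-Dan|
# +---+-------+--------------+

This avoids collecting anything to the driver: the join does the per-row filtering and the groupBy replaces the explicit loop, so the whole computation runs as a single Spark job.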