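I have a dictionary that looks like this: a_dict = {"E1": ["a", 10, 20, "red"], "E2": ["b", 7, 14, "green"], "E3": ["c", 40, 50, "blue"]}
but much longer, and I want to filter a Spark DataFrame by all of these lists at the same time. Here is an example DataFrame: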
+----+-----+------+
|User|value| color|
+----+-----+------+
|   a|   12|   red|
|   a|   21|   red|
|   b|    8| green|
|   b|   13| green|
|   c|   41|  blue|
|   b|   72|   red|
|   c|   52|  blue|
|   a|   13|yellow|
+----+-----+------+
What I am doing right now is:
for key, value in a_dict.items():
    df = df.filter((df.user == value[0])
                   & (df.value > value[1])
                   & (df.value < value[2])
                   & (df.color == value[3]))
The output for the dummy df should look like this:
+----+-----+------+
|User|value| color|
+----+-----+------+
|   a|   12|   red|
|   b|    8| green|
|   b|   13| green|
|   c|   41|  blue|
+----+-----+------+
I would like to know whether there is a faster way that avoids the for loop and reassigning the DataFrame on every iteration.
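A caveat about the loop in the question: because `df` is reassigned on every pass, each new filter is applied to the already-filtered result. The three rules are therefore combined with AND, and no row can satisfy all of them. The expected output instead keeps rows that satisfy at least one rule. The sketch below models the sample rows in plain Python to show the difference. No Spark is involved, and the helper `matches` and the variable names are illustrative, not part of the original code.

```python
# Plain-Python model of the sample data; no Spark is involved.
rows = [
    ("a", 12, "red"), ("a", 21, "red"), ("b", 8, "green"), ("b", 13, "green"),
    ("c", 41, "blue"), ("b", 72, "red"), ("c", 52, "blue"), ("a", 13, "yellow"),
]
a_dict = {
    "E1": ["a", 10, 20, "red"],
    "E2": ["b", 7, 14, "green"],
    "E3": ["c", 40, 50, "blue"],
}


def matches(row, rule):
    """True if the row has the rule's user and color and a value strictly inside its bounds."""
    user, value, color = row
    rule_user, low, high, rule_color = rule
    return user == rule_user and low < value < high and color == rule_color


# Chained filtering, as in the loop: a row must satisfy every rule.
chained = rows
for rule in a_dict.values():
    chained = [row for row in chained if matches(row, rule)]

# Any-match filtering: a row must satisfy at least one rule.
any_match = [
    row for row in rows
    if any(matches(row, rule) for rule in a_dict.values())
]

print(chained)    # []
print(any_match)  # [('a', 12, 'red'), ('b', 8, 'green'), ('b', 13, 'green'), ('c', 41, 'blue')]
```

The any-match result is the same set of rows as the expected output, so whatever replaces the loop has to combine the rules with OR, or use a join that does the equivalent.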
Answer 0 (score: 1)
You can create a DataFrame from the dictionary values and do a semi join to filter the original DataFrame:
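# Assumes `spark` is an active SparkSession and `df` is the DataFrame from the question.
a_dict = {"E1": ["a", 10, 20, "red"], "E2": ["b", 7, 14, "green"], "E3": ["c", 40, 50, "blue"]}
# One row per dictionary entry: user, lower bound, upper bound, color.
df2 = spark.createDataFrame(list(a_dict.values()), ['user', 'value1', 'value2', 'color'])
# A left semi join keeps the rows of df that match at least one row of df2
# and adds no columns from df2.
result = df.join(df2,
    (df['user'] == df2['user']) &
    (df['color'] == df2['color']) &
    (df['value'].between(df2['value1'], df2['value2'])),  # between() includes both bounds
    'left_semi'
)
result.show()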
+----+-----+-----+
|User|value|color|
+----+-----+-----+
| c| 41| blue|
| b| 8|green|
| b| 13|green|
| a| 12| red|
+----+-----+-----+
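Two things to keep in mind. First, `between` includes both bounds, whereas the loop in the question uses strict `>` and `<`. The sample data has no value sitting exactly on a bound, so the output is the same here. Second, if the dictionary is small enough, a join can be avoided altogether by folding all rules into one predicate and calling `filter` once. The sketch below builds that predicate as a SQL expression string, which `DataFrame.filter` accepts. The helper `rule_to_sql` is a hypothetical name, and the code assumes the string values contain no single quotes, since it does no escaping.

```python
a_dict = {
    "E1": ["a", 10, 20, "red"],
    "E2": ["b", 7, 14, "green"],
    "E3": ["c", 40, 50, "blue"],
}


def rule_to_sql(rule):
    """Turn one [user, low, high, color] rule into a parenthesized SQL condition.

    Column names are backtick-quoted because `user` can clash with a SQL keyword.
    Assumption: user and color contain no single quotes (no escaping is done).
    """
    user, low, high, color = rule
    return (
        f"(`user` = '{user}' AND `value` > {low} "
        f"AND `value` < {high} AND `color` = '{color}')"
    )


# A row is kept if it satisfies at least one rule, so the rules are joined with OR.
predicate = " OR ".join(rule_to_sql(rule) for rule in a_dict.values())
print(predicate)

# With a DataFrame `df` shaped like the one in the question:
# result = df.filter(predicate)
```

This keeps the strict bounds from the question and scans the data once. For a very long dictionary the predicate grows with it, and the semi join above is likely the better fit.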