Iteratively filtering a Spark DataFrame with a dictionary of lists

Date: 2021-04-30 08:17:59

Tags: python apache-spark pyspark apache-spark-sql

I have a dictionary that looks like this: a_dict = {"E1":["a",10,20,"red"], "E2":["b",7,14,"green"], "E3":["c",40,50,"blue"]}, but much longer, and I want to filter a Spark DataFrame against each of these lists at once. Here is an example DataFrame:

+----+-----+------+
|User|value| color|
+----+-----+------+
|   a|   12|   red|
|   a|   21|   red|
|   b|    8| green|
|   b|   13| green|
|   c|   41|  blue|
|   b|   72|   red|
|   c|   52|  blue|
|   a|   13|yellow|
+----+-----+------+

What I am doing at the moment is:

for key, value in a_dict.items():
    df = df.filter((df.user == value[0])
                   & (df.value > value[1])
                   & (df.value < value[2])
                   & (df.color == value[3]))

The dummy df output should look like this:

+----+-----+------+
|User|value| color|
+----+-----+------+
|   a|   12|   red|
|   b|    8| green|
|   b|   13| green|
|   c|   41|  blue|
+----+-----+------+

I was wondering if there is a faster way to do this without using a for loop and reassigning the DataFrame on every iteration.

1 answer:

Answer 0: (score: 1)

You can create a DataFrame from the dictionary values and use a semi-join to filter the original DataFrame:

a_dict = {"E1":["a",10,20,"red"],"E2":["b", 7, 14,"green"],"E3":["c",40,50,"blue"]}

# Build a lookup DataFrame with one row per filtering rule
df2 = spark.createDataFrame(a_dict.values(), ['user', 'value1', 'value2', 'color'])

# A left semi join keeps only the rows of df that match at least one rule
result = df.join(df2, 
    (df['user'] == df2['user']) & 
    (df['color'] == df2['color']) & 
    (df['value'].between(df2['value1'], df2['value2'])),
    'left_semi'
)

result.show()
+----+-----+-----+
|User|value|color|
+----+-----+-----+
|   c|   41| blue|
|   b|    8|green|
|   b|   13|green|
|   a|   12|  red|
+----+-----+-----+
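
As a side note, another loop-free option is to fold all the conditions into a single OR-combined filter expression with functools.reduce. This is just a sketch that mirrors the strict > and < bounds from the question's loop; for a very long dictionary, the semi-join above usually scales better because the combined expression can grow large.

from functools import reduce
from pyspark.sql import functions as F

# One boolean expression per dictionary entry, mirroring the loop's conditions
conditions = [
    (F.col('user') == user)
    & (F.col('value') > low) & (F.col('value') < high)
    & (F.col('color') == color)
    for user, low, high, color in a_dict.values()
]

# OR the expressions together and filter df in a single pass
result = df.filter(reduce(lambda a, b: a | b, conditions))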