How to check boolean conditions between columns in a DataFrame

Time: 2019-06-03 10:19:47

Tags: apache-spark pyspark

I have a dataframe in which I want to check conditions between columns:

+---+----+------+---------+------+
| ID|Name|Salary|Operation|Points|
+---+----+------+---------+------+
|  1|   A| 10000|  a AND b|   100|
|  1|   A| 10000|   a OR b|   200|
|  1|   A| 10000|otherwise|     0|
|  2|   B|   200|  a AND b|   100|
|  2|   B|   200|   a OR b|   200|
|  2|   B|   200|otherwise|     0|
|  3|   C|   700|  a AND b|   100|
|  3|   C|   700|   a OR b|   200|
|  3|   C|   700|otherwise|     0|
|  4|   D|  1000|  a AND b|   100|
|  4|   D|  1000|   a OR b|   200|
|  4|   D|  1000|otherwise|     0|
|  5|   E|   650|  a AND b|   100|
|  5|   E|   650|   a OR b|   200|
|  5|   E|   650|otherwise|     0|
+---+----+------+---------+------+

Where:

a='salary==1000'
b='salary>500'

If an operation evaluates to true, its points are awarded, and a new Reward column should be added to the dataframe with one row per name. For example, for the first ID the salary is 10000: condition a checks whether the salary equals 1000 and condition b whether it is greater than 500, so a AND b is false and contributes 0 points, while a OR b is true and awards 200 points (see the sketch after the expected result below). Expected result:

+---+----+------+------+
| ID|Name|Salary|Reward|
+---+----+------+------+
|  1|   A| 10000|   200|
|  2|   B|   200|     0|
|  3|   C|   700|   200|
|  4|   D|  1000|   200|
|  5|   E|   650|   200|
+---+----+------+------+
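
For reference, a minimal sketch of how the string conditions a and b could be written as PySpark column expressions (the Salary column name is taken from the dataframe above):

import pyspark.sql.functions as F

# a: 'salary==1000' and b: 'salary>500' as boolean column expressions
a = F.col('Salary') == 1000
b = F.col('Salary') > 500

# equivalently, the SQL-style strings can be parsed directly:
# a = F.expr('Salary == 1000')
# b = F.expr('Salary > 500')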

1 Answer:

Answer 0: (score: 0)

You can combine a filter expression with a groupBy:

import pyspark.sql.functions as F

l = [
(  1,   'A', 10000,  'a AND b',   100),
(  1,   'A', 10000,   'a OR b',   200),
(  1,   'A', 10000,'otherwise',     0),
(  2,   'B',   200,  'a AND b',   100),
(  2,   'B',   200,   'a OR b',   200),
(  2,   'B',   200,'otherwise',     0),
(  3,   'C',   700,  'a AND b',   100),
(  3,   'C',   700,   'a OR b',   200),
(  3,   'C',   700,'otherwise',     0),
(  4,   'D',  1000,  'a AND b',   100),
(  4,   'D',  1000,   'a OR b',   200),
(  4,   'D',  1000,'otherwise',     0),
(  5,   'E',   650,  'a AND b',   100),
(  5,   'E',   650,   'a OR b',   200),
(  5,   'E',   650,'otherwise',     0)]

columns = ['ID','Name','Salary','Operation','Points']

# spark is an existing SparkSession (e.g. the one provided by the pyspark shell)
df = spark.createDataFrame(l, columns)

# keep only the rows whose Operation actually holds for that Salary,
# then take the highest remaining Points per employee
df.filter(
          (df.Operation.contains('AND')        & (df.Salary == 1000) & (df.Salary > 500))    |
          (df.Operation.contains('OR')         & ((df.Salary == 1000) | (df.Salary > 500)))  |
          df.Operation.contains('otherwise')
          ).groupBy('ID', 'Name', 'Salary').agg(F.max('Points').alias('Rewards')).show()

Output:

+---+----+------+-------+ 
| ID|Name|Salary|Rewards| 
+---+----+------+-------+ 
|  1|   A| 10000|    200| 
|  2|   B|   200|      0| 
|  3|   C|   700|    200| 
|  5|   E|   650|    200| 
|  4|   D|  1000|    200| 
+---+----+------+-------+
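
An equivalent way to get the same result, without matching substrings of the Operation column in a filter, is to flag whether each row's operation holds for its Salary and then take the maximum earned points per employee. A minimal sketch, reusing the df created above (the values 1000 and 500 come from conditions a and b in the question):

import pyspark.sql.functions as F

a = (F.col('Salary') == 1000)        # condition a: salary == 1000
b = (F.col('Salary') > 500)          # condition b: salary > 500

# does this row's Operation hold for this Salary?
satisfied = (
    F.when(F.col('Operation') == 'a AND b', a & b)
     .when(F.col('Operation') == 'a OR b', a | b)
     .otherwise(F.lit(True))         # the 'otherwise' row always applies (0 points)
)

# rows whose operation does not hold contribute 0 points; max picks the reward
(df.withColumn('EarnedPoints', F.when(satisfied, F.col('Points')).otherwise(0))
   .groupBy('ID', 'Name', 'Salary')
   .agg(F.max('EarnedPoints').alias('Reward'))
   .show())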

Also take a look at this similar question and Shan's answer.