I have a dataframe in which I need to check conditions between columns:
+---+----+------+---------+------+
| ID|Name|Salary|Operation|Points|
+---+----+------+---------+------+
| 1| A| 10000| a AND b| 100|
| 1| A| 10000| a OR b| 200|
| 1| A| 10000|otherwise| 0|
| 2| B| 200| a AND b| 100|
| 2| B| 200| a OR b| 200|
| 2| B| 200|otherwise| 0|
| 3| C| 700| a AND b| 100|
| 3| C| 700| a OR b| 200|
| 3| C| 700|otherwise| 0|
| 4| D| 1000| a AND b| 100|
| 4| D| 1000| a OR b| 200|
| 4| D| 1000|otherwise| 0|
| 5| E| 650| a AND b| 100|
| 5| E| 650| a OR b| 200|
| 5| E| 650|otherwise| 0|
+---+----+------+---------+------+
Where:
a = 'salary == 1000'
b = 'salary > 500'
If an operation evaluates to true, its points should be assigned, and a new Reward column should be added to the dataframe per name.
For example, take the first entry with a salary of 10000: condition a (salary equals 1000) is false even though condition b (salary greater than 500) is true, so a AND b is false and that operation gets 0 points.
Result:
+---+----+------+------+
| ID|Name|Salary|Reward|
+---+----+------+------+
| 1| A| 10000| 200|
| 2| B| 200| 0|
| 3| C| 700| 200|
| 4| D| 1000| 200|
| 5| E| 650| 200|
+---+----+------+------+
Answer (score: 0):
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

l = [
    (1, 'A', 10000, 'a AND b', 100),
    (1, 'A', 10000, 'a OR b', 200),
    (1, 'A', 10000, 'otherwise', 0),
    (2, 'B', 200, 'a AND b', 100),
    (2, 'B', 200, 'a OR b', 200),
    (2, 'B', 200, 'otherwise', 0),
    (3, 'C', 700, 'a AND b', 100),
    (3, 'C', 700, 'a OR b', 200),
    (3, 'C', 700, 'otherwise', 0),
    (4, 'D', 1000, 'a AND b', 100),
    (4, 'D', 1000, 'a OR b', 200),
    (4, 'D', 1000, 'otherwise', 0),
    (5, 'E', 650, 'a AND b', 100),
    (5, 'E', 650, 'a OR b', 200),
    (5, 'E', 650, 'otherwise', 0)]
columns = ['ID', 'Name', 'Salary', 'Operation', 'Points']
df = spark.createDataFrame(l, columns)

# Keep only the rows whose Operation actually holds for that salary
# ('otherwise' rows are always kept as a 0-point fallback), then take the
# highest remaining Points value per person as the reward.
df.filter(
    (df.Operation.contains('AND') & (df.Salary == 1000) & (df.Salary > 500)) |
    (df.Operation.contains('OR') & ((df.Salary == 1000) | (df.Salary > 500))) |
    df.Operation.contains('otherwise')
).groupBy('ID', 'Name', 'Salary').agg(F.max('Points').alias('Rewards')).show()
Output:
+---+----+------+-------+
| ID|Name|Salary|Rewards|
+---+----+------+-------+
| 1| A| 10000| 200|
| 2| B| 200| 0|
| 3| C| 700| 200|
| 5| E| 650| 200|
| 4| D| 1000| 200|
+---+----+------+-------+
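For reference, a minimal alternative sketch (not part of the original answer, reusing the df built above): the reward can also be computed directly from the salary with F.when, without relying on the Operation/Points rows. The branches are ordered from the highest reward downwards so the result matches the max-of-satisfied-operations behaviour of the filter/groupBy version.

a = (F.col('Salary') == 1000)
b = (F.col('Salary') > 500)

(df.select('ID', 'Name', 'Salary').distinct()
   .withColumn('Reward',
               F.when(a | b, 200)   # 'a OR b' holds -> 200
                .when(a & b, 100)   # never reached, since a & b implies a | b
                .otherwise(0))      # neither condition holds -> 0
   .show())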
Also take a look at the similar question and Shan's answer there.