Filtering a PySpark DataFrame

Asked: 2016-11-02 17:17:40

Tags: select apache-spark pyspark spark-dataframe apache-spark-2.0

I am trying to select some values from a PySpark DataFrame based on a few rules, and I am getting an exception in PySpark.

from pyspark.sql import functions as F

df.select(
    df.card_key,
    F.when((df.tran_sponsor = 'GAMES') & (df.location_code = '9145'), 'ENTERTAINMENT')
     .when((df.tran_sponsor = 'XYZ') & (df.location_code = '123'), 'eBOOKS')
     .when((df.tran_sponsor = 'XYZ') & (df.l_code.isin(['123', '234', '345', '456', '567', '678', '789', '7878', '67', '456'])), 'FINANCE')
     .otherwise(df.tran_sponsor)
).show()

I get the following exception. Can you offer any suggestions?


File "", line 1
    df.select(df.card_key, F.when((df.tran_sponsor = 'GAMES') & (df.location_code = '9145'), 'ENTERTAINMENT').when((df.tran_sponsor = 'XYZ') & (df.location_code = '123'), 'eBOOKS').when((df.tran_sponsor = 'XYZ') & (df.l_code.isin(['6001', '6002', '6003', '6004', '6005', '6006', '6007', '6008', '6009', '6010', '6011', '6012', '6013', '6014'])), 'FINANCE').otherwise(df.tran_sponsor)).show()
                                                    ^
SyntaxError: invalid syntax

1 Answer:

Answer 0: (score: 2)

Well, I just figured it out: the problem was the assignment operator. Each comparison needs `==`, not `=` :(

df.select(
    df.card_key,
    F.when((df.tran_sponsor == 'GAMES') & (df.location_code == '9145'), 'ENTERTAINMENT')
     .when((df.tran_sponsor == 'XYZ') & (df.location_code == '123'), 'eBOOKS')
     .when((df.tran_sponsor == 'XYZ') & (df.l_code.isin(['123', '234', '345', '456', '567', '678', '789', '7878', '67', '456'])), 'FINANCE')
     .otherwise(df.tran_sponsor)
).show()

This works fine. Thanks to anyone who took the time to look into it.