PySpark DataFrame conditional groupBy

Time: 2017-10-18 17:27:53

Tags: dataframe pyspark

from pyspark.sql import Row, functions as F
row = Row("UK_1","UK_2","Date","Cat")
agg = ''     # case 1: group by no column
agg = 'Cat'  # case 2: group by the Cat column
tdf = (sc.parallelize
    ([
        row(1,1,'12/10/2016',"A"),
        row(1,2,None,'A'),
        row(2,1,'14/10/2016','B'),
        row(3,3,'!~2016/2/276','B'),
        row(None,1,'26/09/2016','A'),
        row(1,1,'12/10/2016',"A"),
        row(1,2,None,'A'),
        row(2,1,'14/10/2016','B'),
        row(None,None,'!~2016/2/276','B'),
        row(None,1,'26/09/2016','A')
        ]).toDF())
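# Intended behavior (pseudocode, `iff` is not defined): group by `agg` when it is non-empty, otherwise by no column.
tdf.groupBy(  iff(len(agg.strip()) > 0 , F.col(agg),  )).agg(F.count('*').alias('row_count')).show()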

Is there a way to group a DataFrame by a column, or by no column at all, depending on some condition?

1 Answer:

Answer 0 (score: 1)

If the condition is not met, you can pass an empty list to groupBy, which groups by no column and so aggregates over the whole DataFrame:

tdf.groupBy(agg if len(agg) > 0 else []).agg(...)
agg = ''
tdf.groupBy(agg if len(agg) > 0 else []).agg(F.count('*').alias('row_count')).show()
+---------+
|row_count|
+---------+
|       10|
+---------+

agg = 'Cat'
tdf.groupBy(agg if len(agg) > 0 else []).agg(F.count('*').alias('row_count')).show()
+---+---------+
|Cat|row_count|
+---+---------+
|  B|        4|
|  A|        6|
+---+---------+
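
The answer checks len(agg) > 0, while the question checks len(agg.strip()) > 0, so a whitespace-only string such as ' ' would still be passed to groupBy as a column name. One way to cover that case is a small helper that turns the setting into a list of column names. This is only a sketch: group_cols is a hypothetical name, not part of PySpark, and accepting None or a list of names is an assumption beyond what the question asks for.

```python
def group_cols(agg):
    """Return the column names to group by; an empty list means 'no column'.

    Hypothetical helper, not a PySpark API. Accepts None, a single
    column name, or an iterable of column names.
    """
    if agg is None:
        return []
    if isinstance(agg, str):
        agg = [agg]
    # Keep only names that are non-empty after stripping whitespace,
    # mirroring the len(agg.strip()) > 0 check from the question.
    return [name for name in agg if name and name.strip()]


print(group_cols(''))     # empty string: no grouping column
print(group_cols('  '))   # whitespace only: no grouping column
print(group_cols('Cat'))  # a real column name is kept
```

With the tdf from the question, tdf.groupBy(group_cols(agg)).agg(F.count('*').alias('row_count')).show() should then behave like the two runs above, since groupBy accepts a list of column names, including an empty one.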