PySpark: select specific rows with more matching column fields

Date: 2018-09-01 14:41:05

Tags: apache-spark pyspark pyspark-sql

I have a sample table like the one below (I have 1 million such rows), and I need to select rows to add to a new dataframe based on the following conditions:

  1. I have to select the top 1000 students who attended the most classes

  2. The top 1000 students who attended classes 1, 2, 3 and 4 more than the other classes

So in my example, I need to store all the rows for students 123 and 678 in another dataframe.

I cannot work out the right logic for this.

(The sample table was attached as an image in the original post; a sketch of an assumed layout follows below.)
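
Since the image itself is not available, here is a minimal, hypothetical reconstruction of the data, assuming the columns used by the answer below (id, class, check_in); the real table may look different:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data; column names taken from the answer's code,
# values invented purely for illustration.
df = spark.createDataFrame(
    [
        (123, 1, "y"), (123, 2, "y"), (123, 3, "y"), (123, 4, "y"),
        (678, 1, "y"), (678, 2, "y"), (678, 4, "y"),
        (999, 7, "y"), (999, 8, "n"),
    ],
    ["id", "class", "check_in"],
)
df.show()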

1 Answer:

Answer 0: (score: 0)

Here is how you can approach the problem; let me know if it helps.

import pyspark.sql.functions as F
from pyspark.sql import Window

# count how many distinct classes each student has checked in to
attended_more_classes = df.filter(
    F.col("check_in") == "y"
).groupby(
    "id"
).agg(
    F.countDistinct(F.col("class")).alias("class_count")
)

# rank students globally by their class count (no partitioning here,
# otherwise every student would get rank 1 within their own id)
win = Window.orderBy(F.col("class_count").desc())

attended_more_classes = attended_more_classes.withColumn(
    "rank",
    F.rank().over(win)
).withColumn(
    "attended_more_class",
    F.when(
        F.col("rank") <= 1000,
        F.lit("Y")
    )
)

# result of first part
attended_more_classes.show()
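
The dataframe above only flags the top students. If, as the question asks, the goal is to copy all of those students' original rows into a new dataframe, one way to do that (a sketch reusing df and attended_more_classes from above) is to join the flagged ids back to the original table:

# Sketch: keep the original rows of the students flagged with "Y"
top_students = attended_more_classes.filter(
    F.col("attended_more_class") == "Y"
).select("id")

top_student_rows = df.join(top_students, on="id", how="inner")
top_student_rows.show()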

# answer start for second question

# per-student window to find each student's most attended class
win_per_student = Window.partitionBy("id").orderBy(F.col("class_count").desc())
# global window to rank those students by how often they attended that class
win2 = Window.orderBy(F.col("class_count").desc())

# students who attended at least one of classes 1-4
filtered_students = df.filter(F.col("class").isin(1, 2, 3, 4)).select("id").distinct()

# per-student, per-class check-in counts, plus each student's most attended class
aggregated_data2 = df.filter(
    F.col("check_in") == "y"
).groupby(
    "id",
    "class"
).agg(
    F.count(F.col("check_in")).alias("class_count")
).withColumn(
    "max_class",
    F.first(F.col("class")).over(win_per_student)
)

attend_more_class2 = aggregated_data2.join(
    filtered_students,
    on = "id",
    how = "inner"
)

# keep one row per student (their most attended class), restricted to classes 1-4,
# then flag the top 1000 of those students
attend_more_class23 = aggregated_data2.filter(
    (F.col("max_class").isin(1, 2, 3, 4)) &
    (F.col("class") == F.col("max_class"))
).withColumn(
    "rank",
    F.rank().over(win2)
).withColumn(
    "attended_more_class",
    F.when(
        F.col("rank") <= 1000,
        F.lit("Y")
    )
)

# answer of second part
attend_more_class23.show()
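
The same join-back pattern works for the second part if the original rows of those students are needed as well (again only a sketch using the names defined above):

# Sketch: original rows of the students flagged in the second part
top_students2 = attend_more_class23.filter(
    F.col("attended_more_class") == "Y"
).select("id").distinct()

top_student_rows2 = df.join(top_students2, on="id", how="inner")
top_student_rows2.show()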