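I have a sample table like the one below (the real data has 1 million such rows), and I need to select rows to add to a new DataFrame based on the following conditions:
I must select the top 1000 students who attended the most classes.
I must also select the top 1000 students who attended classes 1, 2, 3, and 4 more than other classes.
So in my example, I need to store all rows for students 123 and 678 in another DataFrame.
I can't work out the right logic.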
Answer 0 (score: 0)
Here is one way to solve your problem. Let me know if it helps.
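import pyspark.sql.functions as F
from pyspark.sql import Window

# df is your DataFrame with the columns id, class and check_in.
# Count the distinct classes each student actually checked in to.
attended_more_classes = df.filter(
    F.col("check_in") == "y"
).groupby(
    "id"
).agg(
    F.countDistinct(F.col("class")).alias("class_count")
)

# Per-student window; the second part uses it to find each student's
# most-attended class.
win = Window.partitionBy("id").orderBy(F.col("class_count").desc())

# After the groupby there is one row per id, so ranking inside a per-id
# partition would give every student rank 1. Rank over a single unpartitioned
# window instead. Spark moves all rows into one partition for this, which is
# acceptable here because the data is already down to one row per student.
rank_win = Window.orderBy(F.col("class_count").desc())

attended_more_classes = attended_more_classes.withColumn(
    "rank",
    F.rank().over(rank_win)
).withColumn(
    "attended_more_class",
    # "Y" for the top 1000; all other rows stay null.
    # F.rank() gives ties the same rank, so ties can push the count past 1000.
    F.when(
        F.col("rank") <= 1000,
        F.lit("Y")
    )
)

# result of first part
attended_more_classes.show()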
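
To sanity-check the first part on a small sample, the same logic can be sketched in plain Python. This is only an illustration under assumptions: the rows are made-up toy data in the shape (id, class, check_in), and the helper name `top_students_by_distinct_classes` is hypothetical. It also shows the last step the question asks for, keeping every row of the selected students.

```python
from collections import defaultdict

# Hypothetical toy rows in the shape (id, class, check_in); not the real data.
rows = [
    (123, 1, "y"), (123, 2, "y"), (123, 3, "y"),
    (678, 1, "y"), (678, 2, "y"),
    (999, 1, "y"), (999, 1, "y"), (999, 2, "n"),
]


def top_students_by_distinct_classes(rows, top_n):
    """Return the ids of the top_n students by distinct classes checked in to."""
    classes_per_student = defaultdict(set)
    for student_id, class_no, check_in in rows:
        if check_in == "y":
            classes_per_student[student_id].add(class_no)
    # Sort by distinct-class count, descending; break ties by id so the
    # result is deterministic.
    ranked = sorted(
        classes_per_student,
        key=lambda s: (-len(classes_per_student[s]), s),
    )
    return ranked[:top_n]


top_ids = top_students_by_distinct_classes(rows, top_n=2)

# Keep every original row that belongs to a selected student.
top_id_set = set(top_ids)
selected_rows = [row for row in rows if row[0] in top_id_set]

print(top_ids)
print(len(selected_rows))
```

On the Spark side, the equivalent of the last step would be a join of `df` against the ids flagged "Y".
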
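# answer start for second question
# Unpartitioned window for the final ranking. Partitioning by ("id", "class")
# would leave one row per partition, so every row would get rank 1.
win2 = Window.orderBy(F.col("class_count").desc())

# Students with at least one row in classes 1-4.
filtered_students = df.filter(F.col("class").isin(1, 2, 3, 4)).select("id").distinct()

# Check-ins per (student, class), plus each student's most-attended class.
aggregated_data2 = df.filter(
    F.col("check_in") == "y"
).groupby(
    "id",
    "class"
).agg(
    F.count(F.col("check_in")).alias("class_count")
).withColumn(
    "max_class",
    # win partitions by id and orders by class_count descending, so the first
    # row is the student's most-attended class (ties are broken arbitrarily).
    F.first(F.col("class")).over(win)
)

# Restrict to students who appear in classes 1-4.
attend_more_class2 = aggregated_data2.join(
    filtered_students,
    on="id",
    how="inner"
)

# Keep one row per student (the row for their most-attended class), but only
# when that class is one of 1-4, then rank those students by its check-in count.
attend_more_class23 = attend_more_class2.filter(
    F.col("max_class").isin(1, 2, 3, 4) & (F.col("class") == F.col("max_class"))
).withColumn(
    "rank",
    F.rank().over(win2)
).withColumn(
    "attended_more_class",
    # "Y" for the top 1000; all other rows stay null.
    F.when(
        F.col("rank") <= 1000,
        F.lit("Y")
    )
)

# answer of second part
attend_more_class23.show()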
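
The second part reads "attended classes 1, 2, 3, 4 more than other classes" as "the student's most-attended class is one of 1 to 4". That is one interpretation of the question, not the only possible one. A plain-Python sketch of that reading, useful for checking a small sample by hand, could look like this. The toy rows and the helper name `top_students_favouring` are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical toy rows in the shape (id, class, check_in); not the real data.
rows = [
    (123, 1, "y"), (123, 1, "y"), (123, 1, "y"), (123, 7, "y"),
    (678, 3, "y"), (678, 3, "y"), (678, 9, "y"),
    (555, 8, "y"), (555, 8, "y"), (555, 2, "y"),
]
TARGET_CLASSES = {1, 2, 3, 4}


def top_students_favouring(rows, target_classes, top_n):
    """Return (id, max_class, count) for students whose top class is a target."""
    # Check-in counts per student and class.
    counts = defaultdict(Counter)
    for student_id, class_no, check_in in rows:
        if check_in == "y":
            counts[student_id][class_no] += 1

    candidates = []
    for student_id, per_class in counts.items():
        # Most-attended class; ties go to the lower class number.
        max_class, max_count = min(
            per_class.items(), key=lambda kv: (-kv[1], kv[0])
        )
        if max_class in target_classes:
            candidates.append((student_id, max_class, max_count))

    # Highest check-in count first; ties broken by id for determinism.
    candidates.sort(key=lambda c: (-c[2], c[0]))
    return candidates[:top_n]


result = top_students_favouring(rows, TARGET_CLASSES, top_n=1000)
print(result)
```

Student 555 is dropped here because their most-attended class (8) is outside 1 to 4, even though they did attend class 2 once.
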