I want to rank a group of scored students so that, if I take the top N students from the list, I am guaranteed to get at least a certain fraction of a given category.
So if we have this dataframe as input:
+----------+------+-----+
|STUDENT_ID| TYPE|SCORE|
+----------+------+-----+
| A|female|100.0|
| B|female| 99.0|
| C|female| 88.0|
| D|female| 77.0|
| E|female| 66.0|
| F|female| 55.0|
| G|female| 44.0|
| H|female| 33.0|
| I| male| 22.0|
| J| male| 11.0|
+----------+------+-----+
Our goal is that, at any point, 0.2 (20%) of the population taken so far is male, so I would rank the students like this:
+----------+------+-----+
|STUDENT_ID| TYPE|SCORE|
+----------+------+-----+
| I| male| 22.0|
| A|female|100.0|
| B|female| 99.0|
| C|female| 88.0|
| D|female| 77.0|
| J| male| 11.0|
| E|female| 66.0|
| F|female| 55.0|
| G|female| 44.0|
| H|female| 33.0|
+----------+------+-----+
Now if I take the top 1, 2, 3, 4, 5 ... 10 students from this ranking, I am guaranteed to hit the 0.2 male fraction while everything still stays ordered best to worst.
Even though my females get bumped down slightly, I still want them to stay in best-to-worst order among themselves.
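Put another way (this is just a small plain-Python illustration of the guarantee, not part of my Spark job): every prefix of length n taken from the top of the list should contain at least ceil(0.2 * n) males, which the desired ordering above satisfies:

import math

def prefix_quota_holds(types, fraction):
    # True if every prefix of length n contains at least ceil(fraction * n)
    # rows of the target type ('male' here).
    males_seen = 0
    for n, t in enumerate(types, start=1):
        males_seen += (t == 'male')
        if males_seen < math.ceil(fraction * n):
            return False
    return True

# The desired ordering from the table above: I, A, B, C, D, J, E, F, G, H
desired = ['male', 'female', 'female', 'female', 'female',
           'male', 'female', 'female', 'female', 'female']
print(prefix_quota_holds(desired, 0.2))  # True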
Here are some more examples.
Input:
+----------+------+-----+
|STUDENT_ID| TYPE|SCORE|
+----------+------+-----+
| A|female|100.0|
| B|female| 99.0|
| C|female| 88.0|
| D|female| 77.0|
| E|female| 66.0|
| F|female| 55.0|
| G|female| 44.0|
| H|female| 33.0|
| I| male| 22.0|
| J| male| 11.0|
+----------+------+-----+
100% of the output should be male, so all the males move to the top:
+----------+------+-----+
|STUDENT_ID| TYPE|SCORE|
+----------+------+-----+
| I| male| 22.0|
| J| male| 11.0|
| A|female|100.0|
| B|female| 99.0|
| C|female| 88.0|
| D|female| 77.0|
| E|female| 66.0|
| F|female| 55.0|
| G|female| 44.0|
| H|female| 33.0|
+----------+------+-----+
Input:
+----------+------+-----+
|STUDENT_ID| TYPE|SCORE|
+----------+------+-----+
|         A|  male|100.0|
| B|female| 99.0|
| C|female| 88.0|
| D|female| 77.0|
| E|female| 66.0|
| F|female| 55.0|
| G|female| 44.0|
| H|female| 33.0|
| I| male| 22.0|
| J| male| 11.0|
+----------+------+-----+
20% of the output should be male, but there is already one male at the top, so we only need to move one more:
+----------+------+-----+
|STUDENT_ID| TYPE|SCORE|
+----------+------+-----+
|         A|  male|100.0|
| B|female| 99.0|
| C|female| 88.0|
| D|female| 77.0|
| E|female| 66.0|
| I| male| 22.0|
| F|female| 55.0|
| G|female| 44.0|
| H|female| 33.0|
| J| male| 11.0|
+----------+------+-----+
Here is code that works for some cases but not others.
It takes the input dataframe, ranks it overall, ranks it within each TYPE, and then adjusts the rank according to the desired ratio.
from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType, StringType
from pyspark.sql.window import Window
import pyspark.sql.functions as f

temp_struct = StructType([
    StructField('STUDENT_ID', StringType()),
    StructField('TYPE', StringType()),
    StructField('SCORE', DoubleType())
])

temp_df = spark.createDataFrame([
    ['A', 'female', 100.0],
    ['B', 'female', 99.0],
    ['C', 'female', 88.0],
    ['D', 'female', 77.0],
    ['E', 'female', 66.0],
    ['F', 'female', 55.0],
    ['G', 'female', 44.0],
    ['H', 'female', 33.0],
    ['I', 'male', 22.0],
    ['J', 'male', 11.0]
], temp_struct)

print('Initial DF')
temp_df.show()

# Overall rank, best score first
window_by_score_desc = Window.orderBy(f.col('SCORE').desc())
temp_df = temp_df.withColumn('RANK', f.row_number().over(window_by_score_desc)) \
                 .orderBy(f.col('RANK').asc())
print('With RANK DF')
temp_df.show()

# Rank within each TYPE, following the overall rank
window_by_type_rank = Window.partitionBy(f.col('TYPE')).orderBy(f.col('RANK').asc())
temp_df = temp_df.withColumn('TYPE_RANK', f.row_number().over(window_by_type_rank)) \
                 .orderBy(f.col('RANK').asc())
print('With TYPE RANK DF')
temp_df.show()


def weight_for_type_and_ratio(input_df, student_type, student_ratio):
    # The i-th student of the chosen type gets ADJUSTED_RANK
    # (i - 1) * (section_size - 1) + 0.5, spacing them roughly one per section
    # of size 1 / ratio in the merged ordering; everyone else keeps their RANK.
    section_size = float(1 / student_ratio)
    return input_df.withColumn(
        'ADJUSTED_RANK',
        f.when(f.col('TYPE') == student_type,
               (f.col('TYPE_RANK') - 1) * (section_size - 1) + .5)
         .otherwise(f.col('RANK')))


print('FINAL WITH ADJUSTED RANK DF')
weight_for_type_and_ratio(temp_df, 'male', .2).orderBy(f.col('ADJUSTED_RANK').asc()).show()
This code works in some cases... Input:
+----------+------+-----+
|STUDENT_ID| TYPE|SCORE|
+----------+------+-----+
| A|female|100.0|
| B|female| 99.0|
| C|female| 88.0|
| D|female| 77.0|
| E|female| 66.0|
| F|female| 55.0|
| G|female| 44.0|
| H|female| 33.0|
| I| male| 22.0|
| J| male| 11.0|
+----------+------+-----+
which gives the correctly adjusted, ranked output:
+----------+------+-----+----+---------+-------------+
|STUDENT_ID| TYPE|SCORE|RANK|TYPE_RANK|ADJUSTED_RANK|
+----------+------+-----+----+---------+-------------+
| I| male| 22.0| 9| 1| 0.5|
| A|female|100.0| 1| 1| 1.0|
| B|female| 99.0| 2| 2| 2.0|
| C|female| 88.0| 3| 3| 3.0|
| D|female| 77.0| 4| 4| 4.0|
| J| male| 11.0| 10| 2| 4.5|
| E|female| 66.0| 5| 5| 5.0|
| F|female| 55.0| 6| 6| 6.0|
| G|female| 44.0| 7| 7| 7.0|
| H|female| 33.0| 8| 8| 8.0|
+----------+------+-----+----+---------+-------------+
But it does not work in other cases, specifically when some records are already in the right place and do not need to be adjusted.
Input DF (Initial DF):
+----------+------+-----+
|STUDENT_ID| TYPE|SCORE|
+----------+------+-----+
| A| male|100.0|
| B|female| 99.0|
| C|female| 88.0|
| D|female| 77.0|
| E|female| 66.0|
| F|female| 55.0|
| G|female| 44.0|
| H|female| 33.0|
| I| male| 22.0|
| J| male| 11.0|
+----------+------+-----+
which gives the wrong output:
+----------+------+-----+----+---------+-------------+
|STUDENT_ID| TYPE|SCORE|RANK|TYPE_RANK|ADJUSTED_RANK|
+----------+------+-----+----+---------+-------------+
| A| male|100.0| 1| 1| 0.5|
| B|female| 99.0| 2| 1| 2.0|
| C|female| 88.0| 3| 2| 3.0|
| D|female| 77.0| 4| 3| 4.0|
| I| male| 22.0| 9| 2| 4.5|
| E|female| 66.0| 5| 4| 5.0|
| F|female| 55.0| 6| 5| 6.0|
| G|female| 44.0| 7| 6| 7.0|
| H|female| 33.0| 8| 7| 8.0|
| J| male| 11.0| 10| 3| 8.5|
+----------+------+-----+----+---------+-------------+
where male I's adjusted rank is too high, because male A already covers the quota at the top.
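For reference, one way to state the placement rule that all of the expected outputs above follow is a greedy merge: walk the output positions from 1 to N, take the next male whenever the running quota ceil(0.2 * position) would otherwise be violated (or when he simply outscores the next female), and otherwise take the best remaining female. A small driver-side Python sketch of that rule, just to pin down the behaviour I am after (the real data lives in Spark dataframes):

import math

def interleave_by_quota(males, females, fraction):
    # males / females: lists of (student_id, score), each sorted by score descending.
    # Returns an ordering whose every prefix of length n contains at least
    # ceil(fraction * n) males, and which is otherwise best-score-first.
    out, mi, fi, males_taken = [], 0, 0, 0
    for pos in range(1, len(males) + len(females) + 1):
        males_left = mi < len(males)
        females_left = fi < len(females)
        must_take_male = males_left and math.ceil(fraction * pos) > males_taken
        if must_take_male or not females_left or (
                males_left and males[mi][1] > females[fi][1]):
            out.append(males[mi])
            mi += 1
            males_taken += 1
        else:
            out.append(females[fi])
            fi += 1
    return out

males = [('A', 100.0), ('I', 22.0), ('J', 11.0)]
females = [('B', 99.0), ('C', 88.0), ('D', 77.0), ('E', 66.0),
           ('F', 55.0), ('G', 44.0), ('H', 33.0)]
print([s for s, _ in interleave_by_quota(males, females, 0.2)])
# ['A', 'B', 'C', 'D', 'E', 'I', 'F', 'G', 'H', 'J'] -- matches the expected output above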
Any thoughts on a different approach to solving this? It shouldn't need much of a code change, maybe just a different way of thinking about it.
Answer 0 (score: 0)
If you just want to make sure that when you take N students you get a certain portion of a certain category, then I think a cleaner solution can be found using limit. Take a look at the code below:
from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType, StringType
import pyspark.sql.functions as F

temp_struct = StructType([
    StructField('STUDENT_ID', StringType()),
    StructField('TYPE', StringType()),
    StructField('SCORE', DoubleType())
])

temp_df = spark.createDataFrame([
    ['A', 'female', 100.0],
    ['B', 'female', 99.0],
    ['C', 'female', 88.0],
    ['D', 'female', 77.0],
    ['E', 'female', 66.0],
    ['F', 'female', 55.0],
    ['G', 'female', 44.0],
    ['H', 'female', 33.0],
    ['I', 'male', 22.0],
    ['J', 'male', 11.0]
], temp_struct)

# Total number of students you want to get
total = 5
# Portion of the category
fractionMale = 0.2

# Simply select and limit the rows for each category, then union them into a single dataframe
(temp_df.filter(temp_df.TYPE == 'male').limit(int(total * fractionMale))
    .union(temp_df.filter(temp_df.TYPE == 'female').limit(int(total * (1 - fractionMale))))
    .show())
Output:
+----------+------+-----+
|STUDENT_ID| TYPE|SCORE|
+----------+------+-----+
| I| male| 22.0|
| A|female|100.0|
| B|female| 99.0|
| C|female| 88.0|
| D|female| 77.0|
+----------+------+-----+
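One caveat: limit() without an orderBy does not guarantee which rows are kept, so if the dataframe is not already sorted by score you would probably want to order each group first, along these lines (a variant sketch, not verbatim from the code above):

males = (temp_df.filter(temp_df.TYPE == 'male')
         .orderBy(F.col('SCORE').desc())
         .limit(int(total * fractionMale)))
females = (temp_df.filter(temp_df.TYPE == 'female')
           .orderBy(F.col('SCORE').desc())
           .limit(int(total * (1 - fractionMale))))
males.union(females).show()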
Unfortunately we cannot use sampleBy here, because Spark uses Bernoulli sampling and you cannot be sure that you will get the expected total for each category. That is, the following will not always return 5 rows with the expected fractions.
total = 5
fractionMale = 0.2
countMale = temp_df.filter(temp_df.TYPE == 'male').count()
countFemale = temp_df.count() - countMale
sampleFractionMale = (total * fractionMale)/countMale
sampleFractionFemale = (total * (1 - fractionMale))/countFemale
temp_df.sampleBy("TYPE", fractions={'male': sampleFractionMale, 'female': sampleFractionFemale}).show()
Output:
+----------+------+-----+
|STUDENT_ID| TYPE|SCORE|
+----------+------+-----+
| A|female|100.0|
| B|female| 99.0|
| C|female| 88.0|
| D|female| 77.0|
| F|female| 55.0|
| I| male| 22.0|
+----------+------+-----+
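If exact per-type counts are required, one deterministic alternative (a sketch along the same lines, with the quotas hard-coded here for total = 5 and fractionMale = 0.2) is to rank within each TYPE using row_number over a window and keep only the top rows of each type:

from pyspark.sql.window import Window

quotas = {'male': 1, 'female': 4}  # assumed quotas for total = 5, fractionMale = 0.2
w = Window.partitionBy('TYPE').orderBy(F.col('SCORE').desc())
(temp_df
 .withColumn('TYPE_RANK', F.row_number().over(w))
 .filter(((F.col('TYPE') == 'male') & (F.col('TYPE_RANK') <= quotas['male'])) |
         ((F.col('TYPE') == 'female') & (F.col('TYPE_RANK') <= quotas['female'])))
 .drop('TYPE_RANK')
 .show())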