I have the following dataset:
HID Score Decile_Name Result
2089 62 4th decile 1
897 47 2nd decile 0
85 55 3rd decile 0
8 74 7th decile 1
23 31 1st decile 1
5657 77 8th decile 1
52 85 9th decile 0
781 63 6th decile 0
565 42 1st decile 0
456 62 4th decile 1
12 89 10th decile 1
56 85 9th decile 1
# Create a DataFrame (values taken from the table above, 12 rows per column)
import pandas as pd

df1 = {
    'HID': [2089, 897, 85, 8, 23, 5657, 52, 781, 565, 456, 12, 56],
    'Score': [62, 47, 55, 74, 31, 77, 85, 63, 42, 62, 89, 85],
    'Decile_Name': ['4th decile', '2nd decile', '3rd decile', '7th decile',
                    '1st decile', '8th decile', '9th decile', '6th decile',
                    '1st decile', '4th decile', '10th decile', '9th decile'],
    'Result': [1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1]
}
df1 = pd.DataFrame(df1, columns=['HID', 'Score', 'Decile_Name', 'Result'])
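This captures each student's score in a subject and the decile that the score falls into. It also records whether the student passed or failed (Result).
I want to compute the proportion of rows with Result = 1 within each decile (Result %) and overall (across the whole dataset). Expected output: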
Attribute Level Result % num_of_stu
Score - All Categories 0.5 12 # This captures the values for the whole df(df1).
Score - 1st Decile 0.5 2
Score - 2nd Decile 0 1
Score - 3rd Decile 0 1
...
Score - 9th Decile 0.5 2
Score - 10th Decile 1 1
Can anyone help me with this?
Answer 0: (score: 1)
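A solution for the case where the Result column contains only 0 and 1 values:
First aggregate with agg, then sort the index by the integers pulled out with extract and ordered with argsort, create a new summary DataFrame, and append the aggregated rows to it: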
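# df is the DataFrame from the question (created there as df1)
df1 = df.groupby('Decile_Name').agg({'Result':'mean', 'HID':'size'})
# order the decile labels by their numeric part instead of alphabetically
df1 = df1.iloc[df1.index.str.extract(r'(\d+)', expand=False).astype(int).argsort()]
df2 = pd.DataFrame({'Result': [df['Result'].mean()],
                    'HID': [len(df)]}, index=['All Categories'])
d = {'Result':'Result %','HID':'num_of_stu'}
# DataFrame.append was removed in pandas 2.0; before that, df2.append(df1) did the same
df1 = pd.concat([df2, df1]).rename(columns=d)
print (df1)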
Result % num_of_stu
All Categories 0.583333 12
1st decile 0.500000 2
2nd decile 0.000000 1
3rd decile 0.000000 1
4th decile 1.000000 2
6th decile 0.000000 1
7th decile 1.000000 1
8th decile 1.000000 1
9th decile 0.500000 2
10th decile 1.000000 1
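The ordering step is the least obvious part of this answer, so here is a minimal sketch of it in isolation. The `labels` index is a made-up sample in the alphabetical order that groupby would return; only the extract, astype and argsort calls come from the answer.

```python
import pandas as pd

# Hypothetical sample of decile labels in alphabetical (groupby) order
labels = pd.Index(['10th decile', '1st decile', '2nd decile', '9th decile'])

# Pull the leading digits out of each label and convert them to integers
numbers = labels.str.extract(r'(\d+)', expand=False).astype(int)

# argsort returns the positions that would put those integers in ascending order
order = numbers.argsort()

# Indexing with those positions reorders the labels numerically
sorted_labels = labels[order]
print(list(sorted_labels))
```

Passing the same positions to `iloc`, as the answer does, reorders the rows of the aggregated DataFrame in the same way.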
General solution: create a boolean mask that is True only for the value 1:
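# df is the DataFrame from the question (created there as df1)
df['Result1'] = df['Result'] == 1
df1 = df.groupby('Decile_Name').agg({'Result1':'mean', 'HID':'size'})
# order the decile labels by their numeric part instead of alphabetically
df1 = df1.iloc[df1.index.str.extract(r'(\d+)', expand=False).astype(int).argsort()]
df2 = pd.DataFrame({'Result1': [df['Result1'].mean()],
                    'HID': [len(df)]}, index=['All Categories'])
d = {'Result1':'Result %','HID':'num_of_stu'}
# DataFrame.append was removed in pandas 2.0; before that, df2.append(df1) did the same
df1 = pd.concat([df2, df1]).rename(columns=d)
print (df1)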
Result % num_of_stu
All Categories 0.583333 12
1st decile 0.500000 2
2nd decile 0.000000 1
3rd decile 0.000000 1
4th decile 1.000000 2
6th decile 0.000000 1
7th decile 1.000000 1
8th decile 1.000000 1
9th decile 0.500000 2
10th decile 1.000000 1
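To see why the mask matters, here is a small sketch with made-up data in which Result also takes the value 2. The frame `demo` and the column name `is_one` are illustrative assumptions, not part of the question's data.

```python
import pandas as pd

# Hypothetical data: Result contains a code other than 0 and 1
demo = pd.DataFrame({
    'Decile_Name': ['1st decile', '1st decile', '2nd decile', '2nd decile'],
    'Result': [1, 2, 0, 1],
})

# True where Result equals 1; the mean of booleans is the share of True values
demo['is_one'] = demo['Result'] == 1

# Share of rows with Result == 1 per decile
share = demo.groupby('Decile_Name')['is_one'].mean()

# Plain mean of Result per decile, distorted by the extra code
plain_mean = demo.groupby('Decile_Name')['Result'].mean()

print(share)
print(plain_mean)
```

For the 1st decile the mask gives 0.5, while the plain mean gives 1.5, which is not a proportion at all.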
Answer 1: (score: 0)
#build mean of Results grouped by Decile Name
result_df = df1[['Decile_Name','Result']].groupby(['Decile_Name']).mean()
#build count of Students grouped by Decile Name
students_df = df1[['Decile_Name','HID']].groupby(['Decile_Name']).count()
#merge the two dataframes
merged_df = pd.concat([result_df, students_df], axis=1)
#Add the overall values for all students under the index label "All Students"
merged_df.loc["All Students"] = [df1[['Result']].mean()["Result"], df1[['HID']].count()["HID"]]
#print
print(merged_df)
Result:
Result HID
Decile_Name
10th decile 1.000000 1.0
1st decile 0.500000 2.0
2nd decile 0.000000 1.0
3rd decile 0.000000 1.0
4th decile 1.000000 2.0
6th decile 0.000000 1.0
7th decile 1.000000 1.0
8th decile 1.000000 1.0
9th decile 0.500000 2.0
All Students 0.583333 12.0
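The result above is sorted as text, so "10th decile" comes first, and the count column has turned into floats because a row with a float mean was assigned across both columns. As a rough sketch of one way around both points, assuming pandas 1.1 or later (for the `key` argument of `sort_index`), named aggregation can be combined with a numeric sort key. The column name `result_pct` and the variable names are illustrative choices.

```python
import pandas as pd

# Data from the question
df = pd.DataFrame({
    'HID': [2089, 897, 85, 8, 23, 5657, 52, 781, 565, 456, 12, 56],
    'Score': [62, 47, 55, 74, 31, 77, 85, 63, 42, 62, 89, 85],
    'Decile_Name': ['4th decile', '2nd decile', '3rd decile', '7th decile',
                    '1st decile', '8th decile', '9th decile', '6th decile',
                    '1st decile', '4th decile', '10th decile', '9th decile'],
    'Result': [1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1],
})

# Mean of Result and number of rows per decile, with the output names set directly
per_decile = df.groupby('Decile_Name').agg(
    result_pct=('Result', 'mean'),
    num_of_stu=('HID', 'size'),
)

# Sort the index by the integer embedded in each label rather than as text
per_decile = per_decile.sort_index(
    key=lambda idx: idx.str.extract(r'(\d+)', expand=False).astype(int)
)

# One-row frame for the whole dataset, placed on top of the per-decile rows
overall = pd.DataFrame(
    {'result_pct': [df['Result'].mean()], 'num_of_stu': [len(df)]},
    index=['All Categories'],
)
summary = pd.concat([overall, per_decile])
print(summary)
```

Because the overall row is built as its own DataFrame with an integer count, the `num_of_stu` column stays integer after the concatenation.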