I have a DataFrame (mydf) that looks like this:
Index Feature ID Stuff1 Stuff2
1 True 1 23 12
2 True 1 54 12
3 False 0 45 67
4 True 0 38 29
5 False 1 32 24
6 False 1 59 39
7 True 0 37 32
8 False 0 76 65
9 False 1 32 12
10 True 0 23 15
..n True 1 21 99
I am trying to calculate the percentage of True and False values of 'Feature' for each 'ID' (0 or 1). I am looking for two outputs, one per ID:
Feature ID Percent
True 1 20%
False 1 30%
Feature ID Percent
True 0 30%
False 0 20%
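I have made several attempts, but I keep ending up with counts for every column, and then percentages for every column.
Here is my poor attempt: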
percentageID0 = mydf[ mydf['ID']==0 ].set_index(['Feature']).count()
percentageID1 = mydf[ mydf['ID']==1 ].set_index(['Feature']).count()
fullcount = (mydf.groupby(['ID']).count()).sum()
print (percentageID0/fullcount) * 100
print (percentageID1/fullcount) * 100
I think I am mixing up the groupby / index formats.
Answer 0: (score: 6)
Perhaps something like this:
In [73]:
print(pd.DataFrame({'Percentage': df.groupby(['ID', 'Feature']).size() / len(df)}))
Percentage
ID Feature
0 False 0.2
True 0.3
1 False 0.3
True 0.2
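For readers who want to run this end to end, here is a minimal self-contained sketch of the same idea. It assumes the data is only the first ten rows of the question's table, and the names `share` and `result` are chosen for illustration.

```python
import pandas as pd

# Sample data copied from the first ten rows of the question's table.
mydf = pd.DataFrame({
    "Feature": [True, True, False, True, False, False, True, False, False, True],
    "ID": [1, 1, 0, 0, 1, 1, 0, 0, 1, 0],
})

# Count rows per (ID, Feature) pair, then divide by the total row count
# so that all four shares add up to 1.
share = mydf.groupby(["ID", "Feature"]).size() / len(mydf)

# Turn the MultiIndex Series into a flat table with a named column.
result = share.rename("Percentage").reset_index()
print(result)
```

Passing the keys as a list (`["ID", "Feature"]`) matters here: recent pandas versions treat a tuple as a single key rather than as two column names.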
Answer 1: (score: 0)
You can use pd.crosstab:
>>> newdf = pd.crosstab(index=mydf['Feature'], columns=mydf['ID']).stack()/len(mydf)
>>> print(newdf)
Feature ID
False 0 0.2
1 0.3
True 0 0.3
1 0.2
dtype: float64
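As a variation, `pd.crosstab` has a `normalize` argument that can do the division itself. The sketch below assumes the ten sample rows from the question; `table` and `within_id` are illustrative names.

```python
import pandas as pd

# Sample data copied from the first ten rows of the question's table.
mydf = pd.DataFrame({
    "Feature": [True, True, False, True, False, False, True, False, False, True],
    "ID": [1, 1, 0, 0, 1, 1, 0, 0, 1, 0],
})

# normalize="all" divides every cell by the grand total,
# which matches dividing the counts by len(mydf).
table = pd.crosstab(index=mydf["Feature"], columns=mydf["ID"], normalize="all")
print(table)

# normalize="columns" divides by each ID's own row count instead,
# giving the True/False split within each ID.
within_id = pd.crosstab(index=mydf["Feature"], columns=mydf["ID"], normalize="columns")
print(within_id)
```

Which normalization is appropriate depends on whether the percentages should add up to 100% over the whole table or within each ID. The question's expected output corresponds to the first.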
Answer 2: (score: 0)
You can also use the tableone package. Create a sample DataFrame:
Input:
# Create df with 10 rows.
df = pd.DataFrame({'Feature': [True,True,False,True,False,False,True,False,False,True],
'ID': [1,1,0,0,1,1,0,0,1,0],
'Stuff1': [23,54,45,38,32,59,37,76,32,23],
'Stuff2': [12,12,67,29,24,39,32,65,12,15]})
Output:
Answer 3: (score: 0)
In [2]: df = pd.DataFrame({'Index': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 10},
...: 'Feature': {0: True, 1: True, 2: False, 3: True, 4: False, 5: False, 6: True, 7: False, 8: False, 9: True},
...: 'ID': {0: 1, 1: 1, 2: 0, 3: 0, 4: 1, 5: 1, 6: 0, 7: 0, 8: 1, 9: 0},
...: 'Stuff1': {0: 23, 1: 54, 2: 45, 3: 38, 4: 32, 5: 59, 6: 37, 7: 76, 8: 32, 9: 23},
...: 'Stuff2': {0: 12, 1: 12, 2: 67, 3: 29, 4: 24, 5: 39, 6: 32, 7: 65, 8: 12, 9: 15}}).sort_values(["ID", "Feature"])
...: df
Out[2]:
Index Feature ID Stuff1 Stuff2
2 3 False 0 45 67
7 8 False 0 76 65
3 4 True 0 38 29
6 7 True 0 37 32
9 10 True 0 23 15
4 5 False 1 32 24
5 6 False 1 59 39
8 9 False 1 32 12
0 1 True 1 23 12
1 2 True 1 54 12
In [3]: f2 = (df.groupby(["Feature", "ID"]).agg('count')/len(df)*100).iloc[:, 0].reset_index().rename(columns={"Index" : "Percent"})
...: f2['Percent'] = f2['Percent'].astype(int).astype(str) + "%"
...: f2
Out[3]:
Feature ID Percent
0 False 0 20%
1 False 1 30%
2 True 0 30%
3 True 1 20%
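The question asks for two separate outputs, one per ID. A possible way to get there is to split the combined table afterwards. This is only a sketch on the ten sample rows, and `result` and `per_id` are names invented for illustration.

```python
import pandas as pd

# Sample data copied from the first ten rows of the question's table.
mydf = pd.DataFrame({
    "Feature": [True, True, False, True, False, False, True, False, False, True],
    "ID": [1, 1, 0, 0, 1, 1, 0, 0, 1, 0],
})

# Share of all rows per (Feature, ID) pair, as a percentage.
share = mydf.groupby(["Feature", "ID"]).size() / len(mydf) * 100

# Round before casting so that values such as 30.000000000000004
# or 29.999999999999996 both become 30 rather than being truncated.
result = share.round().astype(int).astype(str) + "%"
result = result.rename("Percent").reset_index()

# One small table per ID value, keyed by that value.
per_id = {
    id_value: part.reset_index(drop=True)
    for id_value, part in result.groupby("ID")
}

for id_value, part in per_id.items():
    print(part.to_string(index=False))
    print()
```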