Consider this DataFrame:
STUDENT T_1 T_2 T_3 T_4
0 A PASS FAIL PASS FAIL
1 B PASS FAIL FAIL FAIL
2 C FAIL FAIL PASS PASS
3 D PASS FAIL PASS PASS
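Columns T_1 through T_4 represent tests. Here, T_1 and T_3 are tests of type 'X', and T_2 and T_4 are tests of type 'Y'. The columns hold categorical values. I want the percentage distribution of PASS/FAIL for each test type (X and Y). So I want this: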
STATUS X Y
0 PASS 0.75 (6/8) 0.25 (2/8)
1 FAIL 0.25 (2/8) 0.75 (6/8)
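I know I can use s.value_counts() / s.count() on a Series to get the status distribution for a single column, but how do I aggregate across multiple columns (i.e. combine T_1 with T_3 and T_2 with T_4, since I know they belong to the same test type)?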
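For reference, the single-column calculation mentioned above can be sketched as follows. The Series here is just the T_1 column from the example frame; `value_counts(normalize=True)` is shown as an equivalent shortcut, assuming the column has no missing values.

```python
import pandas as pd

# The T_1 column from the example frame.
s = pd.Series(['PASS', 'PASS', 'FAIL', 'PASS'], name='T_1')

# Share of each status in this one column.
per_column = s.value_counts() / s.count()

# Equivalent shortcut when there are no missing values.
same = s.value_counts(normalize=True)

print(per_column)
```

This only covers one column at a time, which is why the question asks how to pool several columns per test type.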
Answer 0 (score: 1)
Here is one approach.
import pandas as pd
import numpy as np
# simulate data similar to yours: 10 students with random results
student_id = np.array('A B C D E F G H I J'.split()).reshape(10, 1)
test_results = np.random.choice(['PASS', 'FAIL'], size=(10, 4), p=[0.7, 0.3])
data = np.concatenate([student_id, test_results], axis=1)
df = pd.DataFrame(data, columns=['STUDENT', 'T_1', 'T_2', 'T_3', 'T_4'])
# set index as student names
df.set_index('STUDENT', inplace=True)
# add multi-level index to columns
df.columns = pd.MultiIndex.from_tuples([('T_1', 'X'), ('T_2', 'Y'), ('T_3', 'X'), ('T_4', 'Y')])
# transpose the df, groupby X,Y
by = df.T.groupby(level=1)
def count_func(group):
    # count PASS / FAIL over all tests of this type
    num_pass = (group.values == 'PASS').sum()
    num_fail = (group.values == 'FAIL').sum()
    total = num_pass + num_fail
    # show the rate as a fraction, followed by the raw counts
    pass_rate = '{:.2f} ({}/{})'.format(num_pass / total, num_pass, total)
    fail_rate = '{:.2f} ({}/{})'.format(num_fail / total, num_fail, total)
    return pd.Series({'PASS_RATE': pass_rate, 'FAIL_RATE': fail_rate})
result = by.apply(count_func)
# the exact counts depend on the random draw above
Out[5]:
PASS_RATE FAIL_RATE
X 0.75 (15/20) 0.25 (5/20)
Y 0.75 (15/20) 0.25 (5/20)
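As an alternative to the answer's approach, here is a minimal sketch that works directly on the four-student frame from the question. It reshapes the data to long form with `melt` and then uses `pd.crosstab` with `normalize='columns'`. The `test_type` mapping and the names `long` and `dist` are introduced here for illustration only; the mapping simply encodes the X/Y assignment stated in the question.

```python
import pandas as pd

# The four-student frame from the question.
df = pd.DataFrame({
    'STUDENT': ['A', 'B', 'C', 'D'],
    'T_1': ['PASS', 'PASS', 'FAIL', 'PASS'],
    'T_2': ['FAIL', 'FAIL', 'FAIL', 'FAIL'],
    'T_3': ['PASS', 'FAIL', 'PASS', 'PASS'],
    'T_4': ['FAIL', 'FAIL', 'PASS', 'PASS'],
})

# Assumed mapping from test column to test type, as described in the question.
test_type = {'T_1': 'X', 'T_2': 'Y', 'T_3': 'X', 'T_4': 'Y'}

# Reshape to one row per (student, test) pair.
long = df.melt(id_vars='STUDENT', var_name='TEST', value_name='STATUS')
long['TYPE'] = long['TEST'].map(test_type)

# Share of each status within each test type (each column sums to 1).
dist = pd.crosstab(long['STATUS'], long['TYPE'], normalize='columns')
print(dist)
```

On the question's data this gives a PASS share of 0.75 for X and 0.25 for Y, which matches the table the question asks for. Passing `normalize=False` (the default) would return the raw counts instead.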