I am looking to run t-tests on various data in a pandas DataFrame.
I have a DataFrame organized like this:
import pandas as pd
from scipy import stats

df = pd.DataFrame({'a': {('0hr', '0.01um', 0): 12,
('0hr', '0.01um', 1): 10,
('0hr', '0.1um', 0): 8,
('0hr', '0.1um', 1): 6,
('0hr', 'Control', 0): 4,
('0hr', 'Control', 1): 2,
('24hr', '0.01um', 0): 18,
('24hr', '0.01um', 1): 15,
('24hr', '0.1um', 0): 12,
('24hr', '0.1um', 1): 9,
('24hr', 'Control', 0): 6,
('24hr', 'Control', 1): 3},
'b': {('0hr', '0.01um', 0): 42,
('0hr', '0.01um', 1): 35,
('0hr', '0.1um', 0): 28,
('0hr', '0.1um', 1): 21,
('0hr', 'Control', 0): 14,
('0hr', 'Control', 1): 7,
('24hr', '0.01um', 0): 30,
('24hr', '0.01um', 1): 25,
('24hr', '0.1um', 0): 20,
('24hr', '0.1um', 1): 15,
('24hr', 'Control', 0): 10,
('24hr', 'Control', 1): 5}})
print(df)
                  a   b
0hr  0.01um  0   12  42
             1   10  35
     0.1um   0    8  28
             1    6  21
     Control 0    4  14
             1    2   7
24hr 0.01um  0   18  30
             1   15  25
     0.1um   0   12  20
             1    9  15
     Control 0    6  10
             1    3   5
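For each column (a, b, etc.), I want to run a t-test that compares the Control at a given time point against each of the other treatments at that same time point.
For example: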
[t, prob] = stats.ttest_ind( df.loc['0hr'].loc['Control'] , df.loc['0hr'].loc['Control'], 1, equal_var=True)
[t, prob] = stats.ttest_ind( df.loc['0hr'].loc['Control'] , df.loc['0hr'].loc['0.01um'], 1, equal_var=True)
[t, prob] = stats.ttest_ind( df.loc['0hr'].loc['Control'] , df.loc['0hr'].loc['0.1um'], 1, equal_var=True)
[t, prob] = stats.ttest_ind( df.loc['24hr'].loc['Control'] , df.loc['24hr'].loc['Control'], 1, equal_var=True)
[t, prob] = stats.ttest_ind( df.loc['24hr'].loc['Control'] , df.loc['24hr'].loc['0.01um'], 1, equal_var=True)
[t, prob] = stats.ttest_ind( df.loc['24hr'].loc['Control'] , df.loc['24hr'].loc['0.1um'], 1, equal_var=True)
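I have been trying to do this with df.apply, but I am not sure what the correct syntax is. I would like the results returned in a new DataFrame structured like this: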
results = pd.DataFrame({'a': {('0hr', '0.01um', 't'): '-',
('0hr', '0.01um', 'prob'): '-',
('0hr', '0.1um', 't'): '-',
('0hr', '0.1um', 'prob'): '-',
('0hr', 'Control', 't'): '-',
('0hr', 'Control', 'prob'): '-',
('24hr', '0.01um', 't'): '-',
('24hr', '0.01um', 'prob'): '-',
('24hr', '0.1um', 't'): '-',
('24hr', '0.1um', 'prob'): '-',
('24hr', 'Control', 't'): '-',
('24hr', 'Control', 'prob'): '-'},
'b': {('0hr', '0.01um', 't'): '-',
('0hr', '0.01um', 'prob'): '-',
('0hr', '0.1um', 't'): '-',
('0hr', '0.1um', 'prob'): '-',
('0hr', 'Control', 't'): '-',
('0hr', 'Control', 'prob'): '-',
('24hr', '0.01um', 't'): '-',
('24hr', '0.01um', 'prob'): '-',
('24hr', '0.1um', 't'): '-',
('24hr', '0.1um', 'prob'): '-',
('24hr', 'Control', 't'): '-',
('24hr', 'Control', 'prob'): '-'}})
Answer 0 (score: 0)
OK, I am not completely sure I have understood the situation, but I think this is one way to handle the MultiIndex.
In [195]:
times = df.index.get_level_values(0).unique()
conditions = df.index.get_level_values(1).unique()
index = pd.MultiIndex.from_product([times, conditions, ['t', 'p']])
result = pd.DataFrame(columns=['a', 'b'], index=index)
for time in times:
    for condition in conditions.drop('Control'):
        # Control vs. each other condition at the same time point
        t, p = stats.ttest_ind(df.loc[time].loc['Control'], df.loc[time].loc[condition], axis=1, equal_var=True)
        result.loc[(time, condition, 't')] = t
        result.loc[(time, condition, 'p')] = p
print(result)
Result:
                          a           b
0hr  Control t          NaN         NaN
             p          NaN         NaN
     0.01um  t   -0.6706134   -1.412036
             p    0.5715365   0.2934382
     0.1um   t   -0.8049845    -1.13842
             p    0.5053153   0.3729403
24hr Control t          NaN         NaN
             p          NaN         NaN
     0.01um  t    -2.529822   -3.137858
             p    0.1271284  0.08831539
     0.1um   t    -1.788854   -2.529822
             p    0.2155355   0.1271284
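As a sanity check on what this table holds, the -0.8049845 listed under a for 0hr / 0.1um can be reproduced by hand. The sketch below uses only the standard library and assumes the usual pooled-variance (equal_var=True) t statistic; pooled_t is a name made up for this illustration. The match comes from comparing replicate 0's (a, b) pair for Control with replicate 0's pair for 0.1um, which suggests that passing 1 as the axis tests across the a and b columns within each replicate, so the a and b columns of the result appear to correspond to replicates 0 and 1 rather than to the original columns.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t(x, y):
    """Student's t statistic with pooled variance (illustrative helper)."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(x) - mean(y)) / sqrt(pooled_var * (1 / nx + 1 / ny))


# Replicate 0 at 0hr: the (a, b) values for Control and for 0.1um.
row_wise = pooled_t([4, 14], [8, 28])
# Column a at 0hr: replicates 0 and 1 for Control and for 0.1um.
column_wise = pooled_t([4, 2], [8, 6])

print(round(row_wise, 7))     # -0.8049845
print(round(column_wise, 7))  # -2.8284271
```

The first value matches the table entry; the second is what a per-column test would give for the same cell.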
You can easily fill in the Control rows if you need them, but as you said, the result there is predictable.
Hope it helps anyway.
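If the goal is one test per column (a, b, ...), as the question describes, a variant along the following lines may be closer. It is only a sketch: it assumes axis=0 is what is wanted (one test per column, down the replicate rows), rebuilds the question's data so that it runs on its own, and uses a made-up helper name, control_ttests. The 't' / 'prob' labels follow the layout the question asked for.

```python
import pandas as pd
from scipy import stats

# Same layout as the question: (time, condition, replicate) index, columns a and b.
times = ['0hr', '24hr']
conditions = ['0.01um', '0.1um', 'Control']
index = pd.MultiIndex.from_product([times, conditions, [0, 1]])
df = pd.DataFrame({'a': [12, 10, 8, 6, 4, 2, 18, 15, 12, 9, 6, 3],
                   'b': [42, 35, 28, 21, 14, 7, 30, 25, 20, 15, 10, 5]},
                  index=index)


def control_ttests(frame, control='Control'):
    """Hypothetical helper: per-column t-test of each condition vs. the control."""
    rows = {}
    for time in frame.index.get_level_values(0).unique():
        block = frame.loc[time]
        ctrl = block.loc[control].to_numpy()
        for condition in block.index.get_level_values(0).unique():
            if condition == control:
                continue
            # axis=0 runs one test per column, down the replicate rows.
            t, p = stats.ttest_ind(ctrl, block.loc[condition].to_numpy(),
                                   axis=0, equal_var=True)
            rows[(time, condition, 't')] = pd.Series(t, index=frame.columns)
            rows[(time, condition, 'prob')] = pd.Series(p, index=frame.columns)
    # Tuple keys become a MultiIndex; transpose so they index the rows.
    return pd.DataFrame(rows).T


results = control_ttests(df)
print(results)
```

Control rows are skipped here rather than filled with NaN; comparing Control with itself would add nothing beyond t = 0.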