Performing t-tests on a MultiIndex pandas DataFrame

Asked: 2015-04-06 19:10:48

Tags: python python-2.7 pandas scipy

I want to run t-tests on various subsets of the data in a pandas DataFrame.

I have a DataFrame organized like this:

import pandas as pd
from scipy import stats

df = pd.DataFrame({'a': {('0hr', '0.01um', 0): 12,
      ('0hr', '0.01um', 1): 10,
      ('0hr', '0.1um', 0): 8,
      ('0hr', '0.1um', 1): 6,
      ('0hr', 'Control', 0): 4,
      ('0hr', 'Control', 1): 2,
      ('24hr', '0.01um', 0): 18,
      ('24hr', '0.01um', 1): 15,
      ('24hr', '0.1um', 0): 12,
      ('24hr', '0.1um', 1): 9,
      ('24hr', 'Control', 0): 6,
      ('24hr', 'Control', 1): 3},
     'b': {('0hr', '0.01um', 0): 42,
      ('0hr', '0.01um', 1): 35,
      ('0hr', '0.1um', 0): 28,
      ('0hr', '0.1um', 1): 21,
      ('0hr', 'Control', 0): 14,
      ('0hr', 'Control', 1): 7,
      ('24hr', '0.01um', 0): 30,
      ('24hr', '0.01um', 1): 25,
      ('24hr', '0.1um', 0): 20,
      ('24hr', '0.1um', 1): 15,
      ('24hr', 'Control', 0): 10,
      ('24hr', 'Control', 1): 5}})

print(df)

                     a   b
    0hr  0.01um  0  12  42
                 1  10  35
         0.1um   0   8  28
                 1   6  21
         Control 0   4  14
                 1   2   7
    24hr 0.01um  0  18  30
                 1  15  25
         0.1um   0  12  20
                 1   9  15
         Control 0   6  10
                 1   3   5

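For each column (a, b, etc.), I want to run a t-test that compares the Control in a given time frame against each of the other treatments in that same time frame.

For example: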

[t, prob] = stats.ttest_ind( df.loc['0hr'].loc['Control'] , df.loc['0hr'].loc['Control'], 1, equal_var=True)
[t, prob] = stats.ttest_ind( df.loc['0hr'].loc['Control'] , df.loc['0hr'].loc['0.01um'], 1, equal_var=True)
[t, prob] = stats.ttest_ind( df.loc['0hr'].loc['Control'] , df.loc['0hr'].loc['0.1um'], 1, equal_var=True)
[t, prob] = stats.ttest_ind( df.loc['24hr'].loc['Control'] , df.loc['24hr'].loc['Control'], 1, equal_var=True)
[t, prob] = stats.ttest_ind( df.loc['24hr'].loc['Control'] , df.loc['24hr'].loc['0.01um'], 1, equal_var=True)
[t, prob] = stats.ttest_ind( df.loc['24hr'].loc['Control'] , df.loc['24hr'].loc['0.1um'], 1, equal_var=True)

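I have been trying to do this with df.apply, but I am not sure what the correct syntax is. I would like the results returned in a new DataFrame structured like this: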

results = pd.DataFrame({'a': {('0hr', '0.01um', 't'): '-',
  ('0hr', '0.01um', 'prob'): '-',
  ('0hr', '0.1um', 't'): '-',
  ('0hr', '0.1um', 'prob'): '-',
  ('0hr', 'Control', 't'): '-',
  ('0hr', 'Control', 'prob'): '-',
  ('24hr', '0.01um', 't'): '-',
  ('24hr', '0.01um', 'prob'): '-',
  ('24hr', '0.1um', 't'): '-',
  ('24hr', '0.1um', 'prob'): '-',
  ('24hr', 'Control', 't'): '-',
  ('24hr', 'Control', 'prob'): '-'},
 'b': {('0hr', '0.01um', 't'): '-',
  ('0hr', '0.01um', 'prob'): '-',
  ('0hr', '0.1um', 't'): '-',
  ('0hr', '0.1um', 'prob'): '-',
  ('0hr', 'Control', 't'): '-',
  ('0hr', 'Control', 'prob'): '-',
  ('24hr', '0.01um', 't'): '-',
  ('24hr', '0.01um', 'prob'): '-',
  ('24hr', '0.1um', 't'): '-',
  ('24hr', '0.1um', 'prob'): '-',
  ('24hr', 'Control', 't'): '-',
  ('24hr', 'Control', 'prob'): '-'}})

1 Answer:

Answer 0: (score: 0)

OK, I'm not entirely sure I've understood the situation, but I think this is the way to handle the MultiIndex.

In [195]:

index = pd.MultiIndex.from_product([set(df.index.get_level_values(0)), set(df.index.get_level_values(1)), ['t', 'p']])
result = pd.DataFrame(columns=['a', 'b'], index=index)

for time in set(df.index.get_level_values(0)):
    for condition in set(df.index.get_level_values(1)) - set(['Control']):
        t, p = stats.ttest_ind( df.loc[time].loc['Control'] , df.loc[time].loc[condition], 1, equal_var=True)
        result.loc[(time, condition, 't')] = t
        result.loc[(time, condition, 'p')] = p
print(result)

Result:

                        a           b
0hr  Control t        NaN         NaN
             p        NaN         NaN
     0.01um  t -0.6706134   -1.412036
             p  0.5715365   0.2934382
     0.1um   t -0.8049845    -1.13842
             p  0.5053153   0.3729403
24hr Control t        NaN         NaN
             p        NaN         NaN
     0.01um  t  -2.529822   -3.137858
             p  0.1271284  0.08831539
     0.1um   t  -1.788854   -2.529822
             p  0.2155355   0.1271284
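
One thing worth checking in the calls above: the `1` passed to `stats.ttest_ind` after the two samples is the `axis` argument, so the test runs across the columns (a, b) within each replicate row rather than across the replicates within each column. If one t-test per column is what is wanted, a minimal sketch with `axis=0` passed by keyword might look like the following. The level names `time`, `treatment` and `replicate`, and the variable names `rows` and `per_column`, are my own assumptions and do not come from the question.

```python
import pandas as pd
from scipy import stats

# Rebuild the question's frame; the level names are assumed for readability.
index = pd.MultiIndex.from_product(
    [["0hr", "24hr"], ["0.01um", "0.1um", "Control"], [0, 1]],
    names=["time", "treatment", "replicate"],
)
df = pd.DataFrame(
    {
        "a": [12, 10, 8, 6, 4, 2, 18, 15, 12, 9, 6, 3],
        "b": [42, 35, 28, 21, 14, 7, 30, 25, 20, 15, 10, 5],
    },
    index=index,
)

rows = {}
for time in df.index.get_level_values("time").unique():
    # Control replicates for this time point, as a (replicates x columns) array.
    control = df.loc[(time, "Control"), :].to_numpy()
    for treatment in df.index.get_level_values("treatment").unique():
        if treatment == "Control":
            continue
        sample = df.loc[(time, treatment), :].to_numpy()
        # axis=0 compares the replicates (rows) separately for each column.
        t, p = stats.ttest_ind(control, sample, axis=0, equal_var=True)
        rows[(time, treatment, "t")] = pd.Series(t, index=df.columns)
        rows[(time, treatment, "p")] = pd.Series(p, index=df.columns)

# One row per (time, treatment, statistic), one column per original column.
per_column = pd.DataFrame(rows).T
print(per_column)
```

Collecting the rows in a dict and building the frame once at the end avoids preallocating an object-dtype frame full of NaN, so the values come back as floats.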

You can easily fill in the Control rows if needed but, as you said, the result there is predictable.

Hope it helps anyway.
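
To illustrate why the Control rows are predictable: testing a sample against itself has zero mean difference, so the statistic should come out as 0 and the p-value as 1. A small sketch, using the 0hr Control replicates from the question (the variable name `control` is just for illustration):

```python
import pandas as pd
from scipy import stats

# The two 0hr Control replicates for columns a and b.
control = pd.DataFrame({"a": [4, 2], "b": [14, 7]})

# Comparing the sample with itself, column by column.
t, p = stats.ttest_ind(control.to_numpy(), control.to_numpy(), axis=0, equal_var=True)
print(t, p)
```

Those constants could be written straight into the Control rows of the result instead of running the test.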