Pandas DataFrame math

Time: 2015-04-04 01:49:55

Tags: python python-2.7 pandas

I have a pandas DataFrame with three index levels, structured like this:

                        a    b
    0hr    0.01um   0   12   42
                    1   10   35
           0.1um    0   8    28
                    1   6    21
          Control   0   4    14
                    1   2    7
   24hr    0.01um   0   18   30
                    1   15   25
           0.1um    0   12   20
                    1   9    15
          Control   0   6    10
                    1   3    5

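The DataFrame was imported from a series of Excel files. Sorry, I can't provide a snippet that generates this three-level index structure, because I don't know how to build it directly.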
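For readers who want to reproduce the frame without the Excel files, one possible way to build a three-level index directly is `pd.MultiIndex.from_product`. This is only a sketch: the level names `time`, `dose` and `rep` are assumptions made for illustration (the original index is unnamed), and the values are copied from the table above.

```python
import pandas as pd

# Level names are hypothetical; the frame in the question has unnamed levels.
index = pd.MultiIndex.from_product(
    [["0hr", "24hr"], ["0.01um", "0.1um", "Control"], [0, 1]],
    names=["time", "dose", "rep"],
)

# Values listed in the same row order as the table in the question.
df = pd.DataFrame(
    {
        "a": [12, 10, 8, 6, 4, 2, 18, 15, 12, 9, 6, 3],
        "b": [42, 35, 28, 21, 14, 7, 30, 25, 20, 15, 10, 5],
    },
    index=index,
)
print(df)
```

`from_product` takes the Cartesian product of the three label lists, so it only fits data where every combination is present, as is the case here.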

I'm looking for the syntax to divide each value by its corresponding "Control" value.

For example:

                        a       b
    0hr    0.01um   0   =12/4   =42/14
                    1   =10/2   =35/7
           0.1um    0   =8/4    =28/14
                    1   =6/2    =21/7
          Control   0   =4/4    =14/14
                    1   =2/2    =7/7
   24hr    0.01um   0   =18/6   =30/10
                    1   =15/3   =25/5
           0.1um    0   =12/6   =20/10
                    1   =9/3    =15/5
          Control   0   =6/6    =10/10
                    1   =3/3    =5/5

which would produce a DataFrame with the following values:

                        a    b
    0hr    0.01um   0   3    3
                    1   5    5
           0.1um    0   2    2
                    1   3    3
          Control   0   1    1
                    1   1    1
   24hr    0.01um   0   3    3
                    1   5    5
           0.1um    0   2    2
                    1   3    3
          Control   0   1    1
                    1   1    1 

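I tried doing this with a loop, but I suspect the DataFrame.div method offers a cleaner syntax; I just can't figure it out. Any help would be greatly appreciated.

2 Answers:

Answer 0 (score: 2)

One would hope to be able to simply define the control and divide the DataFrame by it, but unfortunately this does not work as expected. The division only happens where the indexes line up (on 'Control'), leaving NaN for the other labels of that index level.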

import pandas as pd

# Initialize DataFrame
df = pd.DataFrame({'a': {('0hr', '0.01um', 0): 12,
  ('0hr', '0.01um', 1): 10,
  ('0hr', '0.1um', 0): 8,
  ('0hr', '0.1um', 1): 6,
  ('0hr', 'Control', 0): 4,
  ('0hr', 'Control', 1): 2,
  ('24hr', '0.01um', 0): 18,
  ('24hr', '0.01um', 1): 15,
  ('24hr', '0.1um', 0): 12,
  ('24hr', '0.1um', 1): 9,
  ('24hr', 'Control', 0): 6,
  ('24hr', 'Control', 1): 3},
 'b': {('0hr', '0.01um', 0): 42,
  ('0hr', '0.01um', 1): 35,
  ('0hr', '0.1um', 0): 28,
  ('0hr', '0.1um', 1): 21,
  ('0hr', 'Control', 0): 14,
  ('0hr', 'Control', 1): 7,
  ('24hr', '0.01um', 0): 30,
  ('24hr', '0.01um', 1): 25,
  ('24hr', '0.1um', 0): 20,
  ('24hr', '0.1um', 1): 15,
  ('24hr', 'Control', 0): 10,
  ('24hr', 'Control', 1): 5}})

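# drop_level=False keeps the 'Control' level in the index, as in the output shown next.
control = df.xs('Control', level=1, drop_level=False)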

>>> control 
                a   b
0hr  Control 0  4  14
             1  2   7
24hr Control 0  6  10
             1  3   5

>>> df.divide(control) 
                 a   b
0hr  0.01um  0 NaN NaN
             1 NaN NaN
     0.1um   0 NaN NaN
             1 NaN NaN
     Control 0   1   1
             1   1   1
24hr 0.01um  0 NaN NaN
             1 NaN NaN
     0.1um   0 NaN NaN
             1 NaN NaN
     Control 0   1   1
             1   1   1
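The NaN values come from label alignment: arithmetic between two pandas objects is carried out on the union of their indexes, and any label present on only one side yields NaN. A minimal sketch with two small Series (the values are taken from column `a` at `0hr`, replicate 0, purely for illustration):

```python
import pandas as pd

# One value per concentration, plus a one-element Series holding the control.
values = pd.Series([12, 8, 4], index=["0.01um", "0.1um", "Control"])
control_only = pd.Series([4], index=["Control"])

# Aligned division: only the "Control" label exists on both sides.
aligned = values / control_only

# Dividing by the scalar control value applies it to every row.
by_scalar = values / control_only["Control"]

print(aligned)
print(by_scalar)
```

The same thing happens in the DataFrame case, only with three-level labels instead of plain strings.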

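Alternatively, one can try specifying the level when dividing. This raises an error, however, because both operands are still MultiIndex objects and their levels could be matched in more than one way, which makes the join ambiguous.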

>>> df.divide(control, level=1)
TypeError: Join on level between two MultiIndex objects is ambiguous

The trick is to reshape your DataFrame to avoid this ambiguity.

# Reshape DataFrame: move index levels 0 and 2 to the rows, leaving level 1 (the concentration) as columns.
df2 = df.T.stack(level=[0, 2])
>>> df2
          0.01um  0.1um  Control
a 0hr  0      12      8        4
       1      10      6        2
  24hr 0      18     12        6
       1      15      9        3
b 0hr  0      42     28       14
       1      35     21        7
  24hr 0      30     20       10
       1      25     15        5


# Divide reshaped DataFrame by 'Control' on the appropriate axis.
df3 = df2.divide(df2.Control, axis=0)
>>> df3
          0.01um  0.1um  Control
a 0hr  0       3      2        1
       1       5      3        1
  24hr 0       3      2        1
       1       5      3        1
b 0hr  0       3      2        1
       1       5      3        1
  24hr 0       3      2        1
       1       5      3        1

You then need to reshape the DataFrame back to its original format.

# Shape DataFrame back to original order.
result = df3.T.unstack().reorder_levels([1, 3, 2, 0]).unstack()

>>> result
                a  b
0hr  0.01um  0  3  3
             1  5  5
     0.1um   0  2  2
             1  3  3
     Control 0  1  1
             1  1  1
24hr 0.01um  0  3  3
             1  5  5
     0.1um   0  2  2
             1  3  3
     Control 0  1  1
             1  1  1
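For comparison, here is a sketch of a route that avoids the round trip through reshaping: select the control rows, repeat them so that they line up row by row with the full frame, and divide positionally. It is not part of the original answer. The level names `time`, `dose` and `rep` and the variable `denominator` are assumptions introduced for illustration.

```python
import pandas as pd

# Rebuild the example data; the level names are hypothetical.
index = pd.MultiIndex.from_product(
    [["0hr", "24hr"], ["0.01um", "0.1um", "Control"], [0, 1]],
    names=["time", "dose", "rep"],
)
df = pd.DataFrame(
    {
        "a": [12, 10, 8, 6, 4, 2, 18, 15, 12, 9, 6, 3],
        "b": [42, 35, 28, 21, 14, 7, 30, 25, 20, 15, 10, 5],
    },
    index=index,
)

# Control rows only; xs drops the "dose" level by default, leaving (time, rep).
control = df.xs("Control", level="dose")

# For every row of df, look up the control row with the same (time, rep).
denominator = control.reindex(df.index.droplevel("dose"))

# Divide by the raw values so that no index alignment is attempted.
result = df / denominator.to_numpy()
print(result)
```

Dividing by `denominator.to_numpy()` rather than by `denominator` itself sidesteps the alignment problem described earlier, at the cost of relying on the two frames having the same row order, which `reindex` guarantees here.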

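Answer 1 (score: 1)

OK, this is what I came up with. It takes more steps than I'd like, but it works. Hopefully someone comes up with something better.

Starting with your frame: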

                a   b
0hr 0.01um  0   12  42
            1   10  35
    0.1um   0   8   28
            1   6   21
   Control  0   4   14
            1   2   7
24hr 0.01um 0   18  30
            1   15  25
     0.1um  0   12  20
            1   9   15
    Control 0   6   10
            1   3   5

First we reset the index. Note the column names given to the former index levels; yours may differ.

frame.reset_index(inplace=True)
frame

    level_0 level_1 level_2 a   b
0   0hr     0.01um  0      12   42
1   0hr     0.01um  1      10   35
2   0hr     0.1um   0      8    28
3   0hr     0.1um   1      6    21
4   0hr     Control 0      4    14
5   0hr     Control 1      2    7
6   24hr    0.01um  0     18    30
7   24hr    0.01um  1     15    25
8   24hr    0.1um   0     12    20
9   24hr    0.1um   1      9    15
10  24hr    Control 0      6    10
11  24hr    Control 1      3    5
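The `level_0`, `level_1`, `level_2` names appear because the index levels are unnamed. If more descriptive column names are preferred, the levels can be named before resetting. A small sketch on two rows of the data, where the names `time`, `dose` and `rep` are hypothetical:

```python
import pandas as pd

# Two rows are enough to show the column names produced by reset_index.
index = pd.MultiIndex.from_tuples([("0hr", "Control", 0), ("0hr", "Control", 1)])
frame = pd.DataFrame({"a": [4, 2], "b": [14, 7]}, index=index)

# Unnamed levels fall back to level_0, level_1, level_2.
default_cols = frame.reset_index().columns.tolist()

# rename_axis assigns names to the index levels first (names are assumed).
named = frame.rename_axis(["time", "dose", "rep"]).reset_index()
named_cols = named.columns.tolist()

print(default_cols)
print(named_cols)
```

The rest of this answer keeps the default `level_N` names.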

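Next, we use boolean indexing to select everything labeled Control. Then we merge the original frame with this "filtered" version, matching rows on level_0 and level_2.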

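# Boolean mask for the control rows (named is_control to avoid shadowing the built-in filter).
is_control = frame["level_1"] == "Control"
frame = pd.merge(frame, frame[is_control], on=["level_0", "level_2"], suffixes=["", "_control"])
frame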

    level_0 level_1 level_2 a   b   level_1_control a_control   b_control
0   0hr     0.01um  0      12   42  Control         4          14
1   0hr     0.1um   0      8    28  Control         4          14
2   0hr     Control 0      4    14  Control         4          14
3   0hr     0.01um  1      10   35  Control         2          7
4   0hr     0.1um   1      6    21  Control         2          7
5   0hr     Control 1      2    7   Control         2          7
6   24hr    0.01um  0     18    30  Control         6          10
7   24hr    0.1um   0     12    20  Control         6          10
8   24hr    Control 0     6     10  Control         6          10
9   24hr    0.01um  1     15    25  Control         3          5
10  24hr    0.1um   1      9    15  Control         3          5
11  24hr    Control 1      3    5   Control         3          5

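Now for the division, finally. The last line then trims the DataFrame back to the original columns, sorts it, and reapplies the index to match the original frame.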

frame["a"] = frame["a"] / frame["a_control"]
frame["b"] = frame["b"] / frame["b_control"]
# sort_values replaces DataFrame.sort, which was removed from later pandas releases.
frame = frame[["level_0","level_1","level_2","a","b"]].sort_values(["level_0","level_1","level_2"]).set_index(["level_0","level_1","level_2"])
frame

                         a  b
level_0 level_1 level_2     
0hr     0.01um  0        3  3
                1        5  5
        0.1um   0        2  2
                1        3  3
        Control 0        1  1
                1        1  1
24hr    0.01um  0        3  3
                1        5  5
        0.1um   0        2  2
                1        3  3
        Control 0        1  1
                1        1  1