Question

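I have some functions that create new columns in a pandas DataFrame as functions of existing columns. Two different scenarios arise: (1) the DataFrame's columns are not a MultiIndex and it has a set of columns, say [a, b], and (2) the columns are a MultiIndex, so the same column headers are repeated N times: [(a, 1), (b, 1), (a, 2), (b, 2), ..., (a, N), (b, N)].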

I have been writing these functions in the style shown below:

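import pandas as pd

def f(df):
    if isinstance(df.columns, pd.MultiIndex):
        # Repeat the calculation for every sub-column 1..N
        for s in df['a'].columns:
            df['c', s] = someFunction(df['a', s], df['b', s])
    else:
        df['c'] = someFunction(df['a'], df['b'])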

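Is there another way to do this, without these if-MultiIndex/else statements everywhere and without repeating the someFunction code? I would prefer not to split the multi-indexed frame into N smaller DataFrames: I often need to filter the data or otherwise manipulate it while keeping the rows consistent across all of the 1, 2, ..., N frames, and keeping them together in one frame seems the best way to do that.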

Answer 1

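You may still need to test whether the columns are a MultiIndex, but this should be cleaner and more efficient. Caveat: this will not work if your function uses summary statistics of a column, for example if someFunction divides by the mean of column 'a'.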

Solution

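def someFunction(a, b):
    # Placeholder element-wise operation; substitute your own.
    return a + b

def f(df):
    df = df.copy()  # leave the caller's frame untouched
    ismi = isinstance(df.columns, pd.MultiIndex)
    if ismi:
        # Move the inner column level into the row index so that
        # 'a' and 'b' become ordinary single-level columns.
        df = df.stack()

    # Note: the demonstration output further down was generated with
    # df['a'] passed as both arguments, so its c values equal 2 * a.
    df['c'] = someFunction(df['a'], df['b'])

    if ismi:
        # Restore the original wide layout.
        df = df.unstack()

    return df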

Setup

import pandas as pd
import numpy as np

setup_tuples = []

for c in ['a', 'b']:
    for i in ['one', 'two', 'three']:
        setup_tuples.append((c, i))

columns = pd.MultiIndex.from_tuples(setup_tuples)

rand_array = np.random.rand(10, len(setup_tuples))

df = pd.DataFrame(rand_array, columns=columns)

df looks like this

          a                             b                    
        one       two     three       one       two     three
0  0.282834  0.490313  0.201300  0.140157  0.467710  0.352555
1  0.838527  0.707131  0.763369  0.265170  0.452397  0.968125
2  0.822786  0.785226  0.434637  0.146397  0.056220  0.003197
3  0.314795  0.414096  0.230474  0.595133  0.060608  0.900934
4  0.334733  0.118689  0.054299  0.237786  0.658538  0.057256
5  0.993753  0.552942  0.665615  0.336948  0.788817  0.320329
6  0.310809  0.199921  0.158675  0.059406  0.801491  0.134779
7  0.971043  0.183953  0.723950  0.909778  0.103679  0.695661
8  0.755384  0.728327  0.029720  0.408389  0.808295  0.677195
9  0.276158  0.978232  0.623972  0.897015  0.253178  0.093772

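I constructed df to have MultiIndex columns. What I do is use the .stack() method to push the second level of the column index down to the second level of the row index.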

df.stack() looks like this

                a         b
0 one    0.282834  0.140157
  three  0.201300  0.352555
  two    0.490313  0.467710
1 one    0.838527  0.265170
  three  0.763369  0.968125
  two    0.707131  0.452397
2 one    0.822786  0.146397
  three  0.434637  0.003197
  two    0.785226  0.056220
3 one    0.314795  0.595133
  three  0.230474  0.900934
  two    0.414096  0.060608
4 one    0.334733  0.237786
  three  0.054299  0.057256
  two    0.118689  0.658538
5 one    0.993753  0.336948
  three  0.665615  0.320329
  two    0.552942  0.788817
6 one    0.310809  0.059406
  three  0.158675  0.134779
  two    0.199921  0.801491
7 one    0.971043  0.909778
  three  0.723950  0.695661
  two    0.183953  0.103679
8 one    0.755384  0.408389
  three  0.029720  0.677195
  two    0.728327  0.808295
9 one    0.276158  0.897015
  three  0.623972  0.093772
  two    0.978232  0.253178

Now you can operate on df.stack() just as if the columns were not a MultiIndex.
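As a quick sanity check on that claim, the sketch below compares the stacked computation against the explicit per-sub-column loop from the question. The small frame and the `a + b` body of someFunction are illustrative assumptions, not data from the original post.

```python
import numpy as np
import pandas as pd

def someFunction(a, b):
    # Placeholder element-wise operation (assumed for illustration)
    return a + b

# Small frame with MultiIndex columns: (a, one) ... (b, three)
columns = pd.MultiIndex.from_product([["a", "b"], ["one", "two", "three"]])
df = pd.DataFrame(np.arange(12, dtype=float).reshape(2, 6), columns=columns)

# Stacked approach: one call to someFunction covers every sub-column
stacked = df.stack()
stacked["c"] = someFunction(stacked["a"], stacked["b"])
via_stack = stacked.unstack()

# Explicit loop over the sub-columns, as in the question
via_loop = df.copy()
for s in df["a"].columns:
    via_loop["c", s] = someFunction(df["a", s], df["b", s])

# Both routes should produce the same 'c' values for each sub-column
for s in ["one", "two", "three"]:
    assert np.allclose(via_stack["c", s], via_loop["c", s])
```

The comparison is done by label rather than by position because the column order after unstack() can differ from the original order.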

Demonstration

print(f(df))

gives you what you want

          a                             b                             c  \
        one     three       two       one     three       two       one   
0  0.282834  0.201300  0.490313  0.140157  0.352555  0.467710  0.565667   
1  0.838527  0.763369  0.707131  0.265170  0.968125  0.452397  1.677055   
2  0.822786  0.434637  0.785226  0.146397  0.003197  0.056220  1.645572   
3  0.314795  0.230474  0.414096  0.595133  0.900934  0.060608  0.629591   
4  0.334733  0.054299  0.118689  0.237786  0.057256  0.658538  0.669465   
5  0.993753  0.665615  0.552942  0.336948  0.320329  0.788817  1.987507   
6  0.310809  0.158675  0.199921  0.059406  0.134779  0.801491  0.621618   
7  0.971043  0.723950  0.183953  0.909778  0.695661  0.103679  1.942086   
8  0.755384  0.029720  0.728327  0.408389  0.677195  0.808295  1.510767   
9  0.276158  0.623972  0.978232  0.897015  0.093772  0.253178  0.552317   


      three       two  
0  0.402600  0.980626  
1  1.526739  1.414262  
2  0.869273  1.570453  
3  0.460948  0.828193  
4  0.108599  0.237377  
5  1.331230  1.105884  
6  0.317349  0.399843  
7  1.447900  0.367907  
8  0.059439  1.456654  
9  1.247944  1.956464
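The caveat about summary statistics mentioned at the start of this answer can be illustrated with a small sketch. Here divide_by_mean is a hypothetical function that divides by the mean of column 'a'; after stacking, that mean is taken over all sub-columns pooled together rather than per sub-column. The groupby-based variant at the end is one possible workaround under these assumptions, not something from the original answer.

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_product([["a", "b"], ["one", "two"]])
df = pd.DataFrame([[1.0, 10.0, 5.0, 6.0],
                   [3.0, 30.0, 7.0, 8.0]], columns=columns)

def divide_by_mean(a, b):
    # Hypothetical function that uses a summary statistic of 'a'
    return b / a.mean()

# Per-sub-column result, as the explicit loop would compute it
expected = pd.DataFrame(
    {s: divide_by_mean(df["a", s], df["b", s]) for s in df["a"].columns}
)

# Stacked result: a.mean() now pools 'one' and 'two' together
stacked = df.stack()
pooled = divide_by_mean(stacked["a"], stacked["b"]).unstack()

# Possible workaround: compute the mean per sub-column via the
# second row-index level before dividing
per_level_mean = stacked["a"].groupby(level=1).transform("mean")
fixed = (stacked["b"] / per_level_mean).unstack()

cols = ["one", "two"]
print(np.allclose(pooled[cols], expected[cols]))  # the pooled mean differs
print(np.allclose(fixed[cols], expected[cols]))   # the per-level mean matches
```

In this frame the mean of 'a' is 2 for 'one' and 20 for 'two', while the pooled mean after stacking is 11, which is why the plain stacked call gives different numbers.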
