Question

I have several DataFrames with matching columns but different indexes (MultiIndex, same levels, different values).

import numpy as np
import pandas as pd

cols = ['foo', 'bar', 'baz']
df0 = pd.DataFrame(np.random.rand(3,3), columns=cols, index=['a', 'c', 'd'])
df1 = pd.DataFrame(np.random.rand(3,3), columns=cols, index=['a', 'b', 'd'])

         foo         bar         baz
a   0.145753    0.305494    0.847635
c   0.511722    0.009868    0.053700
d   0.094677    0.935106    0.506444

         foo         bar         baz
a   0.667486    0.529557    0.733383
b   0.883774    0.420490    0.287766
d   0.406956    0.165573    0.546746

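Each DataFrame represents one experiment: data extracted from an image processing pipeline in which specific software parameters have been changed. The columns are the same because I always extract the same metrics, but the indexes differ because I may have pushed different images through the pipeline.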

I often find myself merging these DataFrames like this:

def merge_experiments(frames, names, exp_name='tag'):
    """Merge DataFrames on new level of columns"""

    prepared = []
    for df, name in zip(frames, names):
        _df = df.copy()
        _df[exp_name] = name

        _df = _df.set_index(exp_name, append=True)
        prepared.append(_df)

    return pd.concat(prepared).unstack(level=exp_name)

df = merge_experiments((df0, df1), ('exp00', 'exp01'))

          foo                 bar                 baz          
tag     exp00     exp01     exp00     exp01     exp00     exp01
a    0.590941  0.517771  0.190399  0.742759  0.884761  0.740587
b         NaN  0.973151       NaN  0.287167       NaN  0.505956
c    0.867419       NaN  0.357269       NaN  0.641385       NaN
d    0.676436  0.065348  0.820161  0.639484  0.005347  0.541025

Is there a built-in way to do this in pandas, instead of using this custom function for the merge?

Answer 1

Yes, there is: concat

(pd.concat([df0,df1],keys=['exp00', 'exp01'],axis=1)).swaplevel(0,1,axis=1).sort_index(axis=1)
Out[572]: 
        bar                 baz                 foo          
      exp00     exp01     exp00     exp01     exp00     exp01
a  0.166814  0.192251  0.804820  0.177737  0.407284  0.343585
b       NaN  0.305210       NaN  0.895246       NaN  0.670265
c  0.841093       NaN  0.710769       NaN  0.514551       NaN
d  0.432322  0.915981  0.807276  0.021481  0.366002  0.623367
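To also get the `tag` label on the experiment level, as in the question's output, `concat` accepts a `names` argument alongside `keys`. A minimal sketch, assuming seeded random data in place of the question's values (the seed and the variable name `merged` are my own choices, not part of the original):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
cols = ['foo', 'bar', 'baz']
df0 = pd.DataFrame(rng.random((3, 3)), columns=cols, index=['a', 'c', 'd'])
df1 = pd.DataFrame(rng.random((3, 3)), columns=cols, index=['a', 'b', 'd'])

# keys= adds an outer column level holding the experiment names;
# names= labels that new level 'tag'.
merged = pd.concat([df0, df1], keys=['exp00', 'exp01'], names=['tag'], axis=1)

# Move the experiment level below the metric level and group the columns.
merged = merged.swaplevel(0, 1, axis=1).sort_index(axis=1)

print(merged)
```

Rows that are missing from one experiment (here `b` and `c`) are filled with NaN, just as with the custom function.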

Answer 2

Yes, in fact it is quite easy with concat + swaplevel + sort_index:

v = pd.concat([df0, df1], keys=['exp00', 'exp11'], axis=1)
v.columns = v.columns.swaplevel(0, 1)

v.sort_index(axis=1)

        bar                 baz                 foo          
      exp00     exp11     exp00     exp11     exp00     exp11
a  0.843902  0.536313  0.248536  0.885295  0.589151  0.654772
b       NaN  0.631420       NaN  0.536034       NaN  0.819132
c  0.176537       NaN  0.498181       NaN  0.024562       NaN
d  0.668371  0.911009  0.944589  0.765258  0.081001  0.879989
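If you want to convince yourself that this matches the custom function from the question, a quick check along these lines should do. This is only a sketch: the seed and the names `custom`, `builtin`, `same_labels` and `same_values` are my own. Depending on the pandas version the row index may not come back sorted from `concat`, so both axes are sorted before comparing.

```python
import numpy as np
import pandas as pd

def merge_experiments(frames, names, exp_name='tag'):
    """The question's helper, reproduced here for comparison."""
    prepared = []
    for df, name in zip(frames, names):
        _df = df.copy()
        _df[exp_name] = name
        prepared.append(_df.set_index(exp_name, append=True))
    return pd.concat(prepared).unstack(level=exp_name)

rng = np.random.default_rng(42)  # arbitrary seed, only for reproducibility
cols = ['foo', 'bar', 'baz']
df0 = pd.DataFrame(rng.random((3, 3)), columns=cols, index=['a', 'c', 'd'])
df1 = pd.DataFrame(rng.random((3, 3)), columns=cols, index=['a', 'b', 'd'])

custom = merge_experiments((df0, df1), ('exp00', 'exp01'))

builtin = pd.concat([df0, df1], keys=['exp00', 'exp01'], axis=1)
builtin.columns = builtin.columns.swaplevel(0, 1)

# Sort both axes so that row/column order does not affect the comparison.
custom = custom.sort_index(axis=0).sort_index(axis=1)
builtin = builtin.sort_index(axis=0).sort_index(axis=1)

same_labels = (list(custom.index) == list(builtin.index)
               and list(custom.columns) == list(builtin.columns))
# equal_nan=True treats the NaN gaps (rows 'b' and 'c') as matching.
same_values = np.allclose(custom.to_numpy(), builtin.to_numpy(), equal_nan=True)

print(same_labels and same_values)
```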

