Merging DataFrames

Time: 2016-08-15 13:16:40

Tags: python pandas

Suppose we have the following DataFrames:

import pandas as pd
import numpy as np

df1_column_array = [['foo', 'bar'],
          ['A', 'B']]
df1_column_tuple = list(zip(*df1_column_array))    
df1_column_header = pd.MultiIndex.from_tuples(df1_column_tuple)

df1_index_array = [['one', 'two'],
          ['1', '2']]
df1_index_tuple = list(zip(*df1_index_array))  
df1_index_header = pd.MultiIndex.from_tuples(df1_index_tuple)


df1 = pd.DataFrame(np.random.rand(2,2), columns = df1_column_header, index = df1_index_header)
print(df1)
            foo       bar
              A         B
one 1  0.755296  0.101329
two 2  0.925653  0.587948

df2_column_array = [['alpha', 'beta'],
          ['C', 'D']]
df2_column_tuple = list(zip(*df2_column_array))    
df2_column_header = pd.MultiIndex.from_tuples(df2_column_tuple)

df2_index_array = [['three', 'four'],
          ['3', '4']]
df2_index_tuple = list(zip(*df2_index_array))  
df2_index_header = pd.MultiIndex.from_tuples(df2_index_tuple)


df2 = pd.DataFrame(np.random.rand(2,2), columns = df2_column_header, index = df2_index_header)
print(df2)
            alpha      beta
                C         D
three 3  0.751013  0.957824
four  4  0.879353  0.045079

I want to combine these DataFrames to produce:

              foo       bar     alpha      beta
                A         B         C         D
one   1  0.755296  0.101329       NaN       NaN
two   2  0.925653  0.587948       NaN       NaN
three 3       NaN       NaN  0.751013  0.957824
four  4       NaN       NaN  0.879353  0.045079

When I use concat, the order of the index is preserved, but the order of the columns is not:

df_joined = pd.concat([df1,df2])
print(df_joined)
            alpha       bar      beta       foo
                C         B         D         A
one   1       NaN  0.101329       NaN  0.755296
two   2       NaN  0.587948       NaN  0.925653
three 3  0.751013       NaN  0.957824       NaN
four  4  0.879353       NaN  0.045079       NaN

When I use join, the order of the columns is preserved, but the order of the index is not:

df_joined = df1.join(df2, how = 'outer')
print(df_joined)
              foo       bar     alpha      beta
                A         B         C         D
four  4       NaN       NaN  0.879353  0.045079
one   1  0.755296  0.101329       NaN       NaN
three 3       NaN       NaN  0.751013  0.957824
two   2  0.925653  0.587948       NaN       NaN

How can I preserve the order of both the columns and the index when combining DataFrames?

Edit 1: Please note that this is sample data. My real-world data has no convenient labels (e.g. 1, 2, 3, 4) to sort by.

Edit 2: When I apply the suggested solution to my real-world data, I get the following error:

Exception: cannot handle a non-unique multi-index!

1 answer:

Answer 0 (score: 1)

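You can use a hack: a first concat (along the rows) gives the MultiIndex in the desired row order, and you then reindex the output of a second concat (along the columns) with it: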

idx = pd.concat([df1,df2]).index
df_joined = pd.concat([df1,df2], axis=1).reindex(idx)
print(df_joined)
              foo       bar     alpha      beta
                A         B         C         D
one   0  0.269298  0.819375       NaN       NaN
two   1  0.574702  0.798920       NaN       NaN
three 3       NaN       NaN  0.436893  0.822041
four  4       NaN       NaN  0.757332  0.271900
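
To check that the hack really keeps both orders, here is a small self-contained sketch. The `make_frame` helper, the fixed seeds, and the labels are assumptions made for illustration; they are not part of the original post.

```python
import numpy as np
import pandas as pd


def make_frame(col_pairs, row_pairs, seed):
    """Build a frame with MultiIndex rows and columns (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame(
        rng.random((len(row_pairs), len(col_pairs))),
        columns=pd.MultiIndex.from_tuples(col_pairs),
        index=pd.MultiIndex.from_tuples(row_pairs),
    )


df1 = make_frame([("foo", "A"), ("bar", "B")], [("one", "1"), ("two", "2")], seed=0)
df2 = make_frame([("alpha", "C"), ("beta", "D")], [("three", "3"), ("four", "4")], seed=1)

# Row order comes from the vertical concat, column order from the horizontal one.
idx = pd.concat([df1, df2]).index
df_joined = pd.concat([df1, df2], axis=1).reindex(idx)

row_order = list(df_joined.index)
col_order = list(df_joined.columns)
print(row_order)
print(col_order)
```

Both lists should come out in the order the frames were passed in, provided the row labels are unique.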

A faster way to get the index is to build small DataFrames from the MultiIndexes and concat those instead of the full data:

idx = pd.concat([pd.DataFrame(df1.index, index=df1.index),
                 pd.DataFrame(df2.index, index=df2.index)]).index
df_joined = pd.concat([df1,df2], axis=1).reindex(idx)
print(df_joined)
              foo       bar     alpha      beta
                A         B         C         D
one   0  0.007644  0.341335       NaN       NaN
two   1  0.332005  0.449688       NaN       NaN
three 3       NaN       NaN  0.281876  0.883299
four  4       NaN       NaN  0.880252  0.061797
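
If only the row labels are needed, a possibly simpler route is to append one index to the other with `Index.append`, so that no DataFrame values are concatenated at all. This is a sketch, assuming the row labels of the two frames do not overlap; the constant-valued frames are made up for illustration.

```python
import numpy as np
import pandas as pd

rows1 = pd.MultiIndex.from_tuples([("one", "1"), ("two", "2")])
rows2 = pd.MultiIndex.from_tuples([("three", "3"), ("four", "4")])
df1 = pd.DataFrame(np.ones((2, 2)), index=rows1,
                   columns=pd.MultiIndex.from_tuples([("foo", "A"), ("bar", "B")]))
df2 = pd.DataFrame(np.zeros((2, 2)), index=rows2,
                   columns=pd.MultiIndex.from_tuples([("alpha", "C"), ("beta", "D")]))

# Append the row labels directly; the DataFrame values are not involved.
idx = df1.index.append(df2.index)

# Same labels in the same order as the index of the vertical concat.
same_as_concat = idx.equals(pd.concat([df1, df2]).index)
print(same_as_concat)

df_joined = pd.concat([df1, df2], axis=1).reindex(idx)
```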

EDIT1:

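The problem with the previous solutions is that reindex cannot handle duplicates in the index. So, if the MultiIndex in the columns has no duplicates, you can reindex the columns instead: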

print(df1)
            foo       bar
              A         B
one 0  0.384705  0.932928
    0  0.539197  0.519196

print(df2)
            alpha      beta
                C         D
three 3  0.957530  0.985926
four  4  0.479828  0.350042

cols = df1.join(df2, how = 'outer').columns
df_joined = pd.concat([df1,df2]).reindex(columns=cols)
print(df_joined)
              foo       bar     alpha      beta
                A         B         C         D
one   0  0.384705  0.932928       NaN       NaN
      0  0.539197  0.519196       NaN       NaN
three 3       NaN       NaN  0.957530  0.985926
four  4       NaN       NaN  0.479828  0.350042
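
The column order can also be built without the join, by appending one column index to the other. The following is a sketch under the assumption that the column labels are unique once `drop_duplicates()` has been applied; the data, including the repeated row label, is invented for illustration. Newer pandas versions also accept `sort=False` in `concat`, which may make the column reindex unnecessary there.

```python
import numpy as np
import pandas as pd

# df1 deliberately repeats the row label ("one", "0").
df1 = pd.DataFrame(
    np.ones((2, 2)),
    index=pd.MultiIndex.from_tuples([("one", "0"), ("one", "0")]),
    columns=pd.MultiIndex.from_tuples([("foo", "A"), ("bar", "B")]),
)
df2 = pd.DataFrame(
    np.zeros((2, 2)),
    index=pd.MultiIndex.from_tuples([("three", "3"), ("four", "4")]),
    columns=pd.MultiIndex.from_tuples([("alpha", "C"), ("beta", "D")]),
)

# Column order: df1's columns followed by df2's, shared labels kept once.
cols = df1.columns.append(df2.columns).drop_duplicates()

# Only the columns are reindexed, so the non-unique row labels are left alone.
df_joined = pd.concat([df1, df2]).reindex(columns=cols)
print(df_joined)
```

Because the rows are never reindexed, the repeated `("one", "0")` label does not trigger the non-unique MultiIndex error.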