How to merge two pandas DataFrames with different column index levels?

Time: 2015-03-03 01:06:07

Tags: python pandas

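I want to concatenate two DataFrames that have the same index but different column levels. One DataFrame has a hierarchical column index, the other does not.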

print(df1)

              A_1               A_2               A_3                .....
              Value_V  Value_y  Value_V  Value_y  Value_V  Value_y

instance200   50       0        6500     1        50       0
instance201   100      0        6400     1        50       0

The other one:

print(df2)

              PV         Estimate

instance200   2002313    1231233
instance201   2134124    1124724

The result should look like this:

             PV        Estimate   A_1               A_2               A_3                .....
                                  Value_V  Value_y  Value_V  Value_y  Value_V  Value_y

instance200  2002313   1231233    50       0        6500     1        50       0
instance201  2134124   1124724    100      0        6400     1        50       0

But a merge or concat on the frames gives me a df with a one-dimensional column index:

             PV        Estimate   (A_1,Value_V) (A_1,Value_y) (A_2,Value_V) (A_2,Value_y)  .....


instance200  2002313   1231233    50             0             6500         1
instance201  2134124   1124724    100            0             6400         1 

How can I keep the hierarchical column index of df1?

4 answers:

Answer 0 (score: 6)

Perhaps use plain old assignment:

df3 = df1.copy()
df3[df2.columns] = df2

which yields

                A_1             A_2             A_3               PV Estimate
            Value_V Value_y Value_V Value_y Value_V Value_y                  
instance200      50       0    6500       1      50       0  2002313  1231233
instance201     100       0    6400       1      50       0  2134124  1124724
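
For anyone who wants to try this, here is a minimal self-contained sketch. The frames are hypothetical stand-ins built from the values in the question, and the assignment is written column by column, which relies only on single-column assignment: a plain string key assigned to a frame with MultiIndex columns is padded with empty strings on the lower levels.

```python
import pandas as pd

# Hypothetical data mirroring the frames in the question (only A_1 and A_2)
idx = ["instance200", "instance201"]
cols = pd.MultiIndex.from_product([["A_1", "A_2"], ["Value_V", "Value_y"]])
df1 = pd.DataFrame([[50, 0, 6500, 1], [100, 0, 6400, 1]], index=idx, columns=cols)
df2 = pd.DataFrame({"PV": [2002313, 2134124], "Estimate": [1231233, 1124724]}, index=idx)

df3 = df1.copy()
for name in df2.columns:
    # The new column key becomes (name, "") in the two-level column index
    df3[name] = df2[name]

print(df3.columns.nlevels)        # 2
print(df3.columns.tolist()[-2:])  # [('PV', ''), ('Estimate', '')]
```

The new columns are appended on the right, so this form suits cases where column order does not matter.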

Answer 1 (score: 3)

You can do this by giving df2 the same number of column levels as df1:

In [11]: df1
Out[11]:
                A_1             A_2             A_3
            Value_V Value_y Value_V Value_y Value_V Value_y
instance200      50       0    6500       1      50       0
instance201     100       0    6400       1      50       0

In [12]: df2
Out[12]:
                  PV  Estimate
instance200  2002313   1231233
instance201  2134124   1124724

In [13]: df2.columns = pd.MultiIndex.from_arrays([df2.columns, [None] * len(df2.columns)])

In [14]: df2
Out[14]:
                  PV Estimate
                 NaN      NaN
instance200  2002313  1231233
instance201  2134124  1124724

Now you can concat without garbling the column names:

In [15]: pd.concat([df1, df2], axis=1)
Out[15]:
                A_1             A_2             A_3               PV Estimate
            Value_V Value_y Value_V Value_y Value_V Value_y      NaN      NaN
instance200      50       0    6500       1      50       0  2002313  1231233
instance201     100       0    6400       1      50       0  2134124  1124724

Note: to put the df2 columns first, use pd.concat([df2, df1], axis=1).
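
A small variation, offered only as a sketch: pad with empty strings instead of None, so the second level shows up blank rather than NaN, and work on a copy so df2 itself is left untouched. The data is a hypothetical stand-in for the frames above, and the name `padded` is made up for illustration.

```python
import pandas as pd

idx = ["instance200", "instance201"]
df1 = pd.DataFrame(
    [[50, 0, 6500, 1], [100, 0, 6400, 1]],
    index=idx,
    columns=pd.MultiIndex.from_product([["A_1", "A_2"], ["Value_V", "Value_y"]]),
)
df2 = pd.DataFrame({"PV": [2002313, 2134124], "Estimate": [1231233, 1124724]}, index=idx)

# Pad a copy of df2 with an empty-string second level (instead of None/NaN)
padded = df2.copy()
padded.columns = pd.MultiIndex.from_arrays([df2.columns, [""] * len(df2.columns)])

result = pd.concat([padded, df1], axis=1)  # df2's columns come first
print(result.columns.nlevels)              # 2
print(result["A_1"])                       # selecting by top level still works
```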


That said, I'm not sure I can think of a use case for this; keeping them as separate DataFrames may actually be the simpler solution...!

Answer 2 (score: 0)

I built a function for this purpose, as follows:

import pandas as pd

def concat( df1, df2 ):

  """
  Concatenates two dataframes df1 and df2 along the columns, even if the two
  dataframes have a different number of hierarchical column levels.

  If one dataframe has more hierarchical column levels than the other, blank
  strings are added as the upper column levels of the other dataframe.
  """

  nLevels1 = df1.columns.nlevels
  nLevels2 = df2.columns.nlevels
  diff     = abs( nLevels2 - nLevels1 )

  def pad( df ):
    # prepend `diff` levels of blank strings above the existing column levels,
    # working on a copy so the caller's dataframe is left untouched
    a = [[""] * len( df.columns )] * diff
    a.extend( df.columns.get_level_values( i ) for i in range( df.columns.nlevels ) )
    df = df.copy()
    df.columns = pd.MultiIndex.from_arrays( a )
    return df

  if nLevels1 < nLevels2:
    # df1 has fewer levels: expand it with blank strings, then concat
    df1 = pad( df1 )

  elif nLevels1 > nLevels2:
    # df2 has fewer levels: expand it with blank strings, then concat
    df2 = pad( df2 )

  # with the same number of levels on both sides this is a normal concat
  return pd.concat( [df1, df2 ], axis = 1 )

Now, if we feed it the dataframes

gender  f  m
            
n       2  1
y       2  2

gender        f                         m             
age         old        young          old        young
location london paris london paris london paris london
                                                      
n             1     0      1     0      0     1      0
y             0     1      0     1      1     0      1

we get

             f                         m                   
            old        young          old        young      
         london paris london paris london paris london  f  m
                                                            
n             1     0      1     0      0     1      0  2  1
y             0     1      0     1      1     0      1  2  2

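Note that it would be better to join on the gender category so that both sets of columns sit on the same level, but this function was mainly built to join dataframes with completely different columns.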

Answer 3 (score: 0)

I made a wrapper for the pandas.concat function that accepts dataframes with an unequal number of levels.

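The empty levels are added from below. The advantage is that a series can still be accessed with df_cols.c (for a frame df_cols concatenated along the columns), and that when printed, it is clear that 'c' does not sit under ('CC', 'one').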

Hope this helps someone.

The wrapper function:

import pandas as pd

def concat(dfs, axis=0, **kwargs):
    """
    Wrapper for `pandas.concat`; concatenate pandas objects even if they have
    an unequal number of levels on the concatenation axis.

    Levels containing empty strings are added from below (when concatenating along
    columns) or right (when concatenating along rows) to match the maximum number
    found in the dataframes.

    Parameters
    ----------
    dfs : Iterable
        Dataframes that must be concatenated.
    axis : int, optional
        Axis along which concatenation must take place. The default is 0.

    Returns
    -------
    pd.DataFrame
        Concatenated Dataframe.

    Notes
    -----
    Any keyword arguments are passed on to the `pandas.concat` function.

    See also
    --------
    pandas.concat
    """
    def index(df):
        return df.columns if axis==1 else df.index

    def add_levels(df):
        need = want - index(df).nlevels
        if need > 0:
            df = pd.concat([df], keys=[('',)*need], axis=axis) # prepend empty levels
            for i in range(want-need): # move empty levels to bottom
                df = df.swaplevel(i, i+need, axis=axis)
        return df

    dfs = list(dfs) # accept any iterable; it is traversed twice below
    want = max(index(df).nlevels for df in dfs)
    dfs = [add_levels(df) for df in dfs]
    return pd.concat(dfs, axis=axis, **kwargs)
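
To illustrate the two steps the wrapper relies on, here is a stripped-down sketch for the column case with a single missing level. The frames `wide` and `flat` and their values are hypothetical; only the column labels 'c' and ('CC', 'one') are taken from the description above.

```python
import pandas as pd

# Hypothetical frames: `wide` has two column levels, `flat` has one
wide = pd.DataFrame(
    [[1, 2], [3, 4]],
    columns=pd.MultiIndex.from_tuples([("CC", "one"), ("CC", "two")]),
)
flat = pd.DataFrame({"c": [5, 6]})

# Step 1: wrapping a frame in pd.concat with keys prepends a column level
padded = pd.concat([flat], keys=[""], axis=1)   # columns: [('', 'c')]
# Step 2: swap the levels so the empty one ends up at the bottom
padded = padded.swaplevel(0, 1, axis=1)         # columns: [('c', '')]

df_cols = pd.concat([wide, padded], axis=1)
print(df_cols.columns.tolist())  # [('CC', 'one'), ('CC', 'two'), ('c', '')]
print(df_cols.c.tolist())        # [5, 6]
```

Because the empty string sits on the lowest level, selecting 'c' collapses to a Series, which is what makes the attribute access work.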