Pandas (multi-index) append / merge / update

Time: 2018-01-22 09:45:43

Tags: pandas merge append multi-index

Hi,

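I want to combine data frames (which happen to have a multi-index) into a larger data frame. Sometimes data needs to be appended (adding new rows or columns), and sometimes existing data needs to be updated. Somehow I cannot find a way that does both: it is either append (using .append()) or some kind of update (.merge(), .update()). I have searched for this and read the documentation, but could not figure it out.

Here is the test code: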

import pandas as pd
import numpy as np

zones = ['A', 'B', 'C']

# input data frames
dates0 = pd.date_range('20180101', '20180102', freq='D')
dates1 = pd.date_range('20180103', '20180104', freq='D')

idx00 = pd.MultiIndex.from_product(iterables=[dates0, [zones[0]]], names=['UTC', 'zone'])
df00 = pd.DataFrame(index=idx00, columns=['a', 'b'], data=[[1, 2], [3, 4]])

idx01 = pd.MultiIndex.from_product(iterables=[dates1, [zones[0]]], names=['UTC', 'zone'])
df01 = pd.DataFrame(index=idx01, columns=['a', 'b'], data=[[5, 6], [7, 8]])

idx10 = pd.MultiIndex.from_product(iterables=[dates0, [zones[1]]], names=['UTC', 'zone'])
df10 = pd.DataFrame(index=idx10, columns=['b', 'c'], data=np.random.rand(2, 2))

idx11 = pd.MultiIndex.from_product(iterables=[dates1, [zones[1]]], names=['UTC', 'zone'])
df11 = pd.DataFrame(index=idx11, columns=['b', 'c'], data=np.random.rand(2, 2))

# append - works, but only if the data is not yet there
# (note: DataFrame.append() was removed in pandas 2.0; pd.concat() is the replacement)
df_append = df00.append(df01)
df_append = df_append.append(df10)
df_append = df_append.append(df11)
df_append.sort_index(inplace=True)
df_append

# append adds a second data point, where there should only be one
df00b = pd.DataFrame(index=idx00, columns=['a', 'b'], data=[[10, 20], [30, 40]])
df_append2 = df_append.append(df00b)
df_append2.sort_index(inplace=True)
df_append2.loc[('2018-01-01', 'A'), :]

# merge - does not do what I want, changes column names
df_merge = df00.merge(df01, how='outer', left_index=True, right_index=True)
df_merge

# update - does not do what I want, does not add new rows or columns
df_update = df00.copy()  # copy, so that update() does not modify df00 in place
df_update.update(df01)
df_update

# join - gives an error, as no suffix defined and join wants to create a new column
df_join = df00
df00.join(df01)

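**My problem** .append() only works if the region (index + columns) in the right data frame is not already present in the left data frame. Otherwise, it simply adds a second data point at the same index/column.

.merge() changes the column names (if they exist in both the left and the right data frame). But I want the column names to stay the same, and the data to be updated if it already exists.

.update() does not append data if the column/row does not exist.

.join() raises an error.

What I need is an "update + append (if it does not exist)". Any idea how to do this?

Thanks in advance, Theo

PS: output from the above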

df_append

                   a         b         c
UTC        zone                         
2018-01-01 A     1.0  2.000000       NaN
           B     NaN  0.100551  0.271616
2018-01-02 A     3.0  4.000000       NaN
           B     NaN  0.489322  0.606215
2018-01-03 A     5.0  6.000000       NaN
           B     NaN  0.245451  0.242021
2018-01-04 A     7.0  8.000000       NaN
           B     NaN  0.047900  0.642140

df_append2.loc[('2018-01-01', 'A'), :]

                    a     b   c
UTC        zone                
2018-01-01 A      1.0   2.0 NaN
           A     10.0  20.0 NaN

df_merge

                 a_x  b_x  a_y  b_y
UTC        zone    
2018-01-01 A     1.0  2.0  NaN  NaN
2018-01-02 A     3.0  4.0  NaN  NaN
2018-01-03 A     NaN  NaN  5.0  6.0
2018-01-04 A     NaN  NaN  7.0  8.0

1 answer:

Answer 0 (score: 1)

It looks like you can use either pd.concat() or df00.append(); both will do this. With your sample data, the frames can be combined like this:

pd.concat([df00, df01])
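
As a fuller sketch of the same idea: pd.concat() accepts a list, so all four sample frames can be combined in one call. This also runs on pandas 2.0 and later, where DataFrame.append() has been removed. The make_frame helper and the fixed values for zone B (used in place of np.random.rand so the result is reproducible) are assumptions made for illustration, not part of the original code.

```python
import pandas as pd

names = ['UTC', 'zone']
dates0 = pd.date_range('20180101', '20180102', freq='D')
dates1 = pd.date_range('20180103', '20180104', freq='D')


def make_frame(dates, zone, columns, data):
    """Hypothetical helper: build one sample frame indexed by (UTC, zone)."""
    idx = pd.MultiIndex.from_product([dates, [zone]], names=names)
    return pd.DataFrame(data, index=idx, columns=columns)


df00 = make_frame(dates0, 'A', ['a', 'b'], [[1, 2], [3, 4]])
df01 = make_frame(dates1, 'A', ['a', 'b'], [[5, 6], [7, 8]])
# Fixed values instead of random ones (assumption, for reproducibility).
df10 = make_frame(dates0, 'B', ['b', 'c'], [[0.1, 0.2], [0.3, 0.4]])
df11 = make_frame(dates1, 'B', ['b', 'c'], [[0.5, 0.6], [0.7, 0.8]])

# Rows are stacked; columns missing from a frame are filled with NaN.
df_all = pd.concat([df00, df01, df10, df11]).sort_index()
print(df_all)
```

This only stacks rows. If two inputs share an index entry, both rows are kept.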

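You can pass verify_integrity=True to either one so that an error is raised when the indexes contain duplicates. Alternatively, to avoid the error when there are overlapping values, you can concatenate/append and combine the result with .drop_duplicates():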

df_concat = pd.concat([df00, df01]).drop_duplicates(keep='last')
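
To make the difference concrete, here is a small sketch using df00 and the overlapping df00b from the question. The variable names overlap_detected and df_values are made up for this example. It shows that verify_integrity=True checks the index, while .drop_duplicates() compares cell values only.

```python
import pandas as pd

names = ['UTC', 'zone']
dates0 = pd.date_range('20180101', '20180102', freq='D')
idx00 = pd.MultiIndex.from_product([dates0, ['A']], names=names)

df00 = pd.DataFrame([[1, 2], [3, 4]], index=idx00, columns=['a', 'b'])
# Same index as df00, but with updated values.
df00b = pd.DataFrame([[10, 20], [30, 40]], index=idx00, columns=['a', 'b'])

# verify_integrity=True raises ValueError when index entries overlap.
try:
    pd.concat([df00, df00b], verify_integrity=True)
    overlap_detected = False
except ValueError:
    overlap_detected = True

# drop_duplicates() ignores the index: the old and the updated rows have
# different values, so all four rows remain.
df_values = pd.concat([df00, df00b]).drop_duplicates(keep='last')

print(overlap_detected)
print(len(df_values))
```

So .drop_duplicates() on its own does not perform an update when the new values differ from the old ones.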


Since the above drops duplicate rows without regard to the index, you can try this approach instead:

sample df (with duplicate rows, not index):
                 a  b
UTC        zone      
2018-01-01 A     1  2
2018-01-02 A     3  4
2018-01-03 A     1  2
2018-01-04 A     7  8

df_concat = pd.concat([df00, df01]).groupby(level=[0,1]).last()
                 a  b
UTC        zone      
2018-01-01 A     1  2
2018-01-02 A     3  4
2018-01-03 A     1  2
2018-01-04 A     7  8
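
The groupby approach can be wrapped into an "update + append" helper. This is a minimal sketch, assuming that "update" means the frame listed later wins. The function name upsert and the fixed values for zone B are hypothetical and not part of the original answer.

```python
import pandas as pd

names = ['UTC', 'zone']
dates0 = pd.date_range('20180101', '20180102', freq='D')
dates1 = pd.date_range('20180103', '20180104', freq='D')
idx00 = pd.MultiIndex.from_product([dates0, ['A']], names=names)
idx01 = pd.MultiIndex.from_product([dates1, ['A']], names=names)
idx10 = pd.MultiIndex.from_product([dates0, ['B']], names=names)

df00 = pd.DataFrame([[1, 2], [3, 4]], index=idx00, columns=['a', 'b'])
df01 = pd.DataFrame([[5, 6], [7, 8]], index=idx01, columns=['a', 'b'])
df10 = pd.DataFrame([[0.1, 0.2], [0.3, 0.4]], index=idx10, columns=['b', 'c'])
# Overlaps with df00 and carries updated values.
df00b = pd.DataFrame([[10, 20], [30, 40]], index=idx00, columns=['a', 'b'])


def upsert(frames):
    """Hypothetical helper: stack the frames, then keep one row per index
    entry, taking the last non-null value of each column."""
    combined = pd.concat(frames)
    return combined.groupby(level=list(combined.index.names)).last()


result = upsert([df00, df01, df10, df00b])
print(result)
```

Note that GroupBy.last() skips missing values by default, so a NaN in a later frame does not overwrite an earlier value. If the whole last row should win, NaN included, combined[~combined.index.duplicated(keep='last')] is an alternative.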