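Hi,
I want to combine DataFrames (which happen to have a MultiIndex) into a larger DataFrame. Sometimes data needs to be appended (adding new rows or columns), and sometimes existing data needs to be updated. Somehow I can't find a way to do both: it is either an append (using .append()) or some kind of update (.merge(), .update()). I have searched for this and read the documentation, but couldn't figure it out.
Here is the test code: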
import pandas as pd
import numpy as np
zones = ['A', 'B', 'C']
# input data frames
dates0 = pd.date_range('20180101', '20180102', freq='D')
dates1 = pd.date_range('20180103', '20180104', freq='D')
idx00 = pd.MultiIndex.from_product(iterables=[dates0, [zones[0]]], names= ['UTC', 'zone'])
df00 = pd.DataFrame(index=idx00, columns=['a', 'b'], data=[[1, 2], [3, 4]])
idx01 = pd.MultiIndex.from_product(iterables=[dates1, [zones[0]]], names=['UTC', 'zone'])
df01 = pd.DataFrame(index=idx01, columns=['a', 'b'], data=[[5, 6], [7, 8]])
idx10 = pd.MultiIndex.from_product(iterables=[dates0, [zones[1]]], names=['UTC', 'zone'])
df10 = pd.DataFrame(index=idx10, columns=['b', 'c'], data=np.random.rand(2, 2))
idx11 = pd.MultiIndex.from_product(iterables=[dates1, [zones[1]]], names=['UTC', 'zone'])
df11 = pd.DataFrame(index=idx11, columns=['b', 'c'], data=np.random.rand(2, 2))
# append - works, but only if the data is not yet there
df_append = df00.append(df01)
df_append = df_append.append(df10)
df_append = df_append.append(df11)
df_append.sort_index(inplace=True)
df_append
# append adds a second data point, where there should only be one
df00b = pd.DataFrame(index=idx00, columns=['a', 'b'], data=[[10, 20], [30, 40]])
df_append2 = df_append.append(df00b)
df_append2.sort_index(inplace=True)
df_append2.loc[('2018-01-01', 'A'), :]
# merge - does not do what I want, changes column names
df_merge = df00.merge(df01, how='outer', left_index=True, right_index=True)
df_merge
# update - does not do what I want, does not add new columns
df_update = df00.copy()  # copy, so that update() does not modify df00 in place
df_update.update(df01)
df_update
# join - raises "ValueError: columns overlap but no suffix specified",
# because both frames have columns a and b and join wants to create new columns
df_join = df00.join(df01)
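**My problem**
.append() only works if the region (index + columns) of the right DataFrame is not already present in the left DataFrame. Otherwise it just adds a second data point at the same index/column.
.merge() renames the columns if they exist in both the left and the right DataFrame. But I want the column names to stay the same, and the data to be updated if it already exists.
.update() does not append data if the column/row does not exist.
.join() raises an error.
What I need is an "update + append if not present". Any idea how to do this?
Thanks in advance, Theo
PS: Output from the code above: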
df_append
                   a         b         c
UTC        zone
2018-01-01 A     1.0  2.000000       NaN
           B     NaN  0.100551  0.271616
2018-01-02 A     3.0  4.000000       NaN
           B     NaN  0.489322  0.606215
2018-01-03 A     5.0  6.000000       NaN
           B     NaN  0.245451  0.242021
2018-01-04 A     7.0  8.000000       NaN
           B     NaN  0.047900  0.642140
df_append2.loc[('2018-01-01', 'A'), :]
                    a     b    c
UTC        zone
2018-01-01 A      1.0   2.0  NaN
           A     10.0  20.0  NaN
df_merge
                 a_x  b_x  a_y  b_y
UTC        zone
2018-01-01 A     1.0  2.0  NaN  NaN
2018-01-02 A     3.0  4.0  NaN  NaN
2018-01-03 A     NaN  NaN  5.0  6.0
2018-01-04 A     NaN  NaN  7.0  8.0
Answer 0 (score: 1)
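It looks like you can use pd.concat() or df00.append(); both will do this (note that DataFrame.append() was removed in pandas 2.0, so on current versions only pd.concat() applies). Using your sample data, we can combine the frames like this: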
pd.concat([df00, df01])
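As a rough sketch of what that gives when the frames do not share all columns, the following rebuilds three of the question's frames and concatenates them in one call. The helper name make_index and the fixed numbers in df10 are my own stand-ins (the question uses random values there), so treat this as an illustration rather than the exact original data.

```python
import pandas as pd

dates0 = pd.date_range('20180101', '20180102', freq='D')
dates1 = pd.date_range('20180103', '20180104', freq='D')


def make_index(dates, zone):
    # Hypothetical helper: builds the (UTC, zone) MultiIndex used in the question.
    return pd.MultiIndex.from_product([dates, [zone]], names=['UTC', 'zone'])


df00 = pd.DataFrame([[1, 2], [3, 4]], index=make_index(dates0, 'A'), columns=['a', 'b'])
df01 = pd.DataFrame([[5, 6], [7, 8]], index=make_index(dates1, 'A'), columns=['a', 'b'])
# Fixed stand-in values instead of np.random.rand(2, 2)
df10 = pd.DataFrame([[0.1, 0.2], [0.3, 0.4]], index=make_index(dates0, 'B'), columns=['b', 'c'])

# concat stacks the rows and takes the union of the columns;
# cells that a frame does not provide are filled with NaN.
combined = pd.concat([df00, df01, df10]).sort_index()
print(combined)
```

This only covers the "append" half of the question: rows whose index already exists would simply be stacked a second time.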
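You can pass verify_integrity=True to either of them so that an error is raised when the index contains duplicates. Alternatively, if overlapping values are expected and you want to avoid the error, you can concatenate/append and combine that with .drop_duplicates():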
df_concat = pd.concat([df00, df01]).drop_duplicates(keep='last')
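To make the difference between the two options concrete, here is a small sketch using df00 and the df00b frame from the question (same index, different values). The variable names raised and deduped are only for illustration.

```python
import pandas as pd

idx = pd.MultiIndex.from_product(
    [pd.date_range('20180101', '20180102', freq='D'), ['A']],
    names=['UTC', 'zone'],
)
df00 = pd.DataFrame([[1, 2], [3, 4]], index=idx, columns=['a', 'b'])
df00b = pd.DataFrame([[10, 20], [30, 40]], index=idx, columns=['a', 'b'])

# verify_integrity=True refuses to build a result with repeated index entries.
try:
    pd.concat([df00, df00b], verify_integrity=True)
    raised = False
except ValueError:
    raised = True

# drop_duplicates() compares row values only and ignores the index, so both
# rows for each (UTC, zone) key survive here because their values differ.
deduped = pd.concat([df00, df00b]).drop_duplicates(keep='last')
print(raised, len(deduped))
```

So verify_integrity detects the overlap but does not resolve it, and drop_duplicates() only helps when the overlapping rows carry identical values.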
Since the above drops duplicate rows without regard to the index, you can try this approach instead:
Sample df (with duplicate row values, but no duplicate index entries):
                 a  b
UTC        zone
2018-01-01 A     1  2
2018-01-02 A     3  4
2018-01-03 A     1  2
2018-01-04 A     7  8
df_concat = pd.concat([df00, df01]).groupby(level=[0,1]).last()
                 a  b
UTC        zone
2018-01-01 A     1  2
2018-01-02 A     3  4
2018-01-03 A     1  2
2018-01-04 A     7  8
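For the "update + append if not present" behaviour asked for in the question, one possible sketch is DataFrame.combine_first(), which is a different technique from the groupby approach above. The helper name upsert and the fixed numbers in df_new_rows are assumptions of mine, not part of the original post. An index-based alternative that keeps only the last row per key is shown alongside it for comparison.

```python
import pandas as pd

dates0 = pd.date_range('20180101', '20180102', freq='D')
idx_a = pd.MultiIndex.from_product([dates0, ['A']], names=['UTC', 'zone'])
idx_b = pd.MultiIndex.from_product([dates0, ['B']], names=['UTC', 'zone'])

df_old = pd.DataFrame([[1, 2], [3, 4]], index=idx_a, columns=['a', 'b'])
# New rows and a new column 'c' (stand-in values, not the question's random data)
df_new_rows = pd.DataFrame([[0.1, 0.2], [0.3, 0.4]], index=idx_b, columns=['b', 'c'])
# Same index as df_old, but with updated values
df_overwrite = pd.DataFrame([[10, 20], [30, 40]], index=idx_a, columns=['a', 'b'])


def upsert(old, new):
    """Hypothetical helper: non-null cells of `new` win, the rest comes from `old`."""
    return new.combine_first(old)


result = upsert(upsert(df_old, df_new_rows), df_overwrite)
print(result)

# Alternative: stack everything, then keep only the last row per index key.
stacked = pd.concat([df_old, df_new_rows, df_overwrite])
latest = stacked[~stacked.index.duplicated(keep='last')].sort_index()
print(latest)
```

The two variants differ in granularity: combine_first() works cell by cell, so a NaN in the newer frame never overwrites an existing value, while the index.duplicated() variant keeps the newest row as a whole, including NaN in columns that frame did not have. Which one fits depends on whether a missing value in the newer data should count as an update.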