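I have collected overlapping data from a series of instruments. I want to merge them into a single pandas data structure in which, for each column, the most recent available value takes priority if it is not NaN, and otherwise the older value is kept.
The code below produces the desired output, but it takes a lot of code for such a simple task. In addition, the last step identifies duplicated index values, and I am not sure I can rely on the "last" part, because df.combine_first(other) reorders the data. Is there a more compact, efficient and/or predictable way to do this?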
import numpy as np
import pandas as pd

# set up the data
df0 = pd.DataFrame({"x": [0., 1., 2., 3., 4.], "y": [0., 1., 2., 3., np.nan], "t": [0, 1, 2, 3, 4]})  # oldest/lowest priority
df1 = pd.DataFrame({"x": [np.nan, 4.1, 5.1, 6.1], "y": [3.1, 4.1, 5.1, 6.1], "t": [3, 4, 5, 6]})
df2 = pd.DataFrame({"x": [8.2, 10.2], "t": [8, 10]})  # newest/highest priority
df0.set_index("t", inplace=True)
df1.set_index("t", inplace=True)
df2.set_index("t", inplace=True)
# this concatenates, leaving redundant indices in df0, df1, df2
dfmerge = pd.concat((df0, df1, df2), sort=True)
print("dfmerge, with duplicate rows and interlaced NaN data")
print(dfmerge)
# Now apply, in priority order, each of the original dataframes to fill the original
dfmerge2 = dfmerge.copy()
for ddf in (df2, df1, df0):
    dfmerge2 = dfmerge2.combine_first(ddf)
print("\ndfmerge2, fillable NaNs filled but duplicate indices now reordered")
print(dfmerge2)  # row order has changed unpredictably
# finally, drop duplicate indices, keeping the last row for each index value
dfmerge3 = dfmerge2.copy()
dfmerge3 = dfmerge3.loc[~dfmerge3.index.duplicated(keep='last')]
print("dfmerge3, final")
print(dfmerge3)
Its output is:
dfmerge, with duplicate rows and interlaced NaN data
       x    y
t
0    0.0  0.0
1    1.0  1.0
2    2.0  2.0
3    3.0  3.0
4    4.0  NaN
3    NaN  3.1
4    4.1  4.1
5    5.1  5.1
6    6.1  6.1
8    8.2  NaN
10  10.2  NaN
dfmerge2, fillable NaNs filled but duplicate indices now reordered
       x    y
t
0    0.0  0.0
1    1.0  1.0
2    2.0  2.0
3    3.0  3.0
3    3.0  3.1
4    4.0  4.1
4    4.1  4.1
5    5.1  5.1
6    6.1  6.1
8    8.2  NaN
10  10.2  NaN
dfmerge3, final
       x    y
t
0    0.0  0.0
1    1.0  1.0
2    2.0  2.0
3    3.0  3.1
4    4.1  4.1
5    5.1  5.1
6    6.1  6.1
8    8.2  NaN
10  10.2  NaN
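The priority rule described in the question (newest non-NaN value wins per column, older values only fill gaps) can also be written as a chain of combine_first calls with no concatenation step. The following is only a sketch under the assumption that the frames are listed from oldest to newest; the names frames and merged are illustrative and do not come from the original code.

```python
from functools import reduce

import numpy as np
import pandas as pd

# Same sample data as in the question, indexed by t, listed oldest to newest.
df0 = pd.DataFrame({"x": [0., 1., 2., 3., 4.], "y": [0., 1., 2., 3., np.nan]},
                   index=pd.Index([0, 1, 2, 3, 4], name="t"))
df1 = pd.DataFrame({"x": [np.nan, 4.1, 5.1, 6.1], "y": [3.1, 4.1, 5.1, 6.1]},
                   index=pd.Index([3, 4, 5, 6], name="t"))
df2 = pd.DataFrame({"x": [8.2, 10.2]}, index=pd.Index([8, 10], name="t"))
frames = [df0, df1, df2]  # hypothetical name: oldest first, newest last

# newer.combine_first(older) keeps the non-NaN cells of `newer` and fills
# the remaining cells (and missing rows/columns) from `older`.
merged = reduce(lambda older, newer: newer.combine_first(older), frames)
print(merged)
```

Because each input frame has a unique index, no duplicate labels are ever created, so there is nothing to drop afterwards and no dependence on keep='last'.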
Answer 0 (score: 0)
In your case:
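# stack all frames; duplicate t labels are kept
s = pd.concat([df0, df1, df2], sort=False)
# sort each column independently (NaN goes last); values no longer stay paired with their original t label
s[:] = np.sort(s, axis=0)
# drop rows that are NaN in every column
s = s.dropna(thresh=1)
s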
      x    y
t
0   0.0  0.0
1   1.0  1.0
2   2.0  2.0
3   3.0  3.0
4   4.0  3.1
3   4.1  4.1
4   5.1  5.1
5   6.1  6.1
6   8.2  NaN
8  10.2  NaN
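Note that np.sort(s, axis=0) sorts each column on its own, so in the output above the values are no longer aligned with their t labels (for example, x = 4.1 appears at t = 3 and the index still contains duplicates). If the goal is the dfmerge3 table from the question, one possible alternative is to concatenate from oldest to newest and take the last non-NaN value per index label. This is a sketch, not part of the original answer; the names stacked and merged are illustrative, and it relies on GroupBy.last skipping NaN by default.

```python
import numpy as np
import pandas as pd

# Same sample data as in the question, indexed by t.
df0 = pd.DataFrame({"x": [0., 1., 2., 3., 4.], "y": [0., 1., 2., 3., np.nan]},
                   index=pd.Index([0, 1, 2, 3, 4], name="t"))
df1 = pd.DataFrame({"x": [np.nan, 4.1, 5.1, 6.1], "y": [3.1, 4.1, 5.1, 6.1]},
                   index=pd.Index([3, 4, 5, 6], name="t"))
df2 = pd.DataFrame({"x": [8.2, 10.2]}, index=pd.Index([8, 10], name="t"))

# Concatenate oldest first so that, within each t label, newer rows come later.
stacked = pd.concat([df0, df1, df2], sort=True)

# For every t label, take the last non-NaN value in each column;
# a column stays NaN only if no frame has a value for that label.
merged = stacked.groupby(level=0).last()
print(merged)
```

The result has one row per t label, sorted by t, and the priority depends only on the order of the list passed to pd.concat.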