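I am trying to build a database from existing ones. After merging the information I need, I end up with rows that are duplicated across both sources.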
2018-11-22 Iraq 13984.75 3000.0 NaN
2018-11-22 Iraq NaN NaN Heavy Rain
Desired output:
2018-11-22 Iraq 13984.75 3000.0 Heavy Rain
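Now I want to merge them into one row. As the sample rows show, almost every value appears in only one of the two rows, while the other row has NaN, so I want to replace each NaN with the value from the other row. However, some values, such as the end date, may be defined in both rows; in that case I want to keep the larger one.
Is there a way to do this with pandas?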
Answer 0 (score: 3)
I believe you need:
import numpy as np
import pandas as pd

# Two sample frames sharing the same DatetimeIndex, each with NaN gaps
df1 = pd.DataFrame({
    'A': list('abcdef'),
    'B': [4, np.nan, 4, 50, 5, np.nan],
    'C': [7, np.nan, 9, 4, 2, 3],
    'E': [np.nan, 30, 60, 9, np.nan, 4],
    'F': ['s', 'd', 'f', np.nan, 'r', np.nan]
}, index=pd.date_range('2011-01-01', periods=6))
df2 = pd.DataFrame({
    'A': list('ertyui'),
    'B': [4, np.nan, 6, 5, 5, 8],
    'C': [7, np.nan, 9, 20, 2, 3],
    'E': [8, np.nan, 3, 6, 90, np.nan],
    'F': [np.nan, 'd', np.nan, 'f', 'r', np.nan]
}, index=pd.date_range('2011-01-01', periods=6))
First, concat both DataFrames:
df = pd.concat([df1, df2])
print (df)
A B C E F
2011-01-01 a 4.0 7.0 NaN s
2011-01-02 b NaN NaN 30.0 d
2011-01-03 c 4.0 9.0 60.0 f
2011-01-04 d 50.0 4.0 9.0 NaN
2011-01-05 e 5.0 2.0 NaN r
2011-01-06 f NaN 3.0 4.0 NaN
2011-01-01 e 4.0 7.0 8.0 NaN
2011-01-02 r NaN NaN NaN d
2011-01-03 t 6.0 9.0 3.0 NaN
2011-01-04 y 5.0 20.0 6.0 f
2011-01-05 u 5.0 2.0 90.0 r
2011-01-06 i 8.0 3.0 NaN NaN
Then select only the numeric columns with select_dtypes and aggregate max for each index value:
# max(level=0) was removed in pandas 2.0; groupby(level=0).max() is the equivalent
df11 = df.select_dtypes(np.number).groupby(level=0).max()
print (df11)
B C E
2011-01-01 4.0 7.0 8.0
2011-01-02 NaN NaN 30.0
2011-01-03 6.0 9.0 60.0
2011-01-04 50.0 20.0 9.0
2011-01-05 5.0 2.0 90.0
2011-01-06 8.0 3.0 4.0
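The question also mentions an end date that can be set in both rows. Note that np.number alone does not pick up datetime columns, so they would fall through this step. A minimal sketch of one way to include them, assuming a datetime64 column (the names amount, end_date and event are made up for illustration, since the question does not show its real headers):

```python
import numpy as np
import pandas as pd

# Hypothetical data: two rows for the same index value
idx = pd.to_datetime(["2018-11-22", "2018-11-22"])
df = pd.DataFrame({
    "amount": [13984.75, np.nan],
    "end_date": pd.to_datetime(["2018-11-25", "2018-11-30"]),
    "event": [np.nan, "Heavy Rain"],
}, index=idx)

# Select numeric and datetime columns together, then keep the larger value
num_or_date = df.select_dtypes(include=[np.number, "datetime"])
largest = num_or_date.groupby(level=0).max()
print(largest)
```

max ignores NaN (and NaT), so a value present in only one row survives, and the later date wins when both rows have one.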
For the string columns, aggregate with first, which returns the first non-NaN value of each group:
df12 = df.select_dtypes(object).groupby(level=0).first()
print (df12)
A F
2011-01-01 a s
2011-01-02 b d
2011-01-03 c f
2011-01-04 d f
2011-01-05 e r
2011-01-06 f NaN
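To see why first is the right aggregation here, it may help to compare it with simply taking the first row of each group. A small sketch on a made-up Series (not data from the question):

```python
import numpy as np
import pandas as pd

# Index values 1 and 2 each appear twice; group 2 starts with NaN
s = pd.Series(["s", np.nan, np.nan, "f"], index=[1, 2, 1, 2])

# first() skips NaN inside each group
first_valid = s.groupby(level=0).first()

# head(1) is purely positional and keeps a leading NaN
first_row = s.groupby(level=0).head(1)

print(first_valid)
print(first_row)
```

For group 2, first() returns 'f' while head(1) returns NaN, which is the behaviour the merge relies on.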
Finally, concat the two results together and use reindex to restore the original column order:
out = pd.concat([df11, df12], axis=1).reindex(columns=df.columns)
print (out)
A B C E F
2011-01-01 a 4.0 7.0 8.0 s
2011-01-02 b NaN NaN 30.0 d
2011-01-03 c 6.0 9.0 60.0 f
2011-01-04 d 50.0 20.0 9.0 f
2011-01-05 e 5.0 2.0 90.0 r
2011-01-06 f 8.0 3.0 4.0 NaN
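If the duplicates are identified by ordinary columns rather than the index, the same idea can probably be written as a single groupby with one aggregation per column. This is only a sketch applied to the two rows from the question; the column names (date, country, value_a, value_b, event) are assumptions, because the question does not show its headers:

```python
import numpy as np
import pandas as pd

# Hypothetical column names for the two sample rows in the question
df = pd.DataFrame({
    "date": ["2018-11-22", "2018-11-22"],
    "country": ["Iraq", "Iraq"],
    "value_a": [13984.75, np.nan],
    "value_b": [3000.0, np.nan],
    "event": [np.nan, "Heavy Rain"],
})

keys = ["date", "country"]

# max for numeric columns, first non-NaN value for everything else
agg = {
    col: "max" if pd.api.types.is_numeric_dtype(df[col]) else "first"
    for col in df.columns
    if col not in keys
}

merged = df.groupby(keys, as_index=False).agg(agg)
print(merged)
```

Building the dict from the dtypes keeps the column order and avoids the separate concat and reindex, at the cost of being a bit less explicit about which columns get which rule.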