Python / Pandas: problem merging with NaN data

Date: 2016-10-31 14:30:43

Tags: python pandas dataframe

I am trying to merge two DataFrames (df and df2) into a new DataFrame (df3) in pandas, using pd.concat as follows:

df3 = pd.concat([df,df2])

This almost works the way I want, but it causes one problem.

df contains data for the current date, and its index is a time series. It looks like this:

                        Facility    Servers   PUE
2016-10-31  00:00:00    6.0         5.0       1.2
2016-10-31  00:30:00    7.0         5.0       1.4
2016-10-31  01:00:00    6.0         5.0       1.2
2016-10-31  01:30:00    6.0         5.0       1.2
2016-10-31  02:00:00    6.0         5.0       1.2

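df2 contains only NaN data. Its index is a time series in the same format as the one in df, but it starts at an earlier date and runs for a full year (i.e. 17520 rows, corresponding to 365 * 48 thirty-minute intervals). It basically looks like this: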

                        Facility    Servers   PUE
2016-10-01  00:00:00    NaN         NaN       NaN
2016-10-01  00:30:00    NaN         NaN       NaN
2016-10-01  01:00:00    NaN         NaN       NaN
2016-10-01  01:30:00    NaN         NaN       NaN
2016-10-01  02:00:00    NaN         NaN       NaN
2016-10-01  02:30:00    NaN         NaN       NaN
<continues to 17520 rows, i.e. one year of 30 minute time intervals>

When I apply: df3 = pd.concat([df,df2])

and then run df3.head(), I get the following:

                        Facility    Servers   PUE
2016-10-31  00:00:00    6.0         5.0       1.2
2016-10-31  00:30:00    7.0         5.0       1.4
2016-10-31  01:00:00    6.0         5.0       1.2
2016-10-31  01:30:00    6.0         5.0       1.2
2016-10-31  02:00:00    6.0         5.0       1.2
2016-10-31  02:30:00    NaN         NaN       NaN
2016-10-31  03:00:00    NaN         NaN       NaN
2016-10-31  03:30:00    NaN         NaN       NaN
<continues to the end of the year>

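In other words, the code appears to drop all of the NaN rows for the time intervals that occur before the data in df. Can anyone suggest how to keep all of the data from df2 and replace it with the data from df only for the corresponding time intervals?

2 Answers:

Answer 0: (score: 1)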

I think you need reindex with the union of the two indexes:

print (df2.index.union(df.index))
DatetimeIndex(['2016-10-01 00:00:00', '2016-10-01 00:30:00',
               '2016-10-01 01:00:00', '2016-10-01 01:30:00',
               '2016-10-01 02:00:00', '2016-10-01 02:30:00',
               '2016-10-31 00:00:00', '2016-10-31 00:30:00',
               '2016-10-31 01:00:00', '2016-10-31 01:30:00',
               '2016-10-31 02:00:00'],
              dtype='datetime64[ns]', freq=None)

df = df.reindex(df2.index.union(df.index))
print (df)
                     Facility  Servers  PUE
2016-10-01 00:00:00       NaN      NaN  NaN
2016-10-01 00:30:00       NaN      NaN  NaN
2016-10-01 01:00:00       NaN      NaN  NaN
2016-10-01 01:30:00       NaN      NaN  NaN
2016-10-01 02:00:00       NaN      NaN  NaN
2016-10-01 02:30:00       NaN      NaN  NaN
2016-10-31 00:00:00       6.0      5.0  1.2
2016-10-31 00:30:00       7.0      5.0  1.4
2016-10-31 01:00:00       6.0      5.0  1.2
2016-10-31 01:30:00       6.0      5.0  1.2
2016-10-31 02:00:00       6.0      5.0  1.2
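For anyone who wants to try the reindex approach end to end, here is a minimal sketch. The frames are hypothetical stand-ins built to resemble the ones in the question (three rows for df, a full year of 30-minute slots for df2), and the names cols and full_index are only for illustration. Note that reindex keeps nothing from df2 except its index, so this is only equivalent to a merge because df2 holds nothing but NaN.

```python
import numpy as np
import pandas as pd

cols = ["Facility", "Servers", "PUE"]

# Hypothetical stand-in for df: a few readings for the current date.
df = pd.DataFrame(
    [[6.0, 5.0, 1.2], [7.0, 5.0, 1.4], [6.0, 5.0, 1.2]],
    index=pd.date_range("2016-10-31 00:00", periods=3, freq="30min"),
    columns=cols,
)

# Hypothetical stand-in for df2: one year of 30-minute slots, all NaN.
df2 = pd.DataFrame(
    np.nan,
    index=pd.date_range("2016-10-01 00:00", periods=365 * 48, freq="30min"),
    columns=cols,
)

# Build the combined, sorted index, then stretch df onto it.
# Slots that df does not cover are filled with NaN.
full_index = df2.index.union(df.index)
df3 = df.reindex(full_index)

print(df3.loc["2016-10-30 23:00":"2016-10-31 01:30"])
```

The result starts at the first timestamp of df2 rather than at the first row of df, which is the layout the question asks for.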


Answer 1: (score: 1)

Use combine_first:

result = df.combine_first(df2)

The result takes values from the right DataFrame (df2) only where they are missing in the left DataFrame (df).
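A small sketch of that precedence rule, using made-up single-column frames rather than the data from the question. The 9.9 values in df2 are placeholders chosen so it is easy to see which frame each cell came from.

```python
import numpy as np
import pandas as pd

# Four consecutive 30-minute slots spanning midnight (illustrative only).
idx = pd.date_range("2016-10-30 23:00", periods=4, freq="30min")

# Left frame: covers only the last two slots, and one value is missing.
df = pd.DataFrame({"PUE": [1.2, np.nan]}, index=idx[2:])

# Right frame: covers all four slots, first value missing.
df2 = pd.DataFrame({"PUE": [np.nan, 9.9, 9.9, 9.9]}, index=idx)

# Values from df win; df2 only fills slots that are absent or NaN in df.
result = df.combine_first(df2)
print(result)
```

The output index is the sorted union of both indexes. The 00:00 slot keeps 1.2 from df, while the 00:30 slot, which is NaN in df, is filled from df2. With an all-NaN df2, as in the question, there is nothing to fill in, so the effect is the same as the reindex approach.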