I'm trying to fill a DataFrame with zeros, but I don't want to touch the leading NaNs:
import numpy as np
import pandas as pd
rng = pd.date_range('2016-06-01', periods=9, freq='D')
df = pd.DataFrame({'data': pd.Series([np.nan]*3 + [20, 30, 40] + [np.nan]*3, rng)})
2016-06-01 NaN
2016-06-02 NaN
2016-06-03 NaN
2016-06-04 20.0
2016-06-05 30.0
2016-06-06 40.0
2016-06-07 NaN
2016-06-08 NaN
2016-06-09 NaN
After filling/replacing, what I want is:
pd.DataFrame({'data': pd.Series([np.nan]*3 + [20, 30, 40] + [0.]*3, rng)})
2016-06-01 NaN
2016-06-02 NaN
2016-06-03 NaN
2016-06-04 20.0
2016-06-05 30.0
2016-06-06 40.0
2016-06-07 0.0
2016-06-08 0.0
2016-06-09 0.0
Since fillna() only accepts either a value or a method, and fillna(0) replaces all NaNs including the leading ones, I hoped replace could help here, but
df.replace([np.nan], 0, method='ffill')
also replaces all NaNs.
How can I fill values only after the first non-NaN value, including when there are multiple data columns?
Answer 0: (score: 4)
You can do this with the first_valid_index() and last_valid_index() functions:
In [80]: df
Out[80]:
data data1 data2
2016-06-01 NaN NaN NaN
2016-06-02 NaN NaN 10.0
2016-06-03 NaN 20.0 20.0
2016-06-04 20.0 30.0 20.0
2016-06-05 NaN 40.0 NaN
2016-06-06 40.0 30.0 40.0
2016-06-07 NaN NaN NaN
2016-06-08 NaN NaN NaN
2016-06-09 NaN NaN NaN
In [81]: %paste
first_valid_idx = df.apply(lambda x: x.first_valid_index()).to_frame()
df = df.fillna(0)
for ix, r in first_valid_idx.iterrows():
    df.loc[df.index < r[0], ix] = np.nan
## -- End pasted text --
In [82]: df
Out[82]:
data data1 data2
2016-06-01 NaN NaN NaN
2016-06-02 NaN NaN 10.0
2016-06-03 NaN 20.0 20.0
2016-06-04 20.0 30.0 20.0
2016-06-05 0.0 40.0 0.0
2016-06-06 40.0 30.0 40.0
2016-06-07 0.0 0.0 0.0
2016-06-08 0.0 0.0 0.0
2016-06-09 0.0 0.0 0.0
In [83]: first_valid_idx
Out[83]:
0
data 2016-06-04
data1 2016-06-03
data2 2016-06-02
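One case the loop above does not cover is a column with no valid values at all, for which first_valid_index() returns None. The following is only a sketch of the same idea wrapped in a function that skips such columns. The function name fill_after_first_valid, the demo frame, and the 'empty' column are made up for illustration, and it assumes unique column names and a sorted, unique index.

```python
import numpy as np
import pandas as pd


def fill_after_first_valid(df, value=0):
    """Fill NaNs with `value`, but only from each column's first valid entry onward."""
    out = df.copy()
    for col in out.columns:
        first = out[col].first_valid_index()
        if first is None:
            # Column has no valid values at all: leave it untouched.
            continue
        # Label-based slice from the first valid label to the end of the column.
        out.loc[first:, col] = out.loc[first:, col].fillna(value)
    return out


rng = pd.date_range('2016-06-01', periods=9, freq='D')
demo = pd.DataFrame({
    'data': [np.nan] * 3 + [20, np.nan, 40] + [np.nan] * 3,
    'empty': [np.nan] * 9,  # hypothetical all-NaN column
}, index=rng)

filled = fill_after_first_valid(demo)
print(filled)
```

Working on a copy keeps the original frame intact, which makes it easier to compare before and after.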
OLD answer:
In [38]: df.loc[df.index > df.data.last_valid_index(), 'data'] = 0
In [39]: df
Out[39]:
data
2016-06-01 NaN
2016-06-02 NaN
2016-06-03 NaN
2016-06-04 20.0
2016-06-05 30.0
2016-06-06 40.0
2016-06-07 0.0
2016-06-08 0.0
2016-06-09 0.0
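Note that the old answer only touches the NaNs after the last valid value, so a gap in the middle of the data would stay NaN, unlike in the loop version above. If that is the behaviour you want for several columns at once, one possible sketch uses bfill() to locate the trailing NaNs. This is a swapped-in technique, not the code from the answer, and the frame name gappy is made up for illustration.

```python
import numpy as np
import pandas as pd

rng = pd.date_range('2016-06-01', periods=9, freq='D')
gappy = pd.DataFrame(
    {'data': [np.nan] * 3 + [20, np.nan, 40] + [np.nan] * 3},
    index=rng,
)

# After a backward fill, NaN survives only where no valid value follows,
# i.e. in the trailing run (or everywhere in a column that is entirely NaN).
trailing = gappy.bfill().isna()

# Replace just those trailing NaNs; the interior gap on 2016-06-05 stays NaN.
trailing_only = gappy.mask(trailing, 0)
print(trailing_only)
```

Be aware that a column consisting only of NaNs would be filled completely by this sketch, since every position counts as trailing.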
Answer 1: (score: 3)
I think you can use mask together with notnull and cumsum to find the leading NaNs:
print (df.data.notnull().cumsum())
2016-06-01 0
2016-06-02 0
2016-06-03 0
2016-06-04 1
2016-06-05 2
2016-06-06 3
2016-06-07 3
2016-06-08 3
2016-06-09 3
Freq: D, Name: data, dtype: int32
print (df.data.mask(df.data.notnull().cumsum() != 0, df.data.fillna(0)))
2016-06-01 NaN
2016-06-02 NaN
2016-06-03 NaN
2016-06-04 20.0
2016-06-05 30.0
2016-06-06 40.0
2016-06-07 0.0
2016-06-08 0.0
2016-06-09 0.0
Freq: D, Name: data, dtype: float64
That is, take notnull, then cumsum to find the leading NaNs, and fillna all other values:
df = pd.DataFrame({'data': pd.Series([np.nan]*3 + [20, 30, 40] + [np.nan]*3, rng),
'data1': pd.Series([np.nan]*2 + [20, 30, 40,30] + [np.nan]*3, rng),
'data2': pd.Series([np.nan]*1 + [10,20, 20, 30, 40] + [np.nan]*3, rng)})
print (df.mask(df.notnull().cumsum() != 0, df.fillna(0)))
data data1 data2
2016-06-01 NaN NaN NaN
2016-06-02 NaN NaN 10.0
2016-06-03 NaN 20.0 20.0
2016-06-04 20.0 30.0 20.0
2016-06-05 30.0 40.0 30.0
2016-06-06 40.0 30.0 40.0
2016-06-07 0.0 0.0 0.0
2016-06-08 0.0 0.0 0.0
2016-06-09 0.0 0.0 0.0
EDIT:
It also works nicely for multiple columns, here with cummax instead of cumsum:
print (df.mask(df.notnull().cummax(), df.fillna(0)))
data data1 data2
2016-06-01 NaN NaN NaN
2016-06-02 NaN NaN 10.0
2016-06-03 NaN 20.0 20.0
2016-06-04 20.0 30.0 20.0
2016-06-05 30.0 40.0 30.0
2016-06-06 40.0 30.0 40.0
2016-06-07 0.0 0.0 0.0
2016-06-08 0.0 0.0 0.0
2016-06-09 0.0 0.0 0.0
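For reuse, the same masking idea can be wrapped in a small function. The sketch below is an assumption-laden variant, not the code from the answer: it builds the "a valid value has been seen" mask with ffill().notna() rather than cumsum or cummax, and the names fill_after_first_valid_vectorized and df3 are made up for illustration.

```python
import numpy as np
import pandas as pd


def fill_after_first_valid_vectorized(df, value=0):
    """Fill NaNs with `value` except in each column's leading run of NaNs."""
    # A forward fill leaves NaN only in the leading run, so notna() is
    # True from each column's first valid value onward.
    seen = df.ffill().notna()
    # Fill everything, then restore NaN where no valid value has been seen yet.
    return df.fillna(value).where(seen)


rng = pd.date_range('2016-06-01', periods=9, freq='D')
df3 = pd.DataFrame({
    'data': [np.nan] * 3 + [20, 30, 40] + [np.nan] * 3,
    'data1': [np.nan] * 2 + [20, 30, 40, 30] + [np.nan] * 3,
    'data2': [np.nan] * 1 + [10, 20, 20, 30, 40] + [np.nan] * 3,
}, index=rng)

result = fill_after_first_valid_vectorized(df3)
print(result)
```

Because the mask is computed per column, each column keeps its own leading NaNs regardless of where the other columns start.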