Either I don't understand the documentation, or it is out of date.
If I run
user[["DOC_ACC_DT", "USER_SIGNON_ID"]].groupby("DOC_ACC_DT").agg(["count"]).resample("1D").fillna(value=0, method="ffill")
I get
TypeError: fillna() got an unexpected keyword argument 'value'
If I just run
.fillna(0)
I get
ValueError: Invalid fill method. Expecting pad (ffill), backfill (bfill) or nearest. Got 0
If I then try
.fillna(0, method="ffill")
I get
TypeError: fillna() got multiple values for keyword argument 'method'
So the only thing that works is
.fillna("ffill")
but of course that only does a forward fill. What I want is to replace NaN with zero. What am I doing wrong here?
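Answer 0 (score: 11)
Well, I don't understand why the code above doesn't work, and I'll wait for someone to give a better answer than this one, but I just found that
.replace(np.nan, 0)
does what I expected .fillna(0) to do.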
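A minimal sketch of the same idea, assuming a recent pandas where the resampler has to be turned into a regular Series first (here with asfreq()); the data and variable names are made up for illustration. Once the result is an ordinary Series, .replace(np.nan, 0) and .fillna(0) give the same output:

```python
import numpy as np
import pandas as pd

# Hypothetical data: three observations, two days apart.
idx = pd.date_range("2021-01-01", periods=3, freq="2D")
s = pd.Series([1.0, 2.0, 3.0], index=idx)

# Upsample to daily frequency; asfreq() returns a plain Series
# with NaN in the newly created slots.
daily = s.resample("D").asfreq()

# Both calls operate on the Series, not on the Resampler object.
filled_replace = daily.replace(np.nan, 0)
filled_fillna = daily.fillna(0)

print(filled_fillna)
print(filled_replace.equals(filled_fillna))
```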
Answer 1 (score: 2)
I ran some tests, and the results are quite interesting.
Sample:
import pandas as pd
import numpy as np
np.random.seed(1)
rng = pd.date_range('1/1/2012', periods=20, freq='S')
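df = pd.DataFrame({'a': ['a'] * 10 + ['b'] * 10,
                   'b': np.random.randint(0, 500, len(rng)).astype(float)}, index=rng)
# use .loc instead of chained assignment so the NaNs are written to df itself
df.loc[df.index[3:8], 'b'] = np.nan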
print (df)
a b
2012-01-01 00:00:00 a 37.0
2012-01-01 00:00:01 a 235.0
2012-01-01 00:00:02 a 396.0
2012-01-01 00:00:03 a NaN
2012-01-01 00:00:04 a NaN
2012-01-01 00:00:05 a NaN
2012-01-01 00:00:06 a NaN
2012-01-01 00:00:07 a NaN
2012-01-01 00:00:08 a 335.0
2012-01-01 00:00:09 a 448.0
2012-01-01 00:00:10 b 144.0
2012-01-01 00:00:11 b 129.0
2012-01-01 00:00:12 b 460.0
2012-01-01 00:00:13 b 71.0
2012-01-01 00:00:14 b 237.0
2012-01-01 00:00:15 b 390.0
2012-01-01 00:00:16 b 281.0
2012-01-01 00:00:17 b 178.0
2012-01-01 00:00:18 b 276.0
2012-01-01 00:00:19 b 254.0
Downsampling:
A possible solution with Resampler.asfreq.
If you use asfreq, the behaviour is the same as aggregating with first:
print (df.groupby('a').resample('2S').first())
a b
a
a 2012-01-01 00:00:00 a 37.0
2012-01-01 00:00:02 a 396.0
2012-01-01 00:00:04 a NaN
2012-01-01 00:00:06 a NaN
2012-01-01 00:00:08 a 335.0
b 2012-01-01 00:00:10 b 144.0
2012-01-01 00:00:12 b 460.0
2012-01-01 00:00:14 b 237.0
2012-01-01 00:00:16 b 281.0
2012-01-01 00:00:18 b 276.0
print (df.groupby('a').resample('2S').first().fillna(0))
a b
a
a 2012-01-01 00:00:00 a 37.0
2012-01-01 00:00:02 a 396.0
2012-01-01 00:00:04 a 0.0
2012-01-01 00:00:06 a 0.0
2012-01-01 00:00:08 a 335.0
b 2012-01-01 00:00:10 b 144.0
2012-01-01 00:00:12 b 460.0
2012-01-01 00:00:14 b 237.0
2012-01-01 00:00:16 b 281.0
2012-01-01 00:00:18 b 276.0
print (df.groupby('a').resample('2S').asfreq().fillna(0))
a b
a
a 2012-01-01 00:00:00 a 37.0
2012-01-01 00:00:02 a 396.0
2012-01-01 00:00:04 a 0.0
2012-01-01 00:00:06 a 0.0
2012-01-01 00:00:08 a 335.0
b 2012-01-01 00:00:10 b 144.0
2012-01-01 00:00:12 b 460.0
2012-01-01 00:00:14 b 237.0
2012-01-01 00:00:16 b 281.0
2012-01-01 00:00:18 b 276.0
If you use replace, the other values are aggregated as with mean (at least in the pandas version used for these tests):
print (df.groupby('a').resample('2S').mean())
b
a
a 2012-01-01 00:00:00 136.0
2012-01-01 00:00:02 396.0
2012-01-01 00:00:04 NaN
2012-01-01 00:00:06 NaN
2012-01-01 00:00:08 391.5
b 2012-01-01 00:00:10 136.5
2012-01-01 00:00:12 265.5
2012-01-01 00:00:14 313.5
2012-01-01 00:00:16 229.5
2012-01-01 00:00:18 265.0
print (df.groupby('a').resample('2S').mean().fillna(0))
b
a
a 2012-01-01 00:00:00 136.0
2012-01-01 00:00:02 396.0
2012-01-01 00:00:04 0.0
2012-01-01 00:00:06 0.0
2012-01-01 00:00:08 391.5
b 2012-01-01 00:00:10 136.5
2012-01-01 00:00:12 265.5
2012-01-01 00:00:14 313.5
2012-01-01 00:00:16 229.5
2012-01-01 00:00:18 265.0
print (df.groupby('a').resample('2S').replace(np.nan,0))
b
a
a 2012-01-01 00:00:00 136.0
2012-01-01 00:00:02 396.0
2012-01-01 00:00:04 0.0
2012-01-01 00:00:06 0.0
2012-01-01 00:00:08 391.5
b 2012-01-01 00:00:10 136.5
2012-01-01 00:00:12 265.5
2012-01-01 00:00:14 313.5
2012-01-01 00:00:16 229.5
2012-01-01 00:00:18 265.0
Upsampling:
Using asfreq gives the same result as replace:
print (df.groupby('a').resample('200L').asfreq().fillna(0))
a b
a
a 2012-01-01 00:00:00.000 a 37.0
2012-01-01 00:00:00.200 0 0.0
2012-01-01 00:00:00.400 0 0.0
2012-01-01 00:00:00.600 0 0.0
2012-01-01 00:00:00.800 0 0.0
2012-01-01 00:00:01.000 a 235.0
2012-01-01 00:00:01.200 0 0.0
2012-01-01 00:00:01.400 0 0.0
2012-01-01 00:00:01.600 0 0.0
2012-01-01 00:00:01.800 0 0.0
2012-01-01 00:00:02.000 a 396.0
2012-01-01 00:00:02.200 0 0.0
2012-01-01 00:00:02.400 0 0.0
...
print (df.groupby('a').resample('200L').replace(np.nan,0))
b
a
a 2012-01-01 00:00:00.000 37.0
2012-01-01 00:00:00.200 0.0
2012-01-01 00:00:00.400 0.0
2012-01-01 00:00:00.600 0.0
2012-01-01 00:00:00.800 0.0
2012-01-01 00:00:01.000 235.0
2012-01-01 00:00:01.200 0.0
2012-01-01 00:00:01.400 0.0
2012-01-01 00:00:01.600 0.0
2012-01-01 00:00:01.800 0.0
2012-01-01 00:00:02.000 396.0
2012-01-01 00:00:02.200 0.0
2012-01-01 00:00:02.400 0.0
...
print ((df.groupby('a').resample('200L').replace(np.nan,0).b ==
df.groupby('a').resample('200L').asfreq().fillna(0).b).all())
True
Conclusion:
For downsampling, apply an aggregation function such as sum, first or mean; for upsampling, use asfreq.
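The conclusion can be sketched roughly as follows. This is an illustration only: it uses a small made-up frame without the groupby step, and assumes a recent pandas where an explicit aggregation (or asfreq) is required before fillna(0).

```python
import numpy as np
import pandas as pd

# Hypothetical daily data with a gap of two missing values.
idx = pd.date_range("2012-01-01", periods=6, freq="D")
df = pd.DataFrame({"b": [1.0, 3.0, np.nan, np.nan, 5.0, 7.0]}, index=idx)

# Downsampling: aggregate each 2-day bin first, then replace the
# NaN produced by bins that contain no valid values.
down = df.resample("2D").mean().fillna(0)

# Upsampling: asfreq() only reindexes to the finer frequency, so the
# new rows are NaN until they are filled.
sparse = df.iloc[::2]  # keep every second day
up = sparse.resample("D").asfreq().fillna(0)

print(down)
print(up)
```

Note that fillna(0) cannot tell a NaN that was already in the data from one created by resampling; both become 0.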
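Answer 2 (score: 1)
The only workaround that lets you use fillna directly is to call it after .head(len(df.index)).
I assume DF.head helps here mainly because, when a resample function is applied to a groupby object, it acts as a filter on the input and returns a reduced version of the original, since groups are eliminated.
Calling DF.head() is not affected by this transformation and returns the entire DF.
Demo: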
np.random.seed(42)
df = pd.DataFrame(np.random.randn(10, 2),
index=pd.date_range('1/1/2016', freq='10D', periods=10),
columns=['A', 'B']).reset_index()
df
index A B
0 2016-01-01 0.496714 -0.138264
1 2016-01-11 0.647689 1.523030
2 2016-01-21 -0.234153 -0.234137
3 2016-01-31 1.579213 0.767435
4 2016-02-10 -0.469474 0.542560
5 2016-02-20 -0.463418 -0.465730
6 2016-03-01 0.241962 -1.913280
7 2016-03-11 -1.724918 -0.562288
8 2016-03-21 -1.012831 0.314247
9 2016-03-31 -0.908024 -1.412304
Operations:
resampled_group = df[['index', 'A']].groupby(['index'])['A'].agg('count').resample('2D')
resampled_group.head(len(resampled_group.index)).fillna(0).head(20)
index
2016-01-01 1.0
2016-01-03 0.0
2016-01-05 0.0
2016-01-07 0.0
2016-01-09 0.0
2016-01-11 1.0
2016-01-13 0.0
2016-01-15 0.0
2016-01-17 0.0
2016-01-19 0.0
2016-01-21 1.0
2016-01-23 0.0
2016-01-25 0.0
2016-01-27 0.0
2016-01-29 0.0
2016-01-31 1.0
2016-02-02 0.0
2016-02-04 0.0
2016-02-06 0.0
2016-02-08 0.0
Freq: 2D, Name: A, dtype: float64
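Answer 3 (score: 0)
The problem here is that you are calling the fillna method on the DatetimeIndexResampler object returned by the resample method. If you call an aggregation function before fillna, it works, for example:
df.resample('1H').sum().fillna(0)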
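A small sketch of that distinction, using made-up data. It uses mean() rather than sum() so that the empty bins really do come out as NaN before being filled:

```python
import pandas as pd

# Hypothetical frame with two observations three days apart.
idx = pd.to_datetime(["2021-01-01", "2021-01-04"])
df = pd.DataFrame({"value": [1, 2]}, index=idx)

# resample() returns a deferred Resampler object, not a DataFrame,
# so DataFrame.fillna semantics do not apply to it yet.
resampler = df.resample("D")
print(type(resampler).__name__)

# After an aggregation the result is an ordinary DataFrame,
# and fillna(0) behaves as expected.
result = resampler.mean().fillna(0)
print(result)
```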
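Answer 4 (score: 0)
You can simply use sum().
According to https://pandas.pydata.org/docs/reference/api/pandas.core.resample.Resampler.sum.html
there is a min_count parameter, which defaults to 0. After resampling, a bin with fewer than min_count valid (non-NaN) values yields NaN. Because the default is 0, empty bins sum to 0, so there is nothing to replace or fill.
If you want to fill with a value other than 0, you can call .sum(min_count=1)
and then .replace(float('nan'), 'whatever you want').
An example: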
>>> import pandas as pd
>>> df = pd.DataFrame({'date': pd.date_range('2021-01-01', '2021-01-07', freq='3D'),
...                    'value': range(3)})
>>> df
date value
0 2021-01-01 0
1 2021-01-04 1
2 2021-01-07 2
>>> df.set_index('date').resample('D').sum().reset_index()
date value
0 2021-01-01 0
1 2021-01-02 0
2 2021-01-03 0
3 2021-01-04 1
4 2021-01-05 0
5 2021-01-06 0
6 2021-01-07 2
# to fill NaN with some other value instead; replace() also works, for example
# when more than one column needs replacing
>>> df.set_index('date').resample('D').sum(min_count=1).fillna(-10).reset_index()
date value
0 2021-01-01 0.0
1 2021-01-02 -10.0
2 2021-01-03 -10.0
3 2021-01-04 1.0
4 2021-01-05 -10.0
5 2021-01-06 -10.0
6 2021-01-07 2.0
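Applied to the count in the original question, the same reasoning suggests resampling first and counting afterwards, since count() reports 0 for empty bins. The sketch below reuses the column names from the question, but the data is invented and the approach is an assumption about what the asker wanted (sign-ons per day):

```python
import pandas as pd

# Hypothetical sign-on log using the column names from the question.
user = pd.DataFrame({
    "DOC_ACC_DT": pd.to_datetime(["2021-01-01", "2021-01-01", "2021-01-04"]),
    "USER_SIGNON_ID": ["u1", "u2", "u1"],
})

# Index by timestamp, resample to daily bins, then count the entries.
# Days without any rows get a count of 0, so no fillna is needed.
daily = user.set_index("DOC_ACC_DT")["USER_SIGNON_ID"].resample("D").count()
print(daily)
```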