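I have a CSV dataset that I am trying to use with sklearn. The goal is to predict future web traffic. However, my dataset contains zeros on days with no visitors, and I want to keep those values. There are more days with zero visitors than days with visitors (it is a tiny site). Here is the data:
Col1 is the date:
10/1/11
10/2/11
10/3/11
etc....
Col2 is the number of visitors:
12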
1
0
0
1
5
0
0
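etc....
sklearn seems to interpret the zero values as NaN values, which is understandable. How can I use these zero values in a logistic function (is that even possible)?
Update: The estimator is https://github.com/facebookincubator/prophet, and when I run the following: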
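import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from fbprophet import Prophet
df = pd.read_csv('~/tmp/datafile.csv')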
df['y'] = np.log(df['y'])
df.head()
m = Prophet()
m.fit(df);
future = m.make_future_dataframe(periods=365)
future.tail()
forecast = m.predict(future)
forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail()
m.plot(forecast);
m.plot_components(forecast);
plt.show()
I get the following:
growthprediction.py:7: RuntimeWarning: divide by zero encountered in log
df['y'] = np.log(df['y'])
/usr/local/lib/python3.6/site-packages/fbprophet/forecaster.py:307: RuntimeWarning: invalid value encountered in double_scalars
k = (df['y_scaled'].ix[i1] - df['y_scaled'].ix[i0]) / T
Traceback (most recent call last):
File "growthprediction.py", line 11, in <module>
m.fit(df);
File "/usr/local/lib/python3.6/site-packages/fbprophet/forecaster.py", line 387, in fit
params = model.optimizing(dat, init=stan_init, iter=1e4)
File "/usr/local/lib/python3.6/site-packages/pystan/model.py", line 508, in optimizing
ret, sample = fit._call_sampler(stan_args)
File "stanfit4anon_model_35bf14a7f93814266f16b4cf48b40a5a_4758371668158283666.pyx", line 804, in stanfit4anon_model_35bf14a7f93814266f16b4cf48b40a5a_4758371668158283666.StanFit4Model._call_sampler (/var/folders/ym/m6j7kw0d3kj_0frscrtp58800000gn/T/tmp5wq7qltr/stanfit4anon_model_35bf14a7f93814266f16b4cf48b40a5a_4758371668158283666.cpp:16585)
File "stanfit4anon_model_35bf14a7f93814266f16b4cf48b40a5a_4758371668158283666.pyx", line 398, in stanfit4anon_model_35bf14a7f93814266f16b4cf48b40a5a_4758371668158283666._call_sampler (/var/folders/ym/m6j7kw0d3kj_0frscrtp58800000gn/T/tmp5wq7qltr/stanfit4anon_model_35bf14a7f93814266f16b4cf48b40a5a_4758371668158283666.cpp:8818)
RuntimeError: k initialized to invalid value (nan)
Answer 0 (score: 1)
In this line of your code:
df['y'] = np.log(df['y'])
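wherever df['y'] is zero you are taking the logarithm of 0, which is undefined. NumPy emits the "divide by zero" warning and returns -inf for those rows, and these non-finite values then lead to the NaN that Prophet complains about when it fits the model.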
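To see the effect in isolation, here is a minimal sketch that applies np.log to the first few visitor counts listed in the question. The Series name and the use of np.errstate to silence the warning are illustrative choices, not part of the original script.

```python
import numpy as np
import pandas as pd

# Sample visitor counts copied from the question (illustrative subset).
visitors = pd.Series([12, 1, 0, 0, 1, 5, 0, 0], dtype=float)

# Silence the "divide by zero" RuntimeWarning only for this demonstration.
with np.errstate(divide="ignore"):
    logged = np.log(visitors)

# Count the rows that are no longer finite after the transform.
n_non_finite = int((~np.isfinite(logged)).sum())

print(logged.tolist())
print("non-finite rows:", n_non_finite)
```

Every row whose count was zero ends up non-finite, so it is worth checking np.isfinite on the column before passing it to any model.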
sklearn itself does not interpret zero values as NaN, unless you replace them with NaN during preprocessing.
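One common way to keep the zeros while still working on a log scale is to use np.log1p, which computes log(1 + y), and to undo it later with np.expm1. This is only a sketch of that workaround, not something the original script did; the small DataFrame and the y_log column name are made up for illustration, with ds and y following the column names used in the question.

```python
import numpy as np
import pandas as pd

# Hypothetical sample mirroring the question's layout: ds = date, y = visitors.
df = pd.DataFrame({
    "ds": pd.date_range("2011-10-01", periods=8, freq="D"),
    "y": [12, 1, 0, 0, 1, 5, 0, 0],
})

# log1p(y) = log(1 + y), so a zero count maps to 0.0 instead of -inf.
df["y_log"] = np.log1p(df["y"])

# expm1 is the exact inverse of log1p; use it to return to visitor counts.
recovered = np.expm1(df["y_log"])

print(df)
print(recovered.round().tolist())
```

If the model is fitted on the transformed column, the forecast columns (yhat, yhat_lower, yhat_upper) would need the same np.expm1 step to be read as visitor counts again. Skipping the log transform entirely is also an option if the raw counts behave well enough.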