I am using pandas 0.17.0 and have a df similar to this one:
df.head()
Out[339]:
A B C
DATE_TIME
2016-10-08 13:57:00 in 5.61 1
2016-10-08 14:02:00 in 8.05 1
2016-10-08 14:07:00 in 7.92 0
2016-10-08 14:12:00 in 7.98 0
2016-10-08 14:17:00 out 8.18 0
df.tail()
Out[340]:
A B C
DATE_TIME
2016-11-08 13:42:00 in 8.00 0
2016-11-08 13:47:00 in 7.99 0
2016-11-08 13:52:00 out 7.97 0
2016-11-08 13:57:00 in 8.14 1
2016-11-08 14:02:00 in 8.16 1
with the following dtypes:
print (df.dtypes)
A object
B float64
C int64
dtype: object
When I reindex my df to one-minute intervals, all int64 columns change to float64.
index = pd.date_range(df.index[0], df.index[-1], freq="min")
df2 = df.reindex(index)
print (df2.dtypes)
A object
B float64
C float64
dtype: object
Also, if I try to resample instead
df3 = df.resample('Min')
the int64 columns become float64 and, for some reason, I lose the object column.
print (df3.dtypes)
B float64
C float64
dtype: object
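In a later step (after concatenating the df with another df) I want to interpolate the columns differently depending on this distinction, so I need them to keep their original dtype. My real df has many more columns of each type, so I am looking for a solution that does not depend on calling the columns individually by label.
Is there a way to preserve the dtype across the reindexing? Or is there a way to assign the dtype afterwards (these are the only columns that contain nothing but integers apart from NaN)?
Can anybody help me with this?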
Answer 0 (score: 6)
It is impossible, because an int column is converted to float as soon as it contains at least one NaN value.
index = pd.date_range(df.index[0], df.index[-1], freq="min")
df2 = df.reindex(index)
print (df2)
A B C
2016-10-08 13:57:00 in 5.61 1.0
2016-10-08 13:58:00 NaN NaN NaN
2016-10-08 13:59:00 NaN NaN NaN
2016-10-08 14:00:00 NaN NaN NaN
2016-10-08 14:01:00 NaN NaN NaN
2016-10-08 14:02:00 in 8.05 1.0
2016-10-08 14:03:00 NaN NaN NaN
2016-10-08 14:04:00 NaN NaN NaN
2016-10-08 14:05:00 NaN NaN NaN
2016-10-08 14:06:00 NaN NaN NaN
2016-10-08 14:07:00 in 7.92 0.0
2016-10-08 14:08:00 NaN NaN NaN
2016-10-08 14:09:00 NaN NaN NaN
2016-10-08 14:10:00 NaN NaN NaN
2016-10-08 14:11:00 NaN NaN NaN
2016-10-08 14:12:00 in 7.98 0.0
2016-10-08 14:13:00 NaN NaN NaN
2016-10-08 14:14:00 NaN NaN NaN
2016-10-08 14:15:00 NaN NaN NaN
2016-10-08 14:16:00 NaN NaN NaN
2016-10-08 14:17:00 out 8.18 0.0
print (df2.dtypes)
A object
B float64
C float64
dtype: object
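To see the upcast in isolation, here is a minimal sketch on a tiny made-up Series (the values and labels are illustrative and not taken from the question's data):

```python
import pandas as pd

# Toy integer Series; values and labels are made up for illustration.
s = pd.Series([1, 0, 1], index=[0, 2, 4], dtype="int64")

# Reindexing onto labels that do not exist yet introduces NaN ...
r = s.reindex(list(range(5)))

# ... and NaN is a float value, so the whole column is upcast.
print(s.dtype)  # int64
print(r.dtype)  # float64
```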
But if you use the fill_value parameter in reindex, the dtypes do not change:
index = pd.date_range(df.index[0], df.index[-1], freq="min")
df2 = df.reindex(index, fill_value=0)
print (df2)
A B C
2016-10-08 13:57:00 in 5.61 1
2016-10-08 13:58:00 0 0.00 0
2016-10-08 13:59:00 0 0.00 0
2016-10-08 14:00:00 0 0.00 0
2016-10-08 14:01:00 0 0.00 0
2016-10-08 14:02:00 in 8.05 1
2016-10-08 14:03:00 0 0.00 0
2016-10-08 14:04:00 0 0.00 0
2016-10-08 14:05:00 0 0.00 0
2016-10-08 14:06:00 0 0.00 0
2016-10-08 14:07:00 in 7.92 0
2016-10-08 14:08:00 0 0.00 0
2016-10-08 14:09:00 0 0.00 0
2016-10-08 14:10:00 0 0.00 0
2016-10-08 14:11:00 0 0.00 0
2016-10-08 14:12:00 in 7.98 0
2016-10-08 14:13:00 0 0.00 0
2016-10-08 14:14:00 0 0.00 0
2016-10-08 14:15:00 0 0.00 0
2016-10-08 14:16:00 0 0.00 0
2016-10-08 14:17:00 out 8.18 0
print (df2.dtypes)
A object
B float64
C int64
dtype: object
A better option is to use method='ffill' in reindex:
index = pd.date_range(df.index[0], df.index[-1], freq="min")
df2 = df.reindex(index, method='ffill')
print (df2)
A B C
2016-10-08 13:57:00 in 5.61 1
2016-10-08 13:58:00 in 5.61 1
2016-10-08 13:59:00 in 5.61 1
2016-10-08 14:00:00 in 5.61 1
2016-10-08 14:01:00 in 5.61 1
2016-10-08 14:02:00 in 8.05 1
2016-10-08 14:03:00 in 8.05 1
2016-10-08 14:04:00 in 8.05 1
2016-10-08 14:05:00 in 8.05 1
2016-10-08 14:06:00 in 8.05 1
2016-10-08 14:07:00 in 7.92 0
2016-10-08 14:08:00 in 7.92 0
2016-10-08 14:09:00 in 7.92 0
2016-10-08 14:10:00 in 7.92 0
2016-10-08 14:11:00 in 7.92 0
2016-10-08 14:12:00 in 7.98 0
2016-10-08 14:13:00 in 7.98 0
2016-10-08 14:14:00 in 7.98 0
2016-10-08 14:15:00 in 7.98 0
2016-10-08 14:16:00 in 7.98 0
2016-10-08 14:17:00 out 8.18 0
print (df2.dtypes)
A object
B float64
C int64
dtype: object
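For readers who want to reproduce this without the original data, the following is a small self-contained sketch. The three-row frame is a hypothetical stand-in for the question's df, and the final check compares the numeric dtypes before and after programmatically rather than by eye:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the question's df (three rows, 5 minutes apart).
idx = pd.to_datetime(["2016-10-08 13:57:00",
                      "2016-10-08 14:02:00",
                      "2016-10-08 14:07:00"])
df = pd.DataFrame({"A": ["in", "in", "out"],
                   "B": [5.61, 8.05, 7.92],
                   "C": np.array([1, 1, 0], dtype="int64")},
                  index=idx)

# Reindex to one-minute steps; every new row copies the last known row,
# so no NaN is ever introduced and the integer column stays integer.
full_index = pd.date_range(df.index[0], df.index[-1], freq="min")
filled = df.reindex(full_index, method="ffill")

# Compare the dtypes of the numeric columns before and after.
same = (filled[["B", "C"]].dtypes == df[["B", "C"]].dtypes).all()
print(filled.dtypes)
print(same)
```

Note that method='ffill' requires the original index to be sorted (monotonic), which holds for a time series like this one.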
If you use resample, you can get column A back with unstack and stack, but unfortunately the problem with float remains:
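# Parentheses are needed so the method chain can span several lines.
# fill_method is the pandas 0.17 signature of resample.
df3 = (df.set_index('A', append=True)
         .unstack()
         .resample('Min', fill_method='ffill')
         .stack()
         .reset_index(level=1))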
print (df3)
A B C
DATE_TIME
2016-10-08 13:57:00 in 5.61 1.0
2016-10-08 13:58:00 in 5.61 1.0
2016-10-08 13:59:00 in 5.61 1.0
2016-10-08 14:00:00 in 5.61 1.0
2016-10-08 14:01:00 in 5.61 1.0
2016-10-08 14:02:00 in 8.05 1.0
2016-10-08 14:03:00 in 8.05 1.0
2016-10-08 14:04:00 in 8.05 1.0
2016-10-08 14:05:00 in 8.05 1.0
2016-10-08 14:06:00 in 8.05 1.0
2016-10-08 14:07:00 in 7.92 0.0
2016-10-08 14:08:00 in 7.92 0.0
2016-10-08 14:09:00 in 7.92 0.0
2016-10-08 14:10:00 in 7.92 0.0
2016-10-08 14:11:00 in 7.92 0.0
2016-10-08 14:12:00 in 7.98 0.0
2016-10-08 14:13:00 in 7.98 0.0
2016-10-08 14:14:00 in 7.98 0.0
2016-10-08 14:15:00 in 7.98 0.0
2016-10-08 14:16:00 in 7.98 0.0
2016-10-08 14:17:00 out 8.18 0.0
print (df3.dtypes)
A object
B float64
C float64
dtype: object
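As an aside that goes beyond the pandas 0.17.0 used in the question: newer pandas releases (0.24 and later) ship a nullable integer extension dtype, "Int64" with a capital I, which can hold missing values without the upcast to float. The sketch below assumes such a newer version is available and uses made-up values:

```python
import pandas as pd

# Assumes pandas >= 0.24; "Int64" (capital I) is the nullable integer dtype.
s = pd.Series([1, 0, 1], index=[0, 2, 4]).astype("Int64")

# Missing rows become <NA>, and the column stays an integer column.
r = s.reindex(list(range(5)))
print(r.dtype)

# After filling, it can be cast back to a plain NumPy int64 if needed.
plain = r.ffill().astype("int64")
print(plain.dtype)
```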
I tried to modify a previous answer so that it casts to int:
int_cols = df.select_dtypes(['int64']).columns
print (int_cols)
Index(['C'], dtype='object')
index = pd.date_range(df.index[0], df.index[-1], freq="s")
df2 = df.reindex(index)
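for col in df2:
    if col in int_cols:
        # originally integer columns: forward-fill, then cast back to int
        df2[col] = df2[col].ffill().astype(int)
    elif df2[col].dtype == float:
        # float columns: interpolate between the known values
        df2[col] = df2[col].interpolate()
    else:
        # everything else (e.g. object columns): forward-fill
        df2[col] = df2[col].ffill()
#print (df2)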
print (df2.dtypes)
A object
B float64
C int32
dtype: object
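The same idea can be wrapped in a small function that reads the original dtypes from df.dtypes, so no column has to be named by label. This is only a sketch: reindex_keep_dtypes is a hypothetical helper name, the sample frame is made up, and it assumes the new index starts at a row that exists in the original (otherwise the leading NaN cannot be cast to int). Casting back to the recorded dtype instead of plain int also avoids a platform-dependent result such as the int32 shown above.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_float_dtype, is_integer_dtype


def reindex_keep_dtypes(df, new_index):
    """Hypothetical helper: reindex, then fill each column by its original dtype."""
    out = df.reindex(new_index)
    for col, dtype in df.dtypes.items():
        if is_integer_dtype(dtype):
            # Integer columns: forward-fill, then restore the recorded dtype.
            out[col] = out[col].ffill().astype(dtype)
        elif is_float_dtype(dtype):
            # Float columns: interpolate linearly between known values.
            out[col] = out[col].interpolate()
        else:
            # Anything else (e.g. object columns): forward-fill.
            out[col] = out[col].ffill()
    return out


# Usage on a made-up frame shaped like the question's df.
idx = pd.to_datetime(["2016-10-08 13:57:00",
                      "2016-10-08 14:02:00",
                      "2016-10-08 14:07:00"])
df = pd.DataFrame({"A": ["in", "in", "out"],
                   "B": [5.61, 8.05, 7.92],
                   "C": np.array([1, 1, 0], dtype="int64")},
                  index=idx)
index = pd.date_range(df.index[0], df.index[-1], freq="min")
df2 = reindex_keep_dtypes(df, index)
print(df2.dtypes)
```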