If my df looks like this:
print(df)
A B C D E
DATE_TIME
2016-08-10 13:57:00 3.6 A 1 NaN NaN
2016-08-10 13:58:00 4.7 A 1 4.5 NaN
2016-08-10 13:59:00 3.4 A 0 NaN 5.7
2016-08-10 14:00:00 3.5 A 0 NaN NaN
2016-08-10 14:01:00 2.6 A 0 4.6 NaN
2016-08-10 14:02:00 4.8 A 0 NaN 4.3
2016-08-10 14:03:00 5.7 A 1 NaN NaN
2016-08-10 14:04:00 5.5 A 1 5.7 NaN
2016-08-10 14:05:00 5.6 A 1 NaN NaN
2016-08-10 14:06:00 7.8 A 1 NaN 5.2
2016-08-10 14:07:00 8.9 A 0 NaN NaN
2016-08-10 14:08:00 3.6 A 0 NaN NaN
print (df.dtypes)
A float64
B object
C int64
D float64
E float64
dtype: object
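Thanks to a lot of input from the community, I now have the following code, which upsamples my df to one-second intervals and applies a different fill method depending on each column's dtype: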
int_cols = df.select_dtypes(['int64']).columns
index = pd.date_range(df.index[0], df.index[-1], freq="s")
df2 = df.reindex(index)
for col in df2:
    if col == int_cols.all():
        df2[col].ffill(inplace=True)
        df2[col] = df2[col].astype(int)
    elif df2[col].dtype == float:
        df2[col].interpolate(inplace=True)
    else:
        df2[col].ffill(inplace=True)
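I am looking for a way to interpolate only between actual measurements. The interpolate function carries my last measurement forward all the way to the end of the df: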
df2.tail()
Out[75]:
A B C D E
2016-08-10 14:07:56 3.953333 A 0 5.7 5.2
2016-08-10 14:07:57 3.865000 A 0 5.7 5.2
2016-08-10 14:07:58 3.776667 A 0 5.7 5.2
2016-08-10 14:07:59 3.688333 A 0 5.7 5.2
2016-08-10 14:08:00 3.600000 A 0 5.7 5.2
But I would like this to stop at the last measurement (for example at 14:04:00 for col['D'] and at 14:06:00 for col['E']) and leave NaN after that.
I tried adding a 'limit' of zero and setting 'limit_direction' to 'both':
for col in df2:
    if col == int_cols.all():
        df2[col].ffill(inplace=True)
        df2[col] = df2[col].astype(int)
    elif df2[col].dtype == float:
        df2[col].interpolate(inplace=True, limit=0, limit_direction='both')
    else:
        df2[col].ffill(inplace=True)
But this did not change the output. I then tried to incorporate the solution I found in this question, "Pandas: interpolation where first and last data point in column is NaN", into my code:
for col in df2:
    if col == int_cols.all():
        df2[col].ffill(inplace=True)
        df2[col] = df2[col].astype(int)
    elif df2[col].dtype == float:
        df2[col].loc[df2[col].first_valid_index(): df2[col].last_valid_index()]=df2[col].loc[df2[col].first_valid_index(): df2[col].last_valid_index()].astype(float).interpolate(inplace=True)
    else:
        df2[col].ffill(inplace=True)
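...but this does not work: my float64 columns are now entirely NaN. Also, given the way I inserted the code, I know it would only affect the float columns. In an ideal solution I would like to apply this first_valid_index():last_valid_index() selection to the object and int64 columns as well. Can anybody help me? Thank you.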
Answer 0 (score: 4)
As of pandas 0.23.0, you can use the limit_area parameter of interpolate:
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [np.nan, 1.0, np.nan, np.nan, 4.0, np.nan, np.nan],
                   'B': [np.nan, np.nan, 0.0, np.nan, np.nan, 2.0, np.nan]},
                  columns=['A', 'B'],
                  index=pd.date_range(start='2016-08-10 13:50:00', periods=7, freq='s'))
print (df)
A B
2016-08-10 13:50:00 NaN NaN
2016-08-10 13:50:01 1.0 NaN
2016-08-10 13:50:02 NaN 0.0
2016-08-10 13:50:03 NaN NaN
2016-08-10 13:50:04 4.0 NaN
2016-08-10 13:50:05 NaN 2.0
2016-08-10 13:50:06 NaN NaN
df = df.interpolate(limit_direction='both', limit_area='inside')
print (df)
A B
2016-08-10 13:50:00 NaN NaN
2016-08-10 13:50:01 1.0 NaN
2016-08-10 13:50:02 2.0 0.000000
2016-08-10 13:50:03 3.0 0.666667
2016-08-10 13:50:04 4.0 1.333333
2016-08-10 13:50:05 NaN 2.000000
2016-08-10 13:50:06 NaN NaN
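To connect this back to the question, here is a minimal sketch of how limit_area='inside' might slot into the per-dtype upsampling. The small frame below is an illustrative stand-in for the asker's data (the values and the names float_cols and other_cols are assumptions, not taken from the original code), and it requires pandas 0.23.0 or later.

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for the question's minute-level data.
idx = pd.date_range("2016-08-10 13:57:00", periods=4, freq="min")
df = pd.DataFrame(
    {
        "A": [3.6, 4.7, 3.4, 3.5],
        "B": ["A", "A", "A", "A"],
        "C": [1, 1, 0, 0],
        "D": [np.nan, 4.5, 5.7, np.nan],
    },
    index=idx,
)

# Upsample to one-second resolution; the new rows are missing in every column.
index = pd.date_range(df.index[0], df.index[-1], freq="s")
df2 = df.reindex(index)

float_cols = df.select_dtypes(["float64"]).columns
other_cols = df.columns.difference(float_cols)

# Interpolate float columns only between valid observations,
# leaving leading and trailing gaps as NaN.
df2[float_cols] = df2[float_cols].interpolate(limit_area="inside")

# Forward-fill the remaining columns, then restore their original dtypes.
df2[other_cols] = df2[other_cols].ffill()
df2 = df2.astype({col: df[col].dtype for col in other_cols})

print(df2["D"].first_valid_index(), df2["D"].last_valid_index())
```

Column D is filled only between 13:58:00 and 13:59:00 here, while A is interpolated across the whole range because it has no leading or trailing gaps.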
Answer 1 (score: 2)
You are very close! Here is an example that is very similar to the code you posted at the end of your question:
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [np.nan, 1.0, np.nan, np.nan, 4.0, np.nan, np.nan],
                   'B': [np.nan, np.nan, 0.0, np.nan, np.nan, 2.0, np.nan]},
                  columns=['A', 'B'],
                  index=pd.date_range(start='2016-08-10 13:50:00', periods=7, freq='s'))
print(df)
A_first = df['A'].first_valid_index()
A_last = df['A'].last_valid_index()
df.loc[A_first:A_last, 'A'] = df.loc[A_first:A_last, 'A'].interpolate()
B_first = df['B'].first_valid_index()
B_last = df['B'].last_valid_index()
df.loc[B_first:B_last, 'B'] = df.loc[B_first:B_last, 'B'].interpolate()
print(df)
Result:
A B
2016-08-10 13:50:00 NaN NaN
2016-08-10 13:50:01 1.0 NaN
2016-08-10 13:50:02 NaN 0.0
2016-08-10 13:50:03 NaN NaN
2016-08-10 13:50:04 4.0 NaN
2016-08-10 13:50:05 NaN 2.0
2016-08-10 13:50:06 NaN NaN
A B
2016-08-10 13:50:00 NaN NaN
2016-08-10 13:50:01 1.0 NaN
2016-08-10 13:50:02 2.0 0.000000
2016-08-10 13:50:03 3.0 0.666667
2016-08-10 13:50:04 4.0 1.333333
2016-08-10 13:50:05 NaN 2.000000
2016-08-10 13:50:06 NaN NaN
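The two problems in your code are:
First, if you are going to write
df[...] = df[...].interpolate()
you need to remove inplace=True, because that makes interpolate return None. This is your main problem and the reason you end up with all NaNs.
Second, avoid chained indexing when assigning. You want:
df.loc[A_first:A_last, 'A'] = df.loc[A_first:A_last, 'A'].interpolate()
not:
df['A'].loc[A_first:A_last] = df['A'].loc[A_first:A_last].interpolate()
For details, see: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy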
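The per-column pattern above can be wrapped in a loop so it does not have to be repeated for every column. The following is only a sketch: the helper name interpolate_between_valid is hypothetical, and it assumes every column is numeric (object columns would need a forward fill instead of interpolate).

```python
import numpy as np
import pandas as pd


def interpolate_between_valid(df):
    """Interpolate each column only between its first and last valid values."""
    out = df.copy()
    for col in out.columns:
        first = out[col].first_valid_index()
        last = out[col].last_valid_index()
        if first is None:
            # The column is entirely NaN, so there is nothing to interpolate.
            continue
        # One .loc call with row and column labels avoids chained indexing,
        # and the result is assigned back rather than using inplace=True.
        out.loc[first:last, col] = out.loc[first:last, col].interpolate()
    return out


df = pd.DataFrame(
    {"A": [np.nan, 1.0, np.nan, np.nan, 4.0, np.nan, np.nan],
     "B": [np.nan, np.nan, 0.0, np.nan, np.nan, 2.0, np.nan]},
    index=pd.date_range(start="2016-08-10 13:50:00", periods=7, freq="s"),
)
result = interpolate_between_valid(df)
print(result)
```

Working on a copy keeps the input frame untouched, which makes it easy to compare the result with the original.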
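Answer 2 (score: 1)
You can backfill the nulls and then use boolean indexing to find the rows that are still null in each column (these must be the trailing nulls).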
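for col in ['D', 'E']:
    # Rows that are still null after a backfill are the trailing nulls.
    idx = df[df[col].bfill().isnull()].index
    # Assign the result instead of relying on a chained inplace call.
    df[col] = df[col].ffill()
    df.loc[idx, col] = None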
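The same trailing-null mask also works with interpolate in place of a forward fill, which is closer to what the question asks for on float columns. This is a sketch on a made-up single-column frame (the values and the name trailing are assumptions for illustration).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"D": [np.nan, 4.5, np.nan, 5.7, np.nan, np.nan]},
    index=pd.date_range("2016-08-10 14:01:00", periods=6, freq="min"),
)

# Rows that are still NaN after a backfill have no later observation,
# so they form the trailing gap.
trailing = df["D"].bfill().isna()

# Interpolate the whole column, then blank out the trailing rows again.
df["D"] = df["D"].interpolate().mask(trailing)
print(df)
```

With the default settings, interpolate leaves the leading NaN alone, so only the trailing rows need to be masked afterwards.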