Question

Suppose my df looks like this:

print(df)
                       A  B  C    D    E
 DATE_TIME                               
2016-08-10 13:57:00  3.6  A  1  NaN  NaN
2016-08-10 13:58:00  4.7  A  1  4.5  NaN
2016-08-10 13:59:00  3.4  A  0  NaN  5.7
2016-08-10 14:00:00  3.5  A  0  NaN  NaN
2016-08-10 14:01:00  2.6  A  0  4.6  NaN
2016-08-10 14:02:00  4.8  A  0  NaN  4.3
2016-08-10 14:03:00  5.7  A  1  NaN  NaN
2016-08-10 14:04:00  5.5  A  1  5.7  NaN
2016-08-10 14:05:00  5.6  A  1  NaN  NaN
2016-08-10 14:06:00  7.8  A  1  NaN  5.2
2016-08-10 14:07:00  8.9  A  0  NaN  NaN
2016-08-10 14:08:00  3.6  A  0  NaN  NaN

print (df.dtypes)
A    float64
B     object
C      int64
D    float64
E    float64
dtype: object

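Thanks to a lot of input from the community, I now have this code, which lets me upsample my df to one-second intervals while applying different methods to different dtypes: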

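import pandas as pd

int_cols = df.select_dtypes(['int64']).columns
index = pd.date_range(df.index[0], df.index[-1], freq="s")
df2 = df.reindex(index)

for col in df2:
    if col in int_cols:
        # integer columns: forward-fill, then restore the integer dtype
        df2[col].ffill(inplace=True)
        df2[col] = df2[col].astype(int)
    elif df2[col].dtype == float:
        # float columns: linear interpolation
        df2[col].interpolate(inplace=True)
    else:
        # everything else (e.g. object columns): forward-fill
        df2[col].ffill(inplace=True)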

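I am looking for a way to interpolate only between actual measurements. The interpolate function extends my last measurement to the end of the df: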

 df2.tail()
Out[75]: 
                            A  B  C    D    E
2016-08-10 14:07:56  3.953333  A  0  5.7  5.2
2016-08-10 14:07:57  3.865000  A  0  5.7  5.2
2016-08-10 14:07:58  3.776667  A  0  5.7  5.2
2016-08-10 14:07:59  3.688333  A  0  5.7  5.2
2016-08-10 14:08:00  3.600000  A  0  5.7  5.2

But I would like this to stop at the last measurement (for example at 14:04:00 for col['D'] and at 14:06:00 for col['E']) and leave the NaNs after it.

I tried adding 'limit' with a value of zero and setting 'limit_direction' to 'both':

for col in df2:
    if col in int_cols:
        df2[col].ffill(inplace=True)
        df2[col] = df2[col].astype(int)
    elif df2[col].dtype == float:
        df2[col].interpolate(inplace=True, limit=0, limit_direction='both')
    else:
        df2[col].ffill(inplace=True)

But this did not change the output. I then tried to incorporate the solution I found in this question, "Pandas: interpolation where first and last data point in column is NaN", into my code:

for col in df2:
    if col in int_cols:
        df2[col].ffill(inplace=True)
        df2[col] = df2[col].astype(int)
    elif df2[col].dtype == float:
        df2[col].loc[df2[col].first_valid_index(): df2[col].last_valid_index()]=df2[col].loc[df2[col].first_valid_index(): df2[col].last_valid_index()].astype(float).interpolate(inplace=True)
    else:
        df2[col].ffill(inplace=True)

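...but this does not work: my float64 columns are now purely NaN. Also, the way I inserted the code, I know it only affects the float columns. In an ideal solution I would like to apply this first_valid_index():last_valid_index() restriction to the object and int64 columns as well. Can anyone help me? Thank you.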

Answer 1

Since pandas 0.23.0, you can use the limit_area parameter of interpolate:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [np.nan, 1.0, np.nan, np.nan, 4.0, np.nan, np.nan],
                   'B': [np.nan, np.nan, 0.0, np.nan, np.nan, 2.0, np.nan]},
                  columns=['A', 'B'], 
                  index=pd.date_range(start='2016-08-10 13:50:00', periods=7, freq='s'))
print (df)
                       A    B
2016-08-10 13:50:00  NaN  NaN
2016-08-10 13:50:01  1.0  NaN
2016-08-10 13:50:02  NaN  0.0
2016-08-10 13:50:03  NaN  NaN
2016-08-10 13:50:04  4.0  NaN
2016-08-10 13:50:05  NaN  2.0
2016-08-10 13:50:06  NaN  NaN

df = df.interpolate(limit_direction='both', limit_area='inside')
print (df)
                       A         B
2016-08-10 13:50:00  NaN       NaN
2016-08-10 13:50:01  1.0       NaN
2016-08-10 13:50:02  2.0  0.000000
2016-08-10 13:50:03  3.0  0.666667
2016-08-10 13:50:04  4.0  1.333333
2016-08-10 13:50:05  NaN  2.000000
2016-08-10 13:50:06  NaN       NaN
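
The same parameter can be applied right after the upsampling step from the question. The following is only a minimal sketch: the frame, its values, and the 20-second target frequency are made up for illustration (they are not the asker's data), and it assumes pandas 0.23.0 or later.

```python
import numpy as np
import pandas as pd

# Illustrative minute-level data (hypothetical values, not the asker's frame).
idx = pd.date_range("2016-08-10 13:57:00", periods=4, freq="min")
df = pd.DataFrame({"A": [3.0, 6.0, 9.0, 12.0],
                   "D": [np.nan, 4.0, 7.0, np.nan]}, index=idx)

# Upsample to 20-second steps; the newly created rows are NaN everywhere.
new_index = pd.date_range(df.index[0], df.index[-1], freq="20s")
df2 = df.reindex(new_index)

# limit_area='inside' fills only NaNs that are surrounded by valid values,
# so the NaNs before the first and after the last measurement of D remain.
df2 = df2.interpolate(limit_area="inside")
print(df2)
```

In this sketch column A is filled completely because it has measurements at both ends, while column D keeps its leading and trailing NaNs.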

Answer 2

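You were very close! Here is an example that illustrates the approach; it is very similar to the code you posted at the end of your question: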

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [np.nan, 1.0, np.nan, np.nan, 4.0, np.nan, np.nan],
                   'B': [np.nan, np.nan, 0.0, np.nan, np.nan, 2.0, np.nan]},
                  columns=['A', 'B'], 
                  index=pd.date_range(start='2016-08-10 13:50:00', periods=7, freq='s'))
print(df)

# interpolate column A only between its first and last valid values
A_first = df['A'].first_valid_index()
A_last = df['A'].last_valid_index()
df.loc[A_first:A_last, 'A'] = df.loc[A_first:A_last, 'A'].interpolate()

# same for column B
B_first = df['B'].first_valid_index()
B_last = df['B'].last_valid_index()
df.loc[B_first:B_last, 'B'] = df.loc[B_first:B_last, 'B'].interpolate()

print(df)

Result:

                       A    B
2016-08-10 13:50:00  NaN  NaN
2016-08-10 13:50:01  1.0  NaN
2016-08-10 13:50:02  NaN  0.0
2016-08-10 13:50:03  NaN  NaN
2016-08-10 13:50:04  4.0  NaN
2016-08-10 13:50:05  NaN  2.0
2016-08-10 13:50:06  NaN  NaN

                       A         B
2016-08-10 13:50:00  NaN       NaN
2016-08-10 13:50:01  1.0       NaN
2016-08-10 13:50:02  2.0  0.000000
2016-08-10 13:50:03  3.0  0.666667
2016-08-10 13:50:04  4.0  1.333333
2016-08-10 13:50:05  NaN  2.000000
2016-08-10 13:50:06  NaN       NaN

The two problems in your code are:

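1. If you are going to do df[...] = df[...].interpolate(), you need to remove inplace=True, because with it the call returns None. This is your main problem and the reason you are getting all NaNs.
2. Although it appears to work here, chained indexing is bad practice in general: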

You want:

df.loc[A_first:A_last, 'A'] = df.loc[A_first:A_last, 'A'].interpolate()

not:

df['A'].loc[A_first:A_last] = df['A'].loc[A_first:A_last].interpolate()

For details, see here: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
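
The per-column pattern above can be wrapped in a small helper so that it also covers non-float columns, which the question asks about. This is a sketch under assumptions: fill_inside and its how argument are hypothetical names introduced here, and the example frame is made up.

```python
import numpy as np
import pandas as pd

def fill_inside(s, how):
    """Fill NaNs only between the first and last valid entries of s.

    Hypothetical helper; how is either 'interpolate' or 'ffill'.
    """
    first, last = s.first_valid_index(), s.last_valid_index()
    out = s.copy()
    if first is None:  # column is entirely NaN: nothing to fill
        return out
    inner = out.loc[first:last]
    # No inplace=True: the returned Series is needed for the assignment.
    out.loc[first:last] = inner.interpolate() if how == "interpolate" else inner.ffill()
    return out

# Made-up example frame with one float and one object column.
df = pd.DataFrame(
    {"A": [np.nan, 1.0, np.nan, np.nan, 4.0, np.nan],
     "B": [None, "x", None, "y", None, None]},
    index=pd.date_range("2016-08-10 13:50:00", periods=6, freq="s"),
)

for col in df.columns:
    # floats are interpolated, everything else is forward-filled
    how = "interpolate" if df[col].dtype == float else "ffill"
    df[col] = fill_inside(df[col], how)

print(df)
```

Both columns keep their leading and trailing missing values; only the gaps between the first and last valid entries are filled.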

Answer 3

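You can backfill the null values and then use boolean indexing to find the entries in each column that are still null (these must be the trailing nulls).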

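for col in ['D', 'E']:
    # rows that are still null after a backfill lie after the last valid value
    idx = df[df[col].bfill().isnull()].index
    # assign the result rather than calling ffill(inplace=True) on a column selection
    df[col] = df[col].ffill()
    # restore the trailing nulls
    df.loc[idx, col] = None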
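
The same masking idea also works with interpolation in place of a forward fill, and a symmetric mask catches the leading nulls. A minimal sketch on a made-up Series (the names trailing, leading, and filled are illustrative):

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1.0, np.nan, 4.0, np.nan, np.nan])

# Still NaN after a backfill: positions after the last valid value.
trailing = s.bfill().isna()
# Still NaN after a forward fill: positions before the first valid value.
leading = s.ffill().isna()

# Fill everything, then blank out the leading and trailing positions again.
filled = s.interpolate(limit_direction="both").mask(leading | trailing)
print(filled)
```

Only the gap between 1.0 and 4.0 is filled; the values outside the measured range stay NaN.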

How to interpolate only between values (stopping before and after the last NaN in a column) using pandas?
