Question

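I have some time series with gaps, stored as pd.Series. How can I efficiently get the "last uninterrupted" sequence of data points, i.e. the trailing run that contains no NaN values?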

My original series might look like this:

2014-12-01    500
2015-02-01    700
2015-03-01    700
dtype: float64

I can easily convert this into a regularly spaced series with pd.Series.asfreq; for example, series.asfreq('MS') gives:

2014-12-01    500
2015-01-01    NaN
2015-02-01    700
2015-03-01    700
dtype: float64
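To reproduce the setup, here is a minimal sketch. The values and dates are taken from the example above; the variable name `regular` is only illustrative.

```python
import pandas as pd

# Irregular monthly series: January 2015 is missing
series = pd.Series(
    [500.0, 700.0, 700.0],
    index=pd.to_datetime(["2014-12-01", "2015-02-01", "2015-03-01"]),
)

# Reindex to month-start frequency; months without data become NaN
regular = series.asfreq("MS")
print(regular)
```

The regular series now has four rows, with NaN at 2015-01-01 marking the gap.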

In this case, I want the series from 2015-02-01 onward:

2015-02-01    700
2015-03-01    700
dtype: float64

This is what I came up with, but it looks ugly:

# Let i be the first position we're getting, default to entire series
i = 0

# Find any NaN values in the Series
nan_index = series[series.isnull()].index
if len(nan_index):
    # Find the position of the last null value in the original
    # series (+ 1 to skip it)
    i = series.index.get_loc(nan_index[-1]) + 1

series.iloc[i:]

Answer 1

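One possible trick is to select the entries that are non-null and for which the cumsum of the null flags already equals the total number of nulls, i.e. every null lies before them. The selection can then be done with fancy (boolean) indexing.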

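This is exactly the kind of "clever trick" that Dijkstra would probably tell us all to avoid, since it is not readable and can break in subtle ways (for example, it assumes the index is already sorted the way you need). I don't see anything wrong with the less concise but more direct solution, such as computing the position of the final null directly, unless you have profiled it and determined that it is a major performance problem.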

In [35]: s
Out[35]: 
2014-12-01    500
2015-02-01    700
2015-03-01    700
dtype: int64

In [36]: s_ms = s.asfreq('MS')

In [37]: s_ms_null = s_ms.isnull()

In [38]: s[~s_ms_null & (s_ms_null.cumsum() == s_ms_null.sum())]
Out[38]: 
2015-02-01    700
2015-03-01    700
dtype: int64
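For comparison, the more direct approach mentioned above (locating the last null and slicing after it) could be sketched as follows. The helper name `last_complete_run` is hypothetical, not part of pandas, and this is one possible way to write it rather than a definitive implementation.

```python
import numpy as np
import pandas as pd


def last_complete_run(s: pd.Series) -> pd.Series:
    """Return the trailing part of `s` that follows its last NaN."""
    # Integer positions of all null entries
    null_positions = np.flatnonzero(s.isnull().to_numpy())
    if null_positions.size == 0:
        return s  # no gaps: the whole series qualifies
    # Slice positionally, starting just after the last null
    return s.iloc[null_positions[-1] + 1:]


s = pd.Series(
    [500.0, 700.0, 700.0],
    index=pd.to_datetime(["2014-12-01", "2015-02-01", "2015-03-01"]),
)
result = last_complete_run(s.asfreq("MS"))
print(result)
```

Because the slice is purely positional, it returns whatever follows the last NaN in the series' current order. If the series ends with a NaN, the result is empty.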

Get the last complete sequence of a pandas Series