How to split a pandas time series on NaN values

Asked: 2014-01-28 10:02:30

Tags: python numpy split pandas time-series

I have a pandas TimeSeries that looks like this:

2007-02-06 15:00:00    0.780
2007-02-06 16:00:00    0.125
2007-02-06 17:00:00    0.875
2007-02-06 18:00:00      NaN
2007-02-06 19:00:00    0.565
2007-02-06 20:00:00    0.875
2007-02-06 21:00:00    0.910
2007-02-06 22:00:00    0.780
2007-02-06 23:00:00      NaN
2007-02-07 00:00:00      NaN
2007-02-07 01:00:00    0.780
2007-02-07 02:00:00    0.580
2007-02-07 03:00:00    0.880
2007-02-07 04:00:00    0.791
2007-02-07 05:00:00      NaN   

I would like to split the TimeSeries wherever one or more consecutive NaN values occur. The goal is to separate the events:

Event1:
2007-02-06 15:00:00    0.780
2007-02-06 16:00:00    0.125
2007-02-06 17:00:00    0.875

Event2:
2007-02-06 19:00:00    0.565
2007-02-06 20:00:00    0.875
2007-02-06 21:00:00    0.910
2007-02-06 22:00:00    0.780

I could iterate over every row, but is there a smarter way to do this?

3 Answers:

Answer 0 (score: 7)

You can use numpy.split and then filter the resulting list. Here is an example, assuming the column holding the values is labeled "value":

import numpy as np

# split at every row position where "value" is NaN
events = np.split(df, np.where(np.isnan(df.value))[0])
# remove the NaN entries themselves
events = [ev[~np.isnan(ev.value)] for ev in events if not isinstance(ev, np.ndarray)]
# remove empty DataFrames (produced by consecutive NaNs)
events = [ev for ev in events if not ev.empty]

You will end up with a list containing all the events, separated by the NaN values.
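The approach above can be exercised end to end on the question's sample data. This sketch splits row *positions* with np.split rather than the DataFrame object itself, which sidesteps pandas/numpy version quirks, but the idea is identical:

```python
import numpy as np
import pandas as pd

# Rebuild the sample series from the question as a one-column DataFrame
idx = pd.date_range('2007-02-06 15:00', periods=15, freq='h')
vals = [0.780, 0.125, 0.875, np.nan, 0.565, 0.875, 0.910, 0.780,
        np.nan, np.nan, 0.780, 0.580, 0.880, 0.791, np.nan]
df = pd.DataFrame({'value': vals}, index=idx)

# Split the row positions at every NaN location
nan_pos = np.where(df['value'].isna())[0]
pieces = np.split(np.arange(len(df)), nan_pos)

# Drop the NaN rows from each piece and discard empty pieces
events = [df.iloc[ix].dropna() for ix in pieces]
events = [ev for ev in events if not ev.empty]

for i, ev in enumerate(events, 1):
    print(f'Event{i}:')
    print(ev)
```

This yields three events of 3, 4, and 4 rows: Event1 and Event2 from the question, plus the block starting at 01:00.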

Answer 1 (score: 4)

I was looking for an efficient solution for a very large and sparse dataset: in my case, many thousands of rows with only a dozen or so short segments of data between the NaN values. I (ab)used the internals of pandas SparseIndex, a feature that helps compress sparse datasets in memory.

Given some data:

import pandas as pd
import numpy as np

# 10 days at per-second resolution, starting at midnight Jan 1st, 2011
rng = pd.date_range('1/1/2011', periods=10 * 24 * 60 * 60, freq='S')
dense_ts = pd.Series(np.nan, index=rng, dtype=np.float64)

# Three blocks of non-null data throughout timeseries
dense_ts[500:510] = np.random.randn(10)
dense_ts[12000:12015] = np.random.randn(15)
dense_ts[20000:20050] = np.random.randn(50)

which looks like:

2011-01-01 00:00:00   NaN
2011-01-01 00:00:01   NaN
2011-01-01 00:00:02   NaN
2011-01-01 00:00:03   NaN
                       ..
2011-01-10 23:59:56   NaN
2011-01-10 23:59:57   NaN
2011-01-10 23:59:58   NaN
2011-01-10 23:59:59   NaN
Freq: S, Length: 864000, dtype: float64

We can find the blocks efficiently and easily:

# Convert to sparse, then query the index to find block locations
sparse_ts = dense_ts.to_sparse()
block_locs = zip(sparse_ts.sp_index.blocs, sparse_ts.sp_index.blengths)

# Map the sparse blocks back to the dense timeseries
# (.iloc's end index is exclusive, so start + length covers the whole block)
blocks = [dense_ts.iloc[start:start + length] for (start, length) in block_locs]

Voilà:

[2011-01-01 00:08:20    0.531793
 2011-01-01 00:08:21    0.484391
 2011-01-01 00:08:22    0.022686
 2011-01-01 00:08:23   -0.206495
 2011-01-01 00:08:24    1.472209
 2011-01-01 00:08:25   -1.261940
 2011-01-01 00:08:26   -0.696388
 2011-01-01 00:08:27   -0.219316
 2011-01-01 00:08:28   -0.474840
 Freq: S, dtype: float64, 2011-01-01 03:20:00   -0.147190
 2011-01-01 03:20:01    0.299565
 2011-01-01 03:20:02   -0.846878
 2011-01-01 03:20:03   -0.100975
 2011-01-01 03:20:04    1.288872
 2011-01-01 03:20:05   -0.092474
 2011-01-01 03:20:06   -0.214774
 2011-01-01 03:20:07   -0.540479
 2011-01-01 03:20:08   -0.661083
 2011-01-01 03:20:09    1.129878
 2011-01-01 03:20:10    0.791373
 2011-01-01 03:20:11    0.119564
 2011-01-01 03:20:12    0.345459
 2011-01-01 03:20:13   -0.272132
 Freq: S, dtype: float64, 2011-01-01 05:33:20    1.028268
 2011-01-01 05:33:21    1.476468
 2011-01-01 05:33:22    1.308881
 2011-01-01 05:33:23    1.458202
 2011-01-01 05:33:24   -0.874308
                              ..
 2011-01-01 05:34:02    0.941446
 2011-01-01 05:34:03   -0.996767
 2011-01-01 05:34:04    1.266660
 2011-01-01 05:34:05   -0.391560
 2011-01-01 05:34:06    1.498499
 2011-01-01 05:34:07   -0.636908
 2011-01-01 05:34:08    0.621681
 Freq: S, dtype: float64]

Answer 2 (score: 3)

For anyone looking for a non-deprecated (pandas >= 0.25.0) version of bloudermilk's answer: after some digging through the pandas sparse source code, I came up with the following. I tried to keep it as similar to their answer as possible so you can compare:

Given some data:

import pandas as pd
import numpy as np

# 10 days at per-second resolution, starting at midnight Jan 1st, 2011
rng = pd.date_range('1/1/2011', periods=10 * 24 * 60 * 60, freq='S')

# NaN data interspersed with 3 blocks of non-NaN data
dense_ts = pd.Series(np.nan, index=rng, dtype=np.float64)
dense_ts[500:510] = np.random.randn(10)
dense_ts[12000:12015] = np.random.randn(15)
dense_ts[20000:20050] = np.random.randn(50)

which looks like:

2011-01-01 00:00:00   NaN
2011-01-01 00:00:01   NaN
2011-01-01 00:00:02   NaN
2011-01-01 00:00:03   NaN
2011-01-01 00:00:04   NaN
                       ..
2011-01-10 23:59:55   NaN
2011-01-10 23:59:56   NaN
2011-01-10 23:59:57   NaN
2011-01-10 23:59:58   NaN
2011-01-10 23:59:59   NaN
Freq: S, Length: 864000, dtype: float64

We can find the blocks efficiently and easily:

# Convert to sparse, then query the index to find block locations
# (converting to sparse works differently in pandas>=0.25.0)
sparse_ts = dense_ts.astype(pd.SparseDtype('float'))

# in this version of pandas we need .values.sp_index.to_block_index()
block_index = sparse_ts.values.sp_index.to_block_index()
block_locs = zip(block_index.blocs, block_index.blengths)

# Map the sparse blocks back to the dense timeseries
# (.iloc's end index is exclusive, so start + length covers the whole block)
blocks = [
    dense_ts.iloc[start : start + length]
    for (start, length) in block_locs
]

> blocks
[2011-01-01 00:08:20    0.092338
 2011-01-01 00:08:21    1.196703
 2011-01-01 00:08:22    0.936586
 2011-01-01 00:08:23   -0.354768
 2011-01-01 00:08:24   -0.209642
 2011-01-01 00:08:25   -0.750103
 2011-01-01 00:08:26    1.344343
 2011-01-01 00:08:27    1.446148
 2011-01-01 00:08:28    1.174443
 Freq: S, dtype: float64,
 2011-01-01 03:20:00    1.327026
 2011-01-01 03:20:01   -0.431162
 2011-01-01 03:20:02   -0.461407
 2011-01-01 03:20:03   -1.330671
 2011-01-01 03:20:04   -0.892480
 2011-01-01 03:20:05   -0.323433
 2011-01-01 03:20:06    2.520965
 2011-01-01 03:20:07    0.140757
 2011-01-01 03:20:08   -1.688278
 2011-01-01 03:20:09    0.856346
 2011-01-01 03:20:10    0.013968
 2011-01-01 03:20:11    0.204514
 2011-01-01 03:20:12    0.287756
 2011-01-01 03:20:13   -0.727400
 Freq: S, dtype: float64,
 2011-01-01 05:33:20   -1.409744
 2011-01-01 05:33:21    0.338251
 2011-01-01 05:33:22    0.215555
 2011-01-01 05:33:23   -0.309874
 2011-01-01 05:33:24    0.753737
 2011-01-01 05:33:25   -0.349966
 2011-01-01 05:33:26    0.074758
 2011-01-01 05:33:27   -1.574485
 2011-01-01 05:33:28    0.595844
 2011-01-01 05:33:29   -0.670004
 2011-01-01 05:33:30    1.655479
 2011-01-01 05:33:31   -0.362853
 2011-01-01 05:33:32    0.167355
 2011-01-01 05:33:33    0.703780
 2011-01-01 05:33:34    2.633756
 2011-01-01 05:33:35    1.898891
 2011-01-01 05:33:36   -1.129365
 2011-01-01 05:33:37   -0.765057
 2011-01-01 05:33:38    0.279869
 2011-01-01 05:33:39    1.388705
 2011-01-01 05:33:40   -1.420761
 2011-01-01 05:33:41    0.455692
 2011-01-01 05:33:42    0.367106
 2011-01-01 05:33:43    0.856598
 2011-01-01 05:33:44    1.920748
 2011-01-01 05:33:45    0.648581
 2011-01-01 05:33:46   -0.606784
 2011-01-01 05:33:47   -0.246285
 2011-01-01 05:33:48   -0.040520
 2011-01-01 05:33:49    1.422764
 2011-01-01 05:33:50   -1.686568
 2011-01-01 05:33:51    1.282430
 2011-01-01 05:33:52    1.358482
 2011-01-01 05:33:53   -0.998765
 2011-01-01 05:33:54   -0.009527
 2011-01-01 05:33:55    0.647671
 2011-01-01 05:33:56   -1.098435
 2011-01-01 05:33:57   -0.638245
 2011-01-01 05:33:58   -1.820668
 2011-01-01 05:33:59    0.768250
 2011-01-01 05:34:00   -1.029975
 2011-01-01 05:34:01   -0.744205
 2011-01-01 05:34:02    1.627130
 2011-01-01 05:34:03    2.058689
 2011-01-01 05:34:04   -1.194971
 2011-01-01 05:34:05    1.293214
 2011-01-01 05:34:06    0.029523
 2011-01-01 05:34:07   -0.405785
 2011-01-01 05:34:08    0.837123
 Freq: S, dtype: float64]
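As a footnote: if you would rather not touch the sparse internals at all, the same block boundaries can be recovered from the notna mask with plain NumPy. This is a minimal sketch of that alternative on the same data as above, not part of the original answers:

```python
import numpy as np
import pandas as pd

# Same data as above: 10 days at per-second resolution, three non-NaN blocks
rng = pd.date_range('1/1/2011', periods=10 * 24 * 60 * 60, freq='s')
dense_ts = pd.Series(np.nan, index=rng, dtype=np.float64)
dense_ts[500:510] = np.random.randn(10)
dense_ts[12000:12015] = np.random.randn(15)
dense_ts[20000:20050] = np.random.randn(50)

# Pad the mask with False on both ends so runs touching the edges are handled,
# then diff it: +1 marks a run start, -1 marks the position just past a run end
mask = dense_ts.notna().to_numpy()
padded = np.concatenate(([False], mask, [False]))
edges = np.diff(padded.astype(np.int8))
starts = np.flatnonzero(edges == 1)
ends = np.flatnonzero(edges == -1)

blocks = [dense_ts.iloc[s:e] for s, e in zip(starts, ends)]
```

This avoids private APIs entirely, at the cost of materializing the boolean mask for the full dense index.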