Question

I am working with a pd.Series like this one:

2013-01-02        NaN
2013-01-03        NaN
2013-01-04        NaN
2013-01-07   1.000000
2013-01-08   1.000000
2013-01-09   1.000000
2013-01-10   1.000000
2013-01-11   1.000000
2013-01-14   1.000000
2013-01-15   1.000000
2013-01-16   1.000000
2013-01-24   1.000000
2013-01-25   1.000000
2013-01-31   1.000000
2013-02-01          0
2013-02-04          0
2013-02-05          0
2013-02-11  -1.000000
2013-02-12  -1.000000
2013-02-13  -1.000000
2013-02-14          0
2013-02-15          0
2013-02-18          0

What I want is to get a series like this:

2013-01-02        NaN
2013-01-03        NaN
2013-01-04        NaN
2013-01-07   1.000000
2013-01-08   1.000000
2013-01-09   1.000000
2013-01-10   1.000000
2013-01-11   1.000000
2013-01-14   1.000000
2013-01-15   1.000000
2013-01-16   1.000000
2013-01-24   1.000000
2013-01-25   1.000000
2013-01-31   1.000000
2013-02-01          0
2013-02-04          0
2013-02-05          0
2013-02-11   2.000000
2013-02-12   2.000000
2013-02-13   2.000000
2013-02-14          0
2013-02-15          0
2013-02-18          0

I want to number the runs of consecutive non-zero, non-NaN values. I can't come up with a vectorized way to do it.

Answer 1

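It's a little tricky, but it uses a pattern that comes up a lot when dealing with contiguous groups. (In fact we really need to improve the support for contiguous groups, but I think that requires a change in the underlying data structure, which is why no one has touched it yet.)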

One way:

>>> cl = (ser.notnull() & (ser != 0))
>>> labels = ((cl != cl.shift()) & cl).cumsum() * cl + (ser * 0)
>>> labels
2013-01-02   NaN
2013-01-03   NaN
2013-01-04   NaN
2013-01-07     1
2013-01-08     1
2013-01-09     1
2013-01-10     1
2013-01-11     1
2013-01-14     1
2013-01-15     1
2013-01-16     1
2013-01-24     1
2013-01-25     1
2013-01-31     1
2013-02-01     0
2013-02-04     0
2013-02-05     0
2013-02-11     2
2013-02-12     2
2013-02-13     2
2013-02-14     0
2013-02-15     0
2013-02-18     0
dtype: float64
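
To try the one-liner above outside the REPL, here is a minimal self-contained sketch. It rebuilds the question's data by hand; using a DatetimeIndex for the dates is an assumption, since the question does not say what the index type is.

```python
import numpy as np
import pandas as pd

# Rebuild the series from the question (DatetimeIndex is an assumption).
dates = pd.to_datetime([
    "2013-01-02", "2013-01-03", "2013-01-04", "2013-01-07", "2013-01-08",
    "2013-01-09", "2013-01-10", "2013-01-11", "2013-01-14", "2013-01-15",
    "2013-01-16", "2013-01-24", "2013-01-25", "2013-01-31", "2013-02-01",
    "2013-02-04", "2013-02-05", "2013-02-11", "2013-02-12", "2013-02-13",
    "2013-02-14", "2013-02-15", "2013-02-18",
])
values = [np.nan] * 3 + [1.0] * 11 + [0.0] * 3 + [-1.0] * 3 + [0.0] * 3
ser = pd.Series(values, index=dates)

# Same two lines as in the answer.
cl = ser.notnull() & (ser != 0)
labels = ((cl != cl.shift()) & cl).cumsum() * cl + (ser * 0)

print(labels)
```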

Some explanation follows. (To keep this short, I'll suppress a lot of the repetition.)

First, we select the values we want to label:

>>> cl = (ser.notnull() & (ser != 0))
>>> cl
2013-01-02    False
2013-01-03    False
2013-01-04    False
2013-01-07     True
...
2013-01-31     True
2013-02-01    False
2013-02-04    False
2013-02-05    False
2013-02-11     True
2013-02-12     True
2013-02-13     True
2013-02-14    False
2013-02-15    False
2013-02-18    False
dtype: bool

Now we find where each cluster begins by comparing the mask with a shifted version of itself:

>>> cl != cl.shift()
2013-01-02     True
2013-01-03    False
2013-01-04    False
2013-01-07     True
2013-01-08    False
2013-01-09    False
...
2013-01-31    False
2013-02-01     True
2013-02-04    False
2013-02-05    False
2013-02-11     True
2013-02-12    False
2013-02-13    False
2013-02-14     True
2013-02-15    False
2013-02-18    False
dtype: bool

But we only want the starts of the clusters we actually want to number:

>>> (cl != cl.shift()) & cl
2013-01-02    False
2013-01-03    False
2013-01-04    False
2013-01-07     True
2013-01-08    False
...
2013-02-05    False
2013-02-11     True
2013-02-12    False
2013-02-13    False
2013-02-14    False
2013-02-15    False
2013-02-18    False
dtype: bool
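
The two steps above (detect a change, then keep only changes that open a True run) can be seen in isolation on a toy mask. The data here is hypothetical and only meant to make the pattern easy to check by hand.

```python
import pandas as pd

# Toy mask: True marks values we want to label (hypothetical data).
cl = pd.Series([False, True, True, False, True])

# True wherever a row differs from the previous row.
# The first row is compared against NaN, so it always counts as a change.
changed = cl != cl.shift()

# Keep only the changes that begin a True run.
starts = changed & cl

print(changed.tolist())  # [True, True, False, True, True]
print(starts.tolist())   # [False, True, False, False, True]
```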

When we take the cumulative sum of these, because True == 1 and False == 0, we get a new number for each group:

>>> ((cl != cl.shift()) & cl).cumsum()
2013-01-02    0
2013-01-03    0
2013-01-04    0
2013-01-07    1
2013-01-08    1
...
2013-02-05    1
2013-02-11    2
2013-02-12    2
2013-02-13    2
2013-02-14    2
2013-02-15    2
2013-02-18    2
dtype: int64
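
As a small illustration of why the cumulative sum works as a run counter, here is a sketch on a hypothetical mask of run starts. Note that the counter keeps its value after a run ends, which is why a further masking step is needed.

```python
import pandas as pd

# Hypothetical mask: True marks the first row of each run.
starts = pd.Series([False, True, False, False, True, False])

# Booleans are summed as 0/1, so the total steps up by one at every run start.
run_ids = starts.cumsum()

print(run_ids.tolist())  # [0, 1, 1, 1, 2, 2]
```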

But we don't want to number the rows that aren't members of a cluster:

>>> ((cl != cl.shift()) & cl).cumsum() * cl
2013-01-02    0
2013-01-03    0
2013-01-04    0
2013-01-07    1
2013-01-08    1
...
2013-01-31    1
2013-02-01    0
2013-02-04    0
2013-02-05    0
2013-02-11    2
2013-02-12    2
2013-02-13    2
2013-02-14    0
2013-02-15    0
2013-02-18    0
dtype: int64

Finally, we want to preserve the original NaNs:

>>> ((cl != cl.shift()) & cl).cumsum() * cl + (ser * 0)
2013-01-02   NaN
2013-01-03   NaN
2013-01-04   NaN
2013-01-07     1
2013-01-08     1
...
2013-01-31     1
2013-02-01     0
2013-02-04     0
2013-02-05     0
2013-02-11     2
2013-02-12     2
2013-02-13     2
2013-02-14     0
2013-02-15     0
2013-02-18     0
dtype: float64
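
If this is needed in more than one place, the steps can be wrapped in a small helper. This is only a sketch: the name label_runs is hypothetical, and it uses two variations on the answer's expression, shift(fill_value=False) so the mask stays boolean, and Series.where instead of adding ser * 0 to put the NaNs back.

```python
import numpy as np
import pandas as pd


def label_runs(ser: pd.Series) -> pd.Series:
    """Number consecutive runs of non-zero, non-NaN values.

    Rows with 0 get label 0; rows with NaN stay NaN.
    (Hypothetical helper, not part of pandas.)
    """
    cl = ser.notnull() & (ser != 0)             # rows we want to label
    starts = cl & ~cl.shift(fill_value=False)   # first row of each run
    labels = starts.cumsum() * cl               # run number, 0 outside runs
    return labels.where(ser.notnull())          # restore the original NaNs


ser = pd.Series([np.nan, 1.0, 1.0, 0.0, -1.0, -1.0, 0.0, 3.0])
result = label_runs(ser)
print(result.tolist())  # [nan, 1.0, 1.0, 0.0, 2.0, 2.0, 0.0, 3.0]
```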
