Question

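Problem description

I am looking for an efficient way to identify all sub-ranges of a pandas Index object in which the same value repeats consecutively.

Example problem

As a simple example, consider the following pandas Index object: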

import pandas as pd
idx = pd.Index(['X', 'C', 'C', 'C', 'Q', 'Q', 'Q', 'Q', 'A', 'P', 'P'])

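In this example, the value C repeats from position 1 to 3, the value Q from position 4 to 7, and the value P from position 9 to 10. The result I am after is a list of tuples (or something similar) like this: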

[(1, 3, 'C'), (4, 7, 'Q'), (9, 10, 'P')]

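What I have tried so far

I have been experimenting with the pandas.Index.duplicated method, but with that alone I could not get the desired result.

Edit:

Many thanks to everyone for the answers. I have a follow-up question. Suppose the Index also contains non-consecutive duplicate values, as in this example (where the value X appears more than once):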

idx = pd.Index(['X', 'C', 'C', 'C', 'Q', 'Q', 'Q', 'Q', 'X', 'P', 'P'])

How can I get a result that ignores the X values? That is, how can I get the following result for this example:

[(1, 3, 'C'), (4, 7, 'Q'), (9, 10, 'P')]

Answer 1

Here is one way:

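As a minimal sketch of one way to do this (an illustration using `itertools.groupby` from the standard library, with a hypothetical helper name `consecutive_runs`), adjacent equal values can be grouped and only the runs longer than one element kept:

```python
from itertools import groupby

import pandas as pd


def consecutive_runs(idx):
    """Return (start, end, value) for each run of 2+ equal adjacent values."""
    runs = []
    pos = 0  # position of the first element of the current run
    for value, group in groupby(idx):
        length = sum(1 for _ in group)
        if length > 1:
            runs.append((pos, pos + length - 1, value))
        pos += length
    return runs


idx = pd.Index(['X', 'C', 'C', 'C', 'Q', 'Q', 'Q', 'Q', 'X', 'P', 'P'])
print(consecutive_runs(idx))
# [(1, 3, 'C'), (4, 7, 'Q'), (9, 10, 'P')]
```

Because only adjacent values are grouped, a value such as X that repeats non-consecutively is skipped without extra handling.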

Answer 2

Original question

idx = pd.Index(['X', 'C', 'C', 'C', 'Q', 'Q', 'Q', 'Q', 'A', 'P', 'P'])

A bit unorthodox, but it should work, and it also seems to be considerably faster:

import numpy as np

# Get a new Index holding the unique values that are duplicated in `idx`
un = idx[idx.duplicated(keep=False)].unique()

# Call `get_loc` on `idx` for each member of `un` above.
# For a non-unique, non-monotonic Index, `get_loc` returns a boolean mask;
# `np.where` gives the positions where that mask is True.
res = []
for i in un:
    w = np.where(idx.get_loc(i))[0]
    # w[0], w[-1] are analogous to v.min(), v.max() from @MaxU's answer
    res.append((int(w[0]), int(w[-1]), i))

print(res)
# [(1, 3, 'C'), (4, 7, 'Q'), (9, 10, 'P')]

Timings:

%timeit myanswer()
105 µs ± 3.19 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit maxu()
1.21 ms ± 116 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
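To repeat the measurement outside IPython, something like the following sketch could be used. It assumes `myanswer()` simply wraps the loop shown above (the function body and the repetition count `n` are assumptions), and it does not reproduce `maxu()`:

```python
import timeit

import numpy as np
import pandas as pd

idx = pd.Index(['X', 'C', 'C', 'C', 'Q', 'Q', 'Q', 'Q', 'A', 'P', 'P'])


def myanswer():
    # Assumed body: the get_loc-based loop from this answer
    un = idx[idx.duplicated(keep=False)].unique()
    res = []
    for i in un:
        w = np.where(idx.get_loc(i))[0]
        res.append((int(w[0]), int(w[-1]), i))
    return res


n = 1000  # number of calls to average over (arbitrary choice)
seconds = timeit.timeit(myanswer, number=n)
print(f"{seconds / n * 1e6:.1f} µs per call")
```

The absolute numbers depend on the machine and on the pandas and NumPy versions, so only the relative difference between approaches is meaningful.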

Without comments:

un = idx[idx.duplicated(keep=False)].unique()
res = []
for i in un:
    w = np.where(idx.get_loc(i))[0]
    res.append((int(w[0]), int(w[-1]), i))

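Edited question

idx = pd.Index(['X', 'C', 'C', 'C', 'Q', 'Q', 'Q', 'Q', 'X', 'P', 'P'])

To get un, first build a boolean mask that is True where a value equals the value immediately before or after it, and False otherwise. This plays the same role as idx.duplicated(keep=False) in the first part.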

s = pd.Series(idx)
b = ((s.shift() == s) | (s.shift(-1) == s)).to_numpy()
un = idx[b].unique()
# The rest is the same as for the original question
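Put together, the edited case might look like the sketch below. It only combines the neighbour mask with the `get_loc` loop from the first part; the variable name `s` is introduced here for readability.

```python
import numpy as np
import pandas as pd

idx = pd.Index(['X', 'C', 'C', 'C', 'Q', 'Q', 'Q', 'Q', 'X', 'P', 'P'])

s = pd.Series(idx)
# True where a value equals its previous or its next neighbour
b = ((s.shift() == s) | (s.shift(-1) == s)).to_numpy()
un = idx[b].unique()

res = []
for i in un:
    # Boolean mask of all positions holding `i`, converted to positions
    w = np.where(idx.get_loc(i))[0]
    res.append((int(w[0]), int(w[-1]), i))

print(res)
# [(1, 3, 'C'), (4, 7, 'Q'), (9, 10, 'P')]
```

Note that this relies on each repeated value forming a single run. If a value occurs in a run and also on its own elsewhere (for example `['X', 'X', 'C', 'X']`), `get_loc` returns every position of that value, so the reported end would include the stray occurrence.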
