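Group array elements by occurrence, keep order, and get the first and last indices

Time: 2019-02-03 13:52:41

Tags: python python-3.x pandas numpy

I would like to know whether there is a better way in pandas to achieve the same result: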

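import numpy as np
import pandas as pd

x = [1, 1, 1, 2, 2, 2, 3, 3, 3, 5, 5, 1, 1, 2, 2]
x = np.asarray(x)
print(x)

df = pd.DataFrame(columns=['id', 'start', 'end'])

if len(x) > 1:
    i = 0  # start index of the current run
    for j in range(1, len(x)):
        if x[j] != x[j-1]:
            # the run ended at j-1: record it and start a new run at j
            df.loc[len(df)] = [x[i], i, j-1]
            i = j
    # record the final run, which ends at the last index
    df.loc[len(df)] = [x[i], i, j]
elif len(x) == 1:
    df.loc[len(df)] = [x[0], 0, 0]

print(df)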

The output looks like this:

[1 1 1 2 2 2 3 3 3 5 5 1 1 2 2]
  id start end
0  1     0   2
1  2     3   5
2  3     6   8
3  5     9  10
4  1    11  12
5  2    13  14

Thanks for any helpful hints.

4 answers:

Answer 0: (score: 3)

Here is one way to do it with numpy:

import numpy as np
import pandas as pd

x = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 5, 5, 1, 1, 2, 2])

# Indices where the value changes, i.e. the last index of every run except the final one
d = np.flatnonzero(np.diff(x) != 0)
# array([ 2,  5,  8, 10, 12])

# First index of each run
start = np.hstack([0, d + 1])
# array([ 0,  3,  6,  9, 11, 13])

# Last index of each run
end = np.hstack([d, len(x) - 1])
# array([ 2,  5,  8, 10, 12, 14])

# Value of each run, read at its first index
vals = x[start]
# array([1, 2, 3, 5, 1, 2])

pd.DataFrame({'id': vals, 'start': start, 'end': end})

    id  start  end
0   1      0    2
1   2      3    5
2   3      6    8
3   5      9   10
4   1     11   12
5   2     13   14
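
The steps above can be wrapped in a small function that also covers inputs the example does not exercise. This is only a sketch: the name `run_bounds` is hypothetical, and returning an empty frame for empty input is an assumption about the desired behaviour rather than something stated in the question.

```python
import numpy as np
import pandas as pd


def run_bounds(values):
    """Return one row per run of equal consecutive values (hypothetical helper)."""
    x = np.asarray(values)
    if x.size == 0:
        # Assumption: no input means no runs, so return an empty frame
        return pd.DataFrame({'id': [], 'start': [], 'end': []})
    # Positions i where x[i] != x[i + 1]: the last index of every run but the final one
    d = np.flatnonzero(x[1:] != x[:-1])
    start = np.concatenate(([0], d + 1))
    end = np.concatenate((d, [x.size - 1]))
    # The value of each run is read at its first index
    return pd.DataFrame({'id': x[start], 'start': start, 'end': end})


result = run_bounds([1, 1, 1, 2, 2, 2, 3, 3, 3, 5, 5, 1, 1, 2, 2])
print(result)
```

Reading the run values with `x[start]` keeps the result correct when the first and last elements of the input happen to be equal, for example `[1, 1, 2, 2, 1]`.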

Answer 1: (score: 3)

You can do the following using pandas alone:

import numpy as np
import pandas as pd

x = [1, 1, 1, 2, 2, 2, 3, 3, 3, 5, 5, 1, 1, 2, 2]

s = pd.Series(x)

# store group-by to avoid repetition
groups = s.groupby((s != s.shift()).cumsum())

# get id and size for each group
ids, size = groups.first(), groups.size()

# get start
start = size.cumsum().shift().fillna(0).astype(np.int32)

# get end
end = (start + size - 1)

df = pd.DataFrame({'id': ids, 'start': start, 'end': end}, columns=['id', 'start', 'end'])

print(df)

Output

   id  start  end
1   1      0    2
2   2      3    5
3   3      6    8
4   5      9   10
5   1     11   12
6   2     13   14
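
The grouping key `(s != s.shift()).cumsum()` is the core of this answer, so it may help to see its intermediate values on a shorter input. The sketch below uses a made-up five-element series; the names `changed` and `run_id` are illustrative only.

```python
import pandas as pd

s = pd.Series([1, 1, 2, 2, 1])

# True wherever a value differs from the one before it. The first element
# is compared with NaN, so it always counts as a change.
changed = s != s.shift()

# The cumulative sum turns the change flags into one label per run
run_id = changed.cumsum()

print(changed.tolist())  # [True, False, True, False, True]
print(run_id.tolist())   # [1, 1, 2, 2, 3]
```

Because the labels start at 1, the resulting frame is indexed from 1 rather than 0, as in the output above. Note also that `NaN != NaN` is true, so with this key every NaN in the data would begin a run of its own.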

Answer 2: (score: 3)

Another solution:

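import pandas as pd

df = pd.DataFrame(data=[1, 1, 1, 2, 2, 2, 3, 3, 3, 5, 5, 1, 1, 2, 2], columns=['id'])

# Label each run of equal consecutive values, then group the 'id' column by that label
g = df.groupby((df.id != df.id.shift()).cumsum())['id']

# Within a run all values are equal, so duplicated(keep='last') is True for every
# row except the last: idxmax() gives the run's first index, idxmin() its last
df_new = pd.concat([g.first(),
                    g.apply(lambda x: x.duplicated(keep='last').idxmax()),
                    g.apply(lambda x: x.duplicated(keep='last').idxmin())], axis=1)

df_new.columns = ['id', 'start', 'end']
print(df_new)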

    id  start  end
id                
1    1      0    2
2    2      3    5
3    3      6    8
4    5      9   10
5    1     11   12
6    2     13   14
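
The `duplicated(...).idxmax()` and `idxmin()` calls above end up selecting the first and last index label of each run. A minimal sketch of the same idea stated more directly follows, assuming pandas 0.25 or later (for named aggregation) and a default integer index; the names `run` and `out` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 3, 3, 3, 5, 5, 1, 1, 2, 2]})

# One label per run of equal consecutive values
run = (df['id'] != df['id'].shift()).cumsum().rename('run')

out = (
    df.reset_index()   # expose the row positions as a column named 'index'
      .groupby(run)
      .agg(id=('id', 'first'), start=('index', 'min'), end=('index', 'max'))
      .reset_index(drop=True)
)
print(out)
```

This avoids calling a Python lambda once per group, which may matter on long inputs, although no timings were measured here.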

Answer 3: (score: 0)

Using itertools.groupby:

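import pandas as pd
from itertools import groupby

x = [1, 1, 1, 2, 2, 2, 3, 3, 3, 5, 5, 1, 1, 2, 2]
rows = []
# Group (index, value) pairs by value; each group is one run of equal values
for run in [list(g) for _, g in groupby(enumerate(x), lambda pair: pair[1])]:
    # (value, first index, last index)
    rows.append((run[0][1], run[0][0], run[-1][0]))

print(pd.DataFrame(rows, columns=['id', 'start', 'end']))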

Output:

   id  start  end
0   1      0    2
1   2      3    5
2   3      6    8
3   5      9   10
4   1     11   12
5   2     13   14
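
The same `itertools.groupby` idea can be written as a generator that never builds a list per run, which may be useful when the input is a long stream. This is a sketch under that assumption; `iter_runs` is a hypothetical name, not part of any library.

```python
from itertools import groupby
from operator import itemgetter


def iter_runs(values):
    """Yield (value, start, end) for each run of equal consecutive values."""
    for value, group in groupby(enumerate(values), key=itemgetter(1)):
        # The first (index, value) pair of the group marks the start of the run
        start, _ = next(group)
        end = start
        # Exhaust the rest of the group; the last index seen is the end of the run
        for end, _ in group:
            pass
        yield value, start, end


x = [1, 1, 1, 2, 2, 2, 3, 3, 3, 5, 5, 1, 1, 2, 2]
runs = list(iter_runs(x))
print(runs)
```

The resulting tuples can be passed to `pd.DataFrame(runs, columns=['id', 'start', 'end'])` to obtain the same table as above.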