How can I extract segments with the same value?

Time: 2019-04-30 03:43:01

Tags: python pandas dataframe

I have the following DataFrame:

import pandas as pd

df = pd.DataFrame({'col1': range(20),
                   'col2': list(range(3)) + [5]*3 + list(range(3)) + [3]*3 + list(range(4)) + [2]*3 + [4]},
                  index=pd.date_range('1/1/2000', periods=20, freq='1s'))

df
Out[115]: 
                     col1  col2
2000-01-01 00:00:00     0     0
2000-01-01 00:00:01     1     1
2000-01-01 00:00:02     2     2
2000-01-01 00:00:03     3     5 *
2000-01-01 00:00:04     4     5 *
2000-01-01 00:00:05     5     5 *
2000-01-01 00:00:06     6     0
2000-01-01 00:00:07     7     1
2000-01-01 00:00:08     8     2
2000-01-01 00:00:09     9     3 *
2000-01-01 00:00:10    10     3 *
2000-01-01 00:00:11    11     3 *
2000-01-01 00:00:12    12     0
2000-01-01 00:00:13    13     1
2000-01-01 00:00:14    14     2
2000-01-01 00:00:15    15     3
2000-01-01 00:00:16    16     2 *
2000-01-01 00:00:17    17     2 *
2000-01-01 00:00:18    18     2 *
2000-01-01 00:00:19    19     4

As you can see above, col2 contains three segments of consecutive identical values (marked with *). I want to extract these three segments:

                     col1  col2
2000-01-01 00:00:03     3     5
2000-01-01 00:00:04     4     5
2000-01-01 00:00:05     5     5

                     col1  col2
2000-01-01 00:00:09     9     3
2000-01-01 00:00:10    10     3
2000-01-01 00:00:11    11     3

                     col1  col2
2000-01-01 00:00:16    16     2
2000-01-01 00:00:17    17     2
2000-01-01 00:00:18    18     2

How can I achieve this?

2 answers:

Answer 0 (score: 2)

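Here is one way. Use diff and cumsum to give each run of consecutive equal values its own group id, then use transform with count to get the size of each group and keep only the rows whose group size equals 3. Finally, groupby on col2 splits the filtered DataFrame into one piece per value. Because groupby sorts by key, the resulting list is ordered by the col2 value, so l[0] is the segment where col2 equals 2.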

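# give each run of consecutive equal values in col2 its own id
s = df.col2.diff().ne(0).cumsum()
# keep rows whose run has exactly 3 rows, then split by col2 value
l = [y for x, y in df[s.groupby(s).transform('count') == 3].groupby('col2')]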
l[0]
Out[205]: 
                     col1  col2
2000-01-01 00:00:16    16     2
2000-01-01 00:00:17    17     2
2000-01-01 00:00:18    18     2
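
To make the intermediate steps easier to follow, here is a minimal sketch that is not part of the original answer. The names run_id, run_len, and segments are my own. It also splits on the run id instead of on col2, which is an assumption about what is wanted: it keeps the segments in order of appearance and would keep two separate runs that happen to share a value apart.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "col1": range(20),
        "col2": list(range(3)) + [5] * 3 + list(range(3)) + [3] * 3
        + list(range(4)) + [2] * 3 + [4],
    },
    index=pd.date_range("2000-01-01", periods=20, freq="s"),
)

# diff() is 0 wherever a row repeats the previous value, so ne(0) flags
# the first row of every run and cumsum() turns the flags into run ids.
run_id = df["col2"].diff().ne(0).cumsum()

# Number of rows in the run that each row belongs to.
run_len = run_id.groupby(run_id).transform("count")

# Show the helper series next to col2 for the first nine rows.
print(pd.DataFrame({"col2": df["col2"], "run_id": run_id, "run_len": run_len}).head(9))

# Keep runs of exactly 3 rows and split on run_id (assumption: order of
# appearance matters more than the col2 value).
mask = run_len == 3
segments = [seg for _, seg in df[mask].groupby(run_id[mask])]

for seg in segments:
    print(seg, end="\n\n")
```

With the sample data, segments holds the col2 == 5 block first, then the col2 == 3 block, then the col2 == 2 block.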

Answer 1 (score: 1)

Here is my take:

import pandas as pd

df = pd.DataFrame({'col1': range(20),
                   'col2': list(range(3)) + [5]*3 + list(range(3)) + [3]*3 + list(range(4)) + [2]*3 + [4]},
                  index=pd.date_range('1/1/2000', periods=20, freq='1s'))

# flag each row that equals the next two rows, then number the flagged starts
df['markers'] = ((df.col2 == df.col2.shift(-1)) & (df.col2 == df.col2.shift(-2))).cumsum()

# drop the rows before the first segment
new_df = df[df['markers'] > 0].copy()

# keep the first three rows of each marked group
new_df.groupby('markers')[['col1', 'col2']].apply(lambda x: x[:3])

Output:

+----------+----------------------+-------+------+
|          |                      | col1  | col2 |
+----------+----------------------+-------+------+
| markers  |                      |       |      |
+----------+----------------------+-------+------+
| 1        | 2000-01-01 00:00:03  |    3  |    5 |
|          | 2000-01-01 00:00:04  |    4  |    5 |
|          | 2000-01-01 00:00:05  |    5  |    5 |
| 2        | 2000-01-01 00:00:09  |    9  |    3 |
|          | 2000-01-01 00:00:10  |   10  |    3 |
|          | 2000-01-01 00:00:11  |   11  |    3 |
| 3        | 2000-01-01 00:00:16  |   16  |    2 |
|          | 2000-01-01 00:00:17  |   17  |    2 |
|          | 2000-01-01 00:00:18  |   18  |    2 |
+----------+----------------------+-------+------+
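
One thing to keep in mind with the shift-based markers is that they are tailored to runs of exactly three rows: in a run of four equal values, two rows equal their next two neighbours, so the run gets two marker starts. If longer runs can occur, a sketch along the following lines might help. extract_runs and its min_len parameter are hypothetical names of mine, not a pandas API, and the demo data is made up for illustration.

```python
import pandas as pd


def extract_runs(frame, column, min_len=3):
    """Hypothetical helper: list the runs of consecutive equal values in
    `column` that are at least `min_len` rows long, in order of appearance."""
    values = frame[column]
    # A new run starts wherever the value differs from the previous row.
    run_id = values.ne(values.shift()).cumsum()
    run_len = run_id.groupby(run_id).transform("count")
    keep = run_len >= min_len
    return [seg for _, seg in frame[keep].groupby(run_id[keep])]


# Made-up data: one run of four 7s and one run of three 2s.
demo = pd.DataFrame({"col2": [7, 7, 7, 7, 1, 2, 2, 2]})

# Shift-based condition: True where a row equals the next two rows.
starts = (demo.col2 == demo.col2.shift(-1)) & (demo.col2 == demo.col2.shift(-2))
print(int(starts.sum()))  # 3 marker starts, although there are only 2 runs

runs = extract_runs(demo, "col2")
print([seg["col2"].tolist() for seg in runs])
```

Here extract_runs returns the four 7s as one segment and the three 2s as another. Whether a run longer than three should count as one segment is a design choice that the question does not settle.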