Question

I want to retrieve the separate parts of the following dataframe where LabelId = 1. In other words, given the following dataframe:

DF_input:

   eventTime                 velocity     LabelId
1  2017-08-19 12:53:55.050         3        0
2  2017-08-19 12:53:55.100         4        1
3  2017-08-19 12:53:55.150       180        1
4  2017-08-19 12:53:55.200         2        1
5  2017-08-19 12:53:55.250         5        0
6  2017-08-19 12:53:55.050         3        0
7  2017-08-19 12:53:55.100         4        1
8  2017-08-19 12:53:55.150        70        1
9  2017-08-19 12:53:55.200         2        1
10 2017-08-19 12:53:55.250         5        0

DF_output_1

   eventTime                 velocity     LabelId 
2  2017-08-19 12:53:55.100         4        1
3  2017-08-19 12:53:55.150       180        1
4  2017-08-19 12:53:55.200         2        1

DF_output_2

   eventTime                 velocity     LabelId
7  2017-08-19 12:53:55.100         4        1
8  2017-08-19 12:53:55.150        70        1
9  2017-08-19 12:53:55.200         2        1

My attempt was to use the condition DF_input["LabelId"] == 1, but it returns all matching rows in a single dataframe, so I cannot tell the two subsets apart.
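
For reference, a minimal sketch of that attempt, assuming the frame is rebuilt by hand (the name `df_input` and the construction below are illustrative, not taken from the original post). It shows the mask keeping all six matching rows in one frame, with only the gap in the index hinting that there are two runs.

```python
import pandas as pd

# Hypothetical reconstruction of DF_input from the question (index 1..10).
times = ["12:53:55.050", "12:53:55.100", "12:53:55.150",
         "12:53:55.200", "12:53:55.250"] * 2
df_input = pd.DataFrame(
    {
        "eventTime": ["2017-08-19 " + t for t in times],
        "velocity": [3, 4, 180, 2, 5, 3, 4, 70, 2, 5],
        "LabelId": [0, 1, 1, 1, 0, 0, 1, 1, 1, 0],
    },
    index=range(1, 11),
)

# The boolean mask keeps every matching row in a single frame.
matched = df_input[df_input["LabelId"] == 1]
print(matched.index.tolist())
```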

Answer 1

Something like this:

l=[ None if df1[df1.LabelId==1].empty  else df1[df1.LabelId==1] for _, df1 in df.groupby(df.LabelId.eq(0).cumsum())]
l
Out[402]: 
[                eventTime  velocity  LabelId
 2  2017-08-1912:53:55.100         4        1
 3  2017-08-1912:53:55.150       180        1
 4  2017-08-1912:53:55.200         2        1,
 None,
                 eventTime  velocity  LabelId
 7  2017-08-1912:53:55.100         4        1
 8  2017-08-1912:53:55.150        70        1
 9  2017-08-1912:53:55.200         2        1,
 None]

Details of the new group key:

df.LabelId.eq(0).cumsum()
Out[398]: 
1     1
2     1
3     1
4     1
5     2
6     3
7     3
8     3
9     3
10    4
Name: LabelId, dtype: int32
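
To make the key concrete: every 0 in LabelId bumps the counter, so each run of 1s shares a key with the 0 that precedes it. Below is a minimal sketch of that idea, assuming a cut-down frame with only the velocity and LabelId columns (the names `key`, `ones`, and `parts` are illustrative). Filtering before grouping is a variation on the one-liner above, not the answerer's exact code; it avoids the None placeholders.

```python
import pandas as pd

# Cut-down stand-in for the question's frame (eventTime omitted).
df = pd.DataFrame(
    {
        "velocity": [3, 4, 180, 2, 5, 3, 4, 70, 2, 5],
        "LabelId": [0, 1, 1, 1, 0, 0, 1, 1, 1, 0],
    },
    index=range(1, 11),
)

# Each 0 starts a new key, so a run of 1s inherits the key of the 0 before it.
key = df["LabelId"].eq(0).cumsum()

# Keep only the LabelId == 1 rows, then split them by key. Groups made up
# solely of 0 rows simply do not appear in the result.
ones = df[df["LabelId"] == 1]
parts = [part for _, part in ones.groupby(key.loc[ones.index])]

for part in parts:
    print(part, end="\n\n")
```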

Answer 2

If the dataframe is not large, you can do something simple:


Answer 3

Here is one way, though it is a bit messy.

from itertools import groupby
import numpy as np

acc = np.cumsum([len(list(g)) for k, g in groupby(df['LabelId'])])

i = [(a, b) for a, b in zip(acc, acc[1:])][::2]

dfs = [df.iloc[m:n, :] for m, n in i]

# [   velocity  LabelId
# 1         4        1
# 2       180        1
# 3         2        1,
#     velocity  LabelId
# 6         4        1
# 7        70        1
# 8         2        1]
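
Note that the `[::2]` step assumes the frame begins with a run of 0s, so that every second run is a run of 1s. Below is a hedged variant that checks each run's value instead, so it does not depend on how the frame starts. The names `start`, `length`, and `dfs` are illustrative, and the frame is a cut-down reconstruction of the question's data that deliberately begins with a 1.

```python
from itertools import groupby

import pandas as pd

# Illustrative frame that starts with a run of 1s rather than a 0.
df = pd.DataFrame(
    {
        "velocity": [4, 180, 2, 5, 3, 4, 70, 2, 5],
        "LabelId": [1, 1, 1, 0, 0, 1, 1, 1, 0],
    }
)

# Walk the consecutive runs of LabelId, tracking where each run starts.
dfs = []
start = 0
for value, run in groupby(df["LabelId"]):
    length = sum(1 for _ in run)
    if value == 1:
        # Positional slice covering exactly this run of 1s.
        dfs.append(df.iloc[start:start + length])
    start += length

for part in dfs:
    print(part, end="\n\n")
```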

Answer 4

You don't need a loop, just some tricky logic with cumulative sums:

from io import StringIO

import numpy
import pandas

data = StringIO("""\
eventTime                 velocity     LabelId
2017-08-19 12:53:55.050         3        0
2017-08-19 12:53:55.100         4        1
2017-08-19 12:53:55.150       180        1
2017-08-19 12:53:55.200         2        1
2017-08-19 12:53:55.250         5        0
2017-08-19 12:53:55.050         3        0
2017-08-19 12:53:55.100         4        1
2017-08-19 12:53:55.150        70        1
2017-08-19 12:53:55.200         2        1
2017-08-19 12:53:55.250         5        0
""")

df = (
    pandas.read_table(data, sep=r'\s\s+', engine='python')  # raw string; a regex separator needs the python engine
        .assign(diff=lambda df: df['LabelId'].diff())
        .assign(group=lambda df: numpy.where(
            (df['diff'] == 1).cumsum() == (df['diff'].shift(-1) == -1).shift(1).cumsum(),
            0,
            (df['diff'] == 1).cumsum()
        ))
        .query("group > 0")
        .drop(columns='diff')
)

Then, for example, with

print(df[df['group'] == 1])

you get:

                 eventTime  velocity  LabelId  group
1  2017-08-19 12:53:55.100         4        1      1
2  2017-08-19 12:53:55.150       180        1      1
3  2017-08-19 12:53:55.200         2        1      1
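
Since every run already carries its own `group` number, the separate frames asked for in the question can be pulled out with a single groupby. Below is a minimal sketch, assuming a frame shaped like the filtered result above (rebuilt by hand here without eventTime; `pieces` is an illustrative name).

```python
import pandas as pd

# Hand-built stand-in for the filtered frame with its `group` column.
df = pd.DataFrame(
    {
        "velocity": [4, 180, 2, 4, 70, 2],
        "LabelId": [1, 1, 1, 1, 1, 1],
        "group": [1, 1, 1, 2, 2, 2],
    },
    index=[1, 2, 3, 6, 7, 8],
)

# Map each group number to its own sub-frame.
pieces = {number: part for number, part in df.groupby("group")}

print(pieces[2])
```

A dict keyed by group number keeps the lookup explicit; a plain list would work equally well if only the order of the runs matters.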

Selecting different parts of a dataframe

4 answers: