python pandas: conditionally delete the first row of each group

Date: 2017-04-28 07:34:39

Tags: python pandas

Using Python 3.6 and Pandas 0.19.2

I have a DataFrame like this:

   tid                datetime  event  data
0    0 2017-03-22 10:59:59.864  START   NaN
1    0 2017-03-22 10:59:59.931    END   NaN
2    0 2017-03-22 10:59:59.935  START   NaN
3    1 2017-03-22 10:59:59.939    END   NaN
4    0 2017-03-22 10:59:59.940    END   NaN
5    1 2017-03-22 10:59:59.941  START   NaN
6    1 2017-03-22 10:59:59.945    END   NaN
7    0 2017-03-22 10:59:59.947  START   NaN
8    1 2017-03-22 10:59:59.955  START   NaN

containing the START and END dates of transactions occurring within threads (tid is the thread ID). Unfortunately, the transactions themselves have no unique ID, so I need to group the rows by tid, sort them by date, and then take them two by two so that each transaction has one START and one END.

My current problem is that the initial DataFrame may be missing the first START event for a thread (in the example above, the row at index 3 is an END event with no prior START). I need to drop those END rows, but I don't know how to do it. The same problem exists for a final START row that has no matching END.

Sample input

import pandas as pd
import io
df = pd.read_csv(io.StringIO('''tid;datetime;event
0;2017-03-22 10:59:59.864;START
0;2017-03-22 10:59:59.931;END
0;2017-03-22 10:59:59.935;START
1;2017-03-22 10:59:59.939;END
0;2017-03-22 10:59:59.940;END
1;2017-03-22 10:59:59.941;START
1;2017-03-22 10:59:59.945;END
0;2017-03-22 10:59:59.947;START
1;2017-03-22 10:59:59.955;START'''), sep=';', parse_dates=['datetime'])

Expected output

The same DataFrame, but with the first row for tid 1 dropped, because it is not a START event:

   tid                datetime  event
0    0 2017-03-22 10:59:59.864  START
1    0 2017-03-22 10:59:59.931    END
3    1 2017-03-22 10:59:59.933  START
4    1 2017-03-22 10:59:59.945    END
5    0 2017-03-22 10:59:59.947  START
6    0 2017-03-22 10:59:59.955    END

By the way, bonus points if you end up with something like this:

   tid          start_datetime           stop_datetime
0    0 2017-03-22 10:59:59.864 2017-03-22 10:59:59.931
1    1 2017-03-22 10:59:59.933 2017-03-22 10:59:59.945
2    0 2017-03-22 10:59:59.947 2017-03-22 10:59:59.955

What I have tried

df.sort_values(['tid', 'datetime']).groupby('tid').first().event == 'END' does not carry the original indices of my DataFrame, so I cannot use it to drop rows (or, if it can, how to do so is not obvious to me).
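
One workaround for the lost index, for what it is worth: groupby('tid').head(1) keeps the original index labels, unlike first(). A minimal sketch of that idea, assuming the df built from the sample input above:

srt = df.sort_values(['tid', 'datetime'])

# head(1)/tail(1) preserve the original index labels, so they can feed drop()
bad_end = srt.groupby('tid').head(1).query('event == "END"').index
bad_start = srt.groupby('tid').tail(1).query('event == "START"').index

clean = srt.drop(bad_end.union(bad_start))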

4 answers:

Answer 0 (score: 1)

One way is the following (the custom function could be tidied up to handle more varied inputs, but this works for the sample input):

# label each consecutive run of the same tid with its own group number
df = df.assign(group=(df.tid.diff().fillna(0) != 0).cumsum())

def myTwo(x):
    # the earliest START and the latest END bound the transaction in this run
    starttime = x.query('event == "START"')['datetime'].min()
    endtime = x.query('event == "END"')['datetime'].max()
    tid = x.tid.max()
    return pd.Series({'tid': tid, 'start_datetime': starttime, 'end_datetime': endtime})

print(df.groupby('group').apply(myTwo)[['tid', 'start_datetime', 'end_datetime']])

Output:

       tid              start_datetime                end_datetime
group                                                             
0        0  2017-03-22 10:59:59.864000  2017-03-22 10:59:59.931000
1        1  2017-03-22 10:59:59.933000  2017-03-22 10:59:59.945000
2        0  2017-03-22 10:59:59.947000  2017-03-22 10:59:59.955000
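
For reference, the group key built above labels each consecutive run of the same tid. A quick standalone demo of that trick, assuming only pandas:

import pandas as pd

tid = pd.Series([0, 0, 1, 1, 0])
# a new run starts wherever tid differs from the previous value
group = (tid.diff().fillna(0) != 0).cumsum()
print(group.tolist())  # [0, 0, 1, 1, 2] -> one label per run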

Answer 1 (score: 1)

You can use shift + cumsum to create a unique Series for grouping, then a custom function that selects values with query and iat, and finally reorder the columns with reindex_axis:

# a new group starts whenever tid changes from the previous row
a = (df.tid != df.tid.shift()).cumsum()

def f(x):
    # first START and last END of the group bound the transaction
    start = x.query('event == "START"')['datetime'].iat[0]
    end = x.query('event == "END"')['datetime'].iat[-1]
    tid = x.tid.iat[0]
    return pd.Series({'tid': tid, 'start_datetime': start, 'end_datetime': end})

print(df.groupby(a, as_index=False).apply(f)
        .reindex_axis(['tid', 'start_datetime', 'end_datetime'], 1))

   tid              start_datetime                end_datetime
0    0  2017-03-22 10:59:59.864000  2017-03-22 10:59:59.931000
1    1  2017-03-22 10:59:59.933000  2017-03-22 10:59:59.945000
2    0  2017-03-22 10:59:59.947000  2017-03-22 10:59:59.955000
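
Note: reindex_axis was deprecated in later pandas versions (around 0.21) and eventually removed; on newer pandas the same column reordering can be written with reindex instead:

print(df.groupby(a, as_index=False).apply(f)
        .reindex(columns=['tid', 'start_datetime', 'end_datetime']))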

Another solution uses boolean indexing instead of query (it may be faster; query does better on larger DataFrames):

a = (df.tid != df.tid.shift()).cumsum()

def f(x):
    # build the masks on x (the group), not on the full df, so they align
    start = x.loc[x.event == "START", 'datetime'].iat[0]
    end = x.loc[x.event == "END", 'datetime'].iat[-1]

    tid = x.tid.iat[0]
    return pd.Series({'tid': tid, 'start_datetime': start, 'end_datetime': end})

print(df.groupby(a, as_index=False).apply(f)
        .reindex_axis(['tid', 'start_datetime', 'end_datetime'], 1))

   tid              start_datetime                end_datetime
0    0  2017-03-22 10:59:59.864000  2017-03-22 10:59:59.931000
1    1  2017-03-22 10:59:59.933000  2017-03-22 10:59:59.945000
2    0  2017-03-22 10:59:59.947000  2017-03-22 10:59:59.955000

Answer 2 (score: 1)

Here is another approach, using a groupby() strategy based on this answer:

import numpy as np

# boolean mask: True where a row repeats the previous row's event
# (an END following an END, or a START following a START)
def valid_event(x):
    if x.name:
        return df.loc[x.name - 1, 'event'] == x.event
    return False

mask = df.apply(valid_event, axis='columns')

# subset with the mask, then aggregate every 2 rows into one transaction
df2 = df.loc[~mask]
df2 = (df2.groupby(np.arange(len(df2)) // 2)  # group rows pairwise
          .agg({'tid': {'tid': 'first'},
                'datetime': {'start_datetime': 'min',
                             'stop_datetime': 'max'}
               })
      )

df2.columns = df2.columns.droplevel()  # drop the MultiIndex column level

print(df2)

   tid          start_datetime           stop_datetime
0    0 2017-03-22 10:59:59.864 2017-03-22 10:59:59.931
1    1 2017-03-22 10:59:59.933 2017-03-22 10:59:59.945
2    0 2017-03-22 10:59:59.947 2017-03-22 10:59:59.955
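
A side note: the nested-dict (renaming) form of agg used above was deprecated in later pandas versions and eventually removed. On pandas 0.25 and newer, the same result can be reached with named aggregation; a sketch reusing the mask from above:

out = (df.loc[~mask]
         .reset_index(drop=True)
         .groupby(lambda i: i // 2)  # pair consecutive rows
         .agg(tid=('tid', 'first'),
              start_datetime=('datetime', 'min'),
              stop_datetime=('datetime', 'max')))
print(out)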

Answer 3 (score: 0)

I managed to partially solve my problem this way:

import numpy as np

# order events by thread id and datetime
df = df.sort_values(['tid', 'datetime']).reset_index(drop=True)
# then group by tid
for tid, group in df.groupby('tid'):
    # for each group, drop the first line if it is an END event
    head = group.head(1).iloc[0]
    if head.event == 'END':
        df.drop(head.name, inplace=True)
    # and drop the last line if it is a START event
    tail = group.tail(1).iloc[0]
    if tail.event == 'START':
        df.drop(tail.name, inplace=True)

# take rows 2 by 2: each pair is one START and one END event, which can be aggregated
df.groupby(np.arange(len(df)) // 2).agg({'tid': 'first', 'datetime': {'start': 'min', 'stop': 'max'}})
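
Note that, as in the previous answer, the nested-dict agg returns two column levels; one more step flattens them into the bonus format (a sketch, assuming the columns come out in the order tid, start, stop):

pairs = df.groupby(np.arange(len(df)) // 2).agg(
    {'tid': 'first', 'datetime': {'start': 'min', 'stop': 'max'}})
pairs.columns = ['tid', 'start_datetime', 'stop_datetime']  # flatten both levels
print(pairs)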