Using Python 3.6 and Pandas 0.19.2.
I have a DataFrame like this:
tid datetime event data
0 0 2017-03-22 10:59:59.864 START NaN
1 0 2017-03-22 10:59:59.931 END NaN
2 0 2017-03-22 10:59:59.935 START NaN
3 1 2017-03-22 10:59:59.939 END NaN
4 0 2017-03-22 10:59:59.940 END NaN
5 1 2017-03-22 10:59:59.941 START NaN
6 1 2017-03-22 10:59:59.945 END NaN
7 0 2017-03-22 10:59:59.947 START NaN
8 1 2017-03-22 10:59:59.955 START NaN
It contains the start and end dates of transactions that happen inside threads (tid is the thread ID). Unfortunately, the transactions themselves have no unique ID, so I need to group these rows by tid, sort them by datetime, and then take them 2 by 2, so that each transaction gets one START and one END.
My current problem is that my initial DataFrame may be missing the first START event of a thread (in the example above, the row at index 3 is an END event with no previous START). I need to drop those END rows, but I don't know how. The same problem exists for a last START row that has no matching END.
Example input
import pandas as pd
import io
df = pd.read_csv(io.StringIO('''tid;datetime;event
0;2017-03-22 10:59:59.864;START
0;2017-03-22 10:59:59.931;END
0;2017-03-22 10:59:59.935;START
1;2017-03-22 10:59:59.939;END
0;2017-03-22 10:59:59.940;END
1;2017-03-22 10:59:59.941;START
1;2017-03-22 10:59:59.945;END
0;2017-03-22 10:59:59.947;START
1;2017-03-22 10:59:59.955;START'''), sep=';', parse_dates=['datetime'])
Expected output
The same DataFrame, but with the first row for tid 1 dropped, since it is an END event rather than a START:
tid datetime event
0 0 2017-03-22 10:59:59.864 START
1 0 2017-03-22 10:59:59.931 END
3 1 2017-03-22 10:59:59.933 START
4 1 2017-03-22 10:59:59.945 END
5 0 2017-03-22 10:59:59.947 START
6 0 2017-03-22 10:59:59.955 END
By the way, bonus points if you can end up with something like this:
tid start_datetime stop_datetime
0 0 2017-03-22 10:59:59.864 2017-03-22 10:59:59.931
1 1 2017-03-22 10:59:59.933 2017-03-22 10:59:59.945
2 0 2017-03-22 10:59:59.947 2017-03-22 10:59:59.955
What I have tried
df.sort(['tid', 'datetime']).groupby('tid').first().event == 'END'
does not contain the original index of my DataFrame, so I cannot use it to drop rows (or, if I can, it is not obvious to me how).
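The closest sketch I have so far relies on groupby().head(1) / groupby().tail(1) keeping the original index (untested beyond the sample input, and a single pass only removes one stray row per end of each tid):

srt = df.sort_values(['tid', 'datetime'])
# first and last row of each tid, with their original index preserved
firsts = srt.groupby('tid').head(1)
lasts = srt.groupby('tid').tail(1)
# leading ENDs have no previous START; trailing STARTs have no matching END
bad_ends = firsts.index[firsts.event == 'END']
bad_starts = lasts.index[lasts.event == 'START']
df = df.drop(bad_ends.union(bad_starts))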
Answer 0 (score: 1)
One approach is the following (the custom function could be tidied up to handle more varied inputs, but this works for the sample input):
df = df.assign(group=(df.tid.diff().fillna(0) != 0).cumsum())

def myTwo(x):
    starttime = x.query('event == "START"')['datetime'].min()
    endtime = x.query('event == "END"')['datetime'].max()
    tid = x.tid.max()
    return pd.Series({'tid': tid, 'start_datetime': starttime, 'end_datetime': endtime})

print(df.groupby('group').apply(myTwo)[['tid', 'start_datetime', 'end_datetime']])
Output:
tid start_datetime end_datetime
group
0 0 2017-03-22 10:59:59.864000 2017-03-22 10:59:59.931000
1 1 2017-03-22 10:59:59.933000 2017-03-22 10:59:59.945000
2 0 2017-03-22 10:59:59.947000 2017-03-22 10:59:59.955000
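For example, one way to make the function more defensive (a sketch; I am assuming "more varied inputs" means groups that lack a START or an END entirely, and the hypothetical myTwoSafe below reuses the group column from above) is to emit NaT for incomplete groups and drop them afterwards:

def myTwoSafe(x):
    starts = x.query('event == "START"')['datetime']
    ends = x.query('event == "END"')['datetime']
    if starts.empty or ends.empty:
        # incomplete transaction: no usable START/END pair in this group
        return pd.Series({'tid': x.tid.max(),
                          'start_datetime': pd.NaT,
                          'end_datetime': pd.NaT})
    return pd.Series({'tid': x.tid.max(),
                      'start_datetime': starts.min(),
                      'end_datetime': ends.max()})

print(df.groupby('group').apply(myTwoSafe)
        .dropna()[['tid', 'start_datetime', 'end_datetime']])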
Answer 1 (score: 1)
You can use shift + cumsum to create a unique Series for grouping, then apply a custom function that selects with query and iat, and finally reorder the columns with reindex_axis:
a = (df.tid != df.tid.shift()).cumsum()

def f(x):
    start = x.query('event == "START"')['datetime'].iat[0]
    end = x.query('event == "END"')['datetime'].iat[-1]
    tid = x.tid.iat[0]
    return pd.Series({'tid': tid, 'start_datetime': start, 'end_datetime': end})

print(df.groupby(a, as_index=False).apply(f)
        .reindex_axis(['tid', 'start_datetime', 'end_datetime'], 1))
tid start_datetime end_datetime
0 0 2017-03-22 10:59:59.864000 2017-03-22 10:59:59.931000
1 1 2017-03-22 10:59:59.933000 2017-03-22 10:59:59.945000
2 0 2017-03-22 10:59:59.947000 2017-03-22 10:59:59.955000
Another solution uses boolean indexing instead of query (it is probably faster here; query only pays off on larger DataFrames):
a = (df.tid != df.tid.shift()).cumsum()

def f(x):
    start = x.loc[x.event == "START", 'datetime'].iat[0]
    end = x.loc[x.event == "END", 'datetime'].iat[-1]
    tid = x.tid.iat[0]
    return pd.Series({'tid': tid, 'start_datetime': start, 'end_datetime': end})

print(df.groupby(a, as_index=False).apply(f)
        .reindex_axis(['tid', 'start_datetime', 'end_datetime'], 1))
tid start_datetime end_datetime
0 0 2017-03-22 10:59:59.864000 2017-03-22 10:59:59.931000
1 1 2017-03-22 10:59:59.933000 2017-03-22 10:59:59.945000
2 0 2017-03-22 10:59:59.947000 2017-03-22 10:59:59.955000
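A side note: reindex_axis works on the Pandas 0.19.2 from the question, but it was later deprecated and eventually removed; on a recent pandas the last step would be the equivalent reindex call:

print(df.groupby(a, as_index=False).apply(f)
        .reindex(columns=['tid', 'start_datetime', 'end_datetime']))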
Answer 2 (score: 1)
Here is another approach, with a groupby() strategy based on this answer:
import numpy as np

# make boolean mask to check for valid event entries
def valid_event(x):
    if x.name:
        return df.loc[x.name - 1, 'event'] == x.event
    return False

mask = df.apply(valid_event, axis='columns')

# subset with mask, then group every 2 rows
df = df.loc[~mask]
df = (df.groupby(np.arange(len(df)) // 2)
        .agg({'tid': {'tid': 'first'},
              'datetime': {'start_datetime': 'min',
                           'stop_datetime': 'max'}}))
df.columns = df.columns.droplevel()  # drop multi-index cols
print(df)
tid start_datetime stop_datetime
0 0 2017-03-22 10:59:59.864 2017-03-22 10:59:59.931
1 1 2017-03-22 10:59:59.933 2017-03-22 10:59:59.945
2 0 2017-03-22 10:59:59.947 2017-03-22 10:59:59.955
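Incidentally, the row-wise apply that builds the mask could be replaced by a shift-based comparison, which should behave identically (a sketch: the first row is compared against the NaN produced by shift and comes out False, matching the x.name check above):

# True where a row repeats the previous row's event
mask = df.event == df.event.shift()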
Answer 3 (score: 0)
I managed to partially solve my problem this way:
import numpy as np

# order events by thread id and datetime
df = df.sort_values(['tid', 'datetime']).reset_index(drop=True)

# then group by tid
for tid, group in df.groupby('tid'):
    # for each group, drop the first line if it is an END event
    head = group.head(1).iloc[0]
    if head.event == 'END':
        df.drop(head.name, inplace=True)
    # and drop the last line if it is a START event
    tail = group.tail(1).iloc[0]
    if tail.event == 'START':
        df.drop(tail.name, inplace=True)

# take lines 2 by 2: each pair is a START and an END event that can be aggregated
df.groupby(np.arange(len(df)) // 2).agg({'tid': 'first', 'datetime': {'start': 'min', 'stop': 'max'}})
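As in the answer above, the nested-dict agg spec leaves a column MultiIndex behind; here is a sketch of flattening it into the bonus layout from the question (I have nested the tid spec too, so that droplevel leaves a clean tid column):

result = (df.groupby(np.arange(len(df)) // 2)
            .agg({'tid': {'tid': 'first'},
                  'datetime': {'start_datetime': 'min',
                               'stop_datetime': 'max'}}))
result.columns = result.columns.droplevel()  # e.g. ('datetime', 'start_datetime') -> 'start_datetime'
print(result[['tid', 'start_datetime', 'stop_datetime']])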