I have a Pandas dataframe that looks like this:
A B C Stime Etime
1220627 a 10.0 18:00:00 18:09:59
1220627 a 12.0 18:15:00 18:26:59
1220683 b 3.0 18:36:00 18:38:59
1220683 a 3.0 18:36:00 18:38:59
1220732 a 59.0 18:00:00 18:58:59
1220760 A 16.0 18:24:00 18:39:59
1220760 a 16.0 18:24:00 18:39:59
1220760 A 19.0 18:40:00 18:58:59
1220760 b 19.0 18:40:00 18:58:59
1220760 a 19.0 18:40:00 18:58:59
1220775 a 3.0 18:03:00 18:05:59
The Stime and Etime cols are of datetime type.
C is the number of minutes between Stime and Etime.
The A col is a household ID and the B col is the ID of a person within the household
(so cols A and B together identify a unique person).
What I need to do is update the table so that, whenever a person's Stime comes right after their Etime, I unite the 2 rows and update C.
For example here, for person a in HH 1220760, the first Etime is 18:39:59 and the second Stime is 18:40:00, which comes right after 18:39:59, so I want to unite these rows and update this person's C to 35 (16 + 19).
I tried using groupby, but I don't know how to add the condition that the Stime comes right after the Etime.
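To make the condition concrete, here is a minimal sketch of the adjacency test (my illustration, not code from the question; it assumes Stime and Etime have already been parsed to datetimes and the rows are sorted by A, B, Stime):

import pandas as pd

# True where a row's Stime is exactly one second after the same person's previous Etime
follows_previous = df['Stime'] == (df.groupby(['A', 'B'])['Etime'].shift() + pd.Timedelta(seconds=1))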
Answer 0 (score: 3)
If we add one second to the Etime, then we can find the rows to join by grouping by ['A', 'B'] and, for each group, comparing the shifted Etime with the next Stime:

df['Etime'] += pd.Timedelta(seconds=1)
df = df.sort_values(by=['A', 'B', 'Stime'])
df['keep'] = df.groupby(['A','B'])['Etime'].shift(1) != df['Stime']
# A B C Etime Stime keep
# 0 1220627 a 10.0 2016-05-29 18:10:00 2016-05-29 18:00:00 True
# 1 1220627 a 12.0 2016-05-29 18:27:00 2016-05-29 18:15:00 True
# 3 1220683 a 3.0 2016-05-29 18:39:00 2016-05-29 18:36:00 True
# 2 1220683 b 3.0 2016-05-29 18:39:00 2016-05-29 18:36:00 True
# 4 1220732 a 59.0 2016-05-29 18:59:00 2016-05-29 18:00:00 True
# 5 1220760 A 16.0 2016-05-29 18:40:00 2016-05-29 18:24:00 True
# 7 1220760 A 19.0 2016-05-29 18:59:00 2016-05-29 18:40:00 False
# 12 1220760 a 0.0 2016-05-29 18:10:00 2016-05-29 18:00:00 True
# 6 1220760 a 16.0 2016-05-29 18:40:00 2016-05-29 18:24:00 True
# 9 1220760 a 19.0 2016-05-29 18:59:00 2016-05-29 18:40:00 False
# 11 1220760 a 11.0 2016-05-29 19:10:00 2016-05-29 18:59:00 False
# 8 1220760 b 19.0 2016-05-29 18:59:00 2016-05-29 18:40:00 True
# 10 1220775 a 3.0 2016-05-29 18:06:00 2016-05-29 18:03:00 True
We wish to keep the rows where keep is True, and remove the rows where keep is False, except that we also want to update the Etime (and hence C) as appropriate.
It would be nice if we could assign a "group number" to each row, so that we could group by ['A', 'B', 'group_number']. In fact we can: all we need to do is apply cumsum to the keep column:

df['group_number'] = df.groupby(['A','B'])['keep'].cumsum()
# A B C Etime Stime keep group_number
# 0 1220627 a 10.0 2016-05-29 18:10:00 2016-05-29 18:00:00 True 1.0
# 1 1220627 a 12.0 2016-05-29 18:27:00 2016-05-29 18:15:00 True 2.0
# 3 1220683 a 3.0 2016-05-29 18:39:00 2016-05-29 18:36:00 True 1.0
# 2 1220683 b 3.0 2016-05-29 18:39:00 2016-05-29 18:36:00 True 1.0
# 4 1220732 a 59.0 2016-05-29 18:59:00 2016-05-29 18:00:00 True 1.0
# 5 1220760 A 16.0 2016-05-29 18:40:00 2016-05-29 18:24:00 True 1.0
# 7 1220760 A 19.0 2016-05-29 18:59:00 2016-05-29 18:40:00 False 1.0
# 12 1220760 a 0.0 2016-05-29 18:10:00 2016-05-29 18:00:00 True 1.0
# 6 1220760 a 16.0 2016-05-29 18:40:00 2016-05-29 18:24:00 True 2.0
# 9 1220760 a 19.0 2016-05-29 18:59:00 2016-05-29 18:40:00 False 2.0
# 11 1220760 a 11.0 2016-05-29 19:10:00 2016-05-29 18:59:00 False 2.0
# 8 1220760 b 19.0 2016-05-29 18:59:00 2016-05-29 18:40:00 True 1.0
# 10 1220775 a 3.0 2016-05-29 18:06:00 2016-05-29 18:03:00 True 1.0
Now the desired result can be found by grouping by ['A', 'B', 'group_number'] and taking the minimum Stime and the maximum Etime of each group:
result = df.groupby(['A','B', 'group_number']).agg({'Stime':'min', 'Etime':'max'})
Stime Etime
A B group_number
1220627 a 1.0 2016-05-29 18:00:00 2016-05-29 18:10:00
2.0 2016-05-29 18:15:00 2016-05-29 18:27:00
1220683 a 1.0 2016-05-29 18:36:00 2016-05-29 18:39:00
b 1.0 2016-05-29 18:36:00 2016-05-29 18:39:00
1220732 a 1.0 2016-05-29 18:00:00 2016-05-29 18:59:00
1220760 A 1.0 2016-05-29 18:24:00 2016-05-29 18:59:00
a 1.0 2016-05-29 18:00:00 2016-05-29 18:10:00
2.0 2016-05-29 18:24:00 2016-05-29 19:10:00
b 1.0 2016-05-29 18:40:00 2016-05-29 18:59:00
1220775 a 1.0 2016-05-29 18:03:00 2016-05-29 18:06:00
Putting it all together,
import numpy as np
import pandas as pd
df = pd.DataFrame(
{'A': [1220627, 1220627, 1220683, 1220683, 1220732, 1220760, 1220760,
1220760, 1220760, 1220760, 1220775, 1220760, 1220760],
'B': ['a', 'a', 'b', 'a', 'a', 'A', 'a', 'A', 'b', 'a', 'a', 'a', 'a'],
'C': [10.0, 12.0, 3.0, 3.0, 59.0, 16.0, 16.0, 19.0, 19.0, 19.0, 3.0, 11.0, 0],
'Stime': ['18:00:00', '18:15:00', '18:36:00', '18:36:00', '18:00:00',
'18:24:00', '18:24:00', '18:40:00', '18:40:00', '18:40:00',
'18:03:00', '18:59:00', '18:00:00'],
'Etime': ['18:09:59', '18:26:59', '18:38:59', '18:38:59', '18:58:59',
'18:39:59', '18:39:59', '18:58:59', '18:58:59', '18:58:59',
'18:05:59', '19:09:59', '18:09:59'],})
for col in ['Stime', 'Etime']:
df[col] = pd.to_datetime(df[col])
df['Etime'] += pd.Timedelta(seconds=1)
df = df.sort_values(by=['A', 'B', 'Stime'])
df['keep'] = df.groupby(['A','B'])['Etime'].shift(1) != df['Stime']
df['group_number'] = df.groupby(['A','B'])['keep'].cumsum()
result = df.groupby(['A','B', 'group_number']).agg({'Stime':'min', 'Etime':'max'})
result = result.reset_index()
result['C'] = (result['Etime']-result['Stime']).dt.total_seconds() / 60.0
result = result[['A', 'B', 'C', 'Stime', 'Etime']]
print(result)
yields

A B C Stime Etime
0 1220627 a 10.0 2016-05-29 18:00:00 2016-05-29 18:10:00
1 1220627 a 12.0 2016-05-29 18:15:00 2016-05-29 18:27:00
2 1220683 a 3.0 2016-05-29 18:36:00 2016-05-29 18:39:00
3 1220683 b 3.0 2016-05-29 18:36:00 2016-05-29 18:39:00
4 1220732 a 59.0 2016-05-29 18:00:00 2016-05-29 18:59:00
5 1220760 A 35.0 2016-05-29 18:24:00 2016-05-29 18:59:00
6 1220760 a 10.0 2016-05-29 18:00:00 2016-05-29 18:10:00
7 1220760 a 46.0 2016-05-29 18:24:00 2016-05-29 19:10:00
8 1220760 b 19.0 2016-05-29 18:40:00 2016-05-29 18:59:00
9 1220775 a 3.0 2016-05-29 18:03:00 2016-05-29 18:06:00
One advantage of using half-open intervals of the form [start, end) instead of fully-closed intervals [start, end] is that when two intervals adjoin, the end of one equals the start of the next. Another advantage is that the number of minutes in a half-open interval is simply end - start; with fully-closed intervals the formula becomes end - start + 1. Python's built-in range and list slicing syntax use half-open intervals for these same reasons, so I recommend using half-open intervals in the DataFrame too.
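As a quick numeric check of the two formulas, a small sketch using the first row of the sample data (timestamps taken from the question, date taken from the output above):

import pandas as pd

start = pd.Timestamp('2016-05-29 18:00:00')
end_closed = pd.Timestamp('2016-05-29 18:09:59')       # fully-closed interval [start, end]
end_half_open = end_closed + pd.Timedelta(seconds=1)   # half-open interval [start, end)

print((end_closed - start + pd.Timedelta(seconds=1)).total_seconds() / 60)  # 10.0 minutes
print((end_half_open - start).total_seconds() / 60)                         # 10.0 minutes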
Answer 1 (score: 1)
How about this approach?
In [68]: df.groupby(['A','B', df.Stime - df['Etime'].shift() <= pd.Timedelta('1S')], as_index=False)['C'].sum()
Out[68]:
A B C
0 1220627 a 22.0
1 1220683 a 3.0
2 1220683 b 3.0
3 1220732 a 59.0
4 1220760 A 35.0
5 1220760 a 35.0
6 1220760 b 19.0
7 1220775 a 3.0
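For the Timedelta comparison above to work, Stime and Etime need to be actual datetimes and the rows need to be sorted by person and start time; a minimal preparation sketch (my assumption, not part of the original answer, using the string columns from the question) might look like:

import pandas as pd

# assumed preparation before the one-liner
df['Stime'] = pd.to_datetime(df['Stime'], format='%H:%M:%S')
df['Etime'] = pd.to_datetime(df['Etime'], format='%H:%M:%S')
df = df.sort_values(['A', 'B', 'Stime']).reset_index(drop=True)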
Answer 2 (score: 0)
OK, I think I have a solution, but it's very rough and I'm sure someone can improve on it.
Assuming df = the data you provided above:
df['Stime'] = pd.to_datetime(df['Stime'], format='%H:%M:%S')  # needs to be converted to datetime
df['Etime'] = pd.to_datetime(df['Etime'], format='%H:%M:%S')  # needs to be converted to datetime
df = df.sort_values(['A', 'B', 'Stime'])  # data needs to be sorted by unique person : Stime
df = df.reset_index(drop=True)
df = df.reset_index()

def new_person(row):
    # 'Yes' when the (A, B) pair differs from the row above
    if row.name > 0:
        if row['A'] != df.iloc[row.name - 1]['A'] or row['B'] != df.iloc[row.name - 1]['B']:
            return 'Yes'

def update(row):
    # cumulative C when this row's Stime directly follows the previous row's Etime
    if row.name > 0:
        if row['B'] == df.iloc[row.name - 1]['B']:
            gap = row['Stime'] - df.iloc[row.name - 1]['Etime']
            if pd.Timedelta(seconds=0) <= gap < pd.Timedelta(seconds=2):
                return df.groupby(['A', 'B'])['C'].cumsum().iloc[row.name]

def rewrite(row):
    # use the cumulative value where one was computed, otherwise keep C
    if row['update'] > 0:
        return row['update']
    else:
        return row['C']

df['new_person'] = df.apply(new_person, axis=1)  # adds column where value = 'Yes' if person is not the same as row above
df['update'] = df.apply(update, axis=1)          # adds a column 'update' to allow for a cumulative sum rewritten to 'C' in rewrite function
print(df)

df['Stime'] = df['Stime'].dt.time  # removes date from datetime
df['Etime'] = df['Etime'].dt.time  # removes date from datetime
df['C'] = df.apply(rewrite, axis=1)  # rewrites values for 'C' column

# hacky way of combining idxmax and indices of rows where the person is 'new'
updated = df.groupby(['A', 'B'])['C'].agg(pd.Series.idxmax).values
not_updated = df['new_person'].isnull().tolist()
combined = [x for x in df.index if (x in updated or x in not_updated)]
df = df.iloc[combined]
df = df.drop(['new_person', 'update', 'index'], axis=1)
print(df)
Apologies for the extremely hacky answer, but I think it should achieve what you need. Not sure how well it performs if the dataframe is very large.
Resulting dataframe:
A B C Stime Etime
0 1220627 a 10 18:00:00 18:09:59
1 1220627 a 12 18:15:00 18:26:59
2 1220683 a 3 18:36:00 18:38:59
3 1220683 b 3 18:36:00 18:38:59
4 1220732 a 59 18:00:00 18:58:59
6 1220760 A 35 18:40:00 18:58:59
9 1220760 a 46 18:59:00 18:09:59
10 1220760 b 19 18:40:00 18:58:59
11 1220775 a 3 18:03:00 18:05:59
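On the performance concern above: the per-person cumulative sum is recomputed inside update for every row it processes. A small sketch of hoisting it out of the row-wise apply (same column names and otherwise the same logic; an assumption on my part, not part of the original answer):

# compute the per-person cumulative sum of C once, outside the row-wise apply
cum_c = df.groupby(['A', 'B'])['C'].cumsum()

def update(row):
    if row.name > 0:
        if row['B'] == df.iloc[row.name - 1]['B']:
            gap = row['Stime'] - df.iloc[row.name - 1]['Etime']
            if pd.Timedelta(seconds=0) <= gap < pd.Timedelta(seconds=2):
                return cum_c.iloc[row.name]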