Given a dataframe like the following:
import pandas as pd
import datetime
df = pd.DataFrame([[2, 3],[2, 1],[2, 1],[3, 4],[3, 1],[3, 1],[3, 1],[3, 1],[4, 2],[4, 1],[4, 1],[4, 1]], columns=['id', 'count'])
df['date'] = [datetime.datetime.strptime(x,'%Y-%m-%d %H:%M:%S') for x in
['2016-12-28 15:17:00','2016-12-28 15:29:00','2017-01-05 09:32:00','2016-12-03 18:10:00','2016-12-10 11:31:00',
'2016-12-14 09:32:00','2016-12-18 09:31:00','2016-12-22 09:32:00','2016-11-28 15:31:00','2016-12-01 16:11:00',
'2016-12-10 09:31:00','2016-12-13 12:06:00']]
I want to group the rows based on this condition: for rows with the same id, if their dates are less than 4 days apart they belong to the same group, otherwise a new group starts. The group label goes into a new column new_id, which I then group by to count the rows and compute the sum. For example, for id 2 the two rows on 2016-12-28 fall into one group, while the row on 2017-01-05 is more than 4 days later and starts a new group.
I get the desired result with the code below, but it is too slow. How can I make it more efficient?
df.sort_values(by=['id', 'date'], ascending = [True, False], inplace = True)
df['id'] = df['id'].astype(str)
df['id_up'] = df['id'].shift(-1)
df['id_down'] = df['id'].shift(1)
df['date_up'] = df['date'].shift(-1)
df['date_diff'] = df.apply(lambda df: (df['date'] - df['date_up'])/datetime.timedelta(days=1) if df['id'] == df['id_up'] else 0, axis=1)
df = df.reset_index()
df = df.drop(['index','id_up','id_down','date_up'],axis=1)
df['new'] = ''
for i in range(df.shape[0]):
    if i == 0:
        df.loc[i,'new'] = 1
    else:
        if df.loc[i,'id'] != df.loc[i-1,'id']:
            df.loc[i,'new'] = 1
        else:
            if df.loc[i-1,'date_diff'] <= 4:
                df.loc[i,'new'] = df.loc[i-1,'new']
            else:
                df.loc[i,'new'] = df.loc[i-1,'new'] + 1
df['new'] = df['id'].astype(str) + '-' + df['new'].astype(str)
df1 = df.groupby('new')['date'].min()
df1 = df1.reset_index()
df1.rename(columns={"date": "first_date"}, inplace=True)
df = pd.merge(df, df1, on='new')
df1 = df.groupby('new')['date'].max()
df1 = df1.reset_index()
df1.rename(columns={"date": "last_date"}, inplace=True)
df = pd.merge(df, df1, on='new')
df1 = df.groupby('new')['count'].sum()
df1 = df1.reset_index()
df1.rename(columns={"count": "count_sum"}, inplace=True)
df = pd.merge(df, df1, on='new')
print(df)
Answer 0 (score: 1)
To get the new column, you can do the following:
df.sort_values(by=['id', 'date'], ascending = [True, False], inplace = True)
groups = df.groupby('id')
# mask where the date differences exceed threshold
df['new'] = groups.date.diff().abs() > pd.to_timedelta(4, unit='D')
# group within each id
df['new'] = groups['new'].cumsum().astype(int) + 1
# concatenate `id` and `new`:
df['new'] = df['id'].astype(str) + '-' + df['new'].astype(str)
# get other columns with groupby
new_groups = df.groupby('new')
df['first_date'] = new_groups.date.transform('min')
df['last_date'] = new_groups.date.transform('max')
df['count_sum'] = new_groups['count'].transform('sum')
Output:
id count date new first_date last_date count_sum
-- ---- ------- ------------------- ----- ------------------- ------------------- -----------
0 2 1 2017-01-05 09:32:00 2-1 2017-01-05 09:32:00 2017-01-05 09:32:00 1
1 2 1 2016-12-28 15:29:00 2-2 2016-12-28 15:17:00 2016-12-28 15:29:00 4
2 2 3 2016-12-28 15:17:00 2-2 2016-12-28 15:17:00 2016-12-28 15:29:00 4
3 3 1 2016-12-22 09:32:00 3-1 2016-12-22 09:32:00 2016-12-22 09:32:00 1
4 3 1 2016-12-18 09:31:00 3-2 2016-12-10 11:31:00 2016-12-18 09:31:00 3
5 3 1 2016-12-14 09:32:00 3-2 2016-12-10 11:31:00 2016-12-18 09:31:00 3
6 3 1 2016-12-10 11:31:00 3-2 2016-12-10 11:31:00 2016-12-18 09:31:00 3
7 3 4 2016-12-03 18:10:00 3-3 2016-12-03 18:10:00 2016-12-03 18:10:00 4
8 4 1 2016-12-13 12:06:00 4-1 2016-12-10 09:31:00 2016-12-13 12:06:00 2
9 4 1 2016-12-10 09:31:00 4-1 2016-12-10 09:31:00 2016-12-13 12:06:00 2
10 4 1 2016-12-01 16:11:00 4-2 2016-11-28 15:31:00 2016-12-01 16:11:00 3
11 4 2 2016-11-28 15:31:00 4-2 2016-11-28 15:31:00 2016-12-01 16:11:00 3
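If you then want one row per group instead of the broadcast columns, a minimal follow-up sketch (assuming pandas >= 0.25 for named aggregation; the result variable name is just illustrative) would be:
# aggregate once per new group: date range and summed count
result = df.groupby('new').agg(
    first_date=('date', 'min'),
    last_date=('date', 'max'),
    count_sum=('count', 'sum'),
).reset_index()
print(result)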
Answer 1 (score: 1)
In pandas, groupby can take a function that maps each row's index to a group label, and that function is called for every row. Using this, we can do the following:
# sort dataframe by id and date in ascending order
df = df.sort_values(["id", "date"]).reset_index(drop=True)
# global variables, used here for convenience of demonstration
lastid = maxdate = None
groupid = 0
def grouper(rowidx):
    global lastid, maxdate, groupid
    row = df.loc[rowidx]
    if lastid != row['id'] or maxdate < row['date']:
        # start a new group
        lastid = row['id']
        maxdate = row['date'] + datetime.timedelta(days=4)
        groupid += 1
    return groupid

# use grouper to split df into groups
for id, group in df.groupby(grouper):
    print("[%s]" % id)
    print(group)
The output of the above with your df is:
[1]
id count date
0 2 3 2016-12-28 15:17:00
1 2 1 2016-12-28 15:29:00
[2]
id count date
2 2 1 2017-01-05 09:32:00
[3]
id count date
3 3 4 2016-12-03 18:10:00
[4]
id count date
4 3 1 2016-12-10 11:31:00
5 3 1 2016-12-14 09:32:00
[5]
id count date
6 3 1 2016-12-18 09:31:00
[6]
id count date
7 3 1 2016-12-22 09:32:00
[7]
id count date
8 4 2 2016-11-28 15:31:00
9 4 1 2016-12-01 16:11:00
[8]
id count date
10 4 1 2016-12-10 09:31:00
11 4 1 2016-12-13 12:06:00
You can use this mechanism to implement arbitrary grouping logic.
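For instance, a rough sketch of feeding the same grouper into an aggregation (note that grouper is stateful, so its globals must be reset before it is reused; the sums variable name is illustrative):
# reset the grouper state before reusing it
lastid = maxdate = None
groupid = 0
# sum the count column within each dynamically determined group
sums = df.groupby(grouper)['count'].sum()
print(sums)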
Answer 2 (score: 0)
Another solution:
df.sort_values(by=['id', 'date'], ascending=[True, False], inplace=True)
interval_date = 4
groups = df.groupby('id')
# interval_date = pd.to_timedelta(4, unit='D')
df['date_diff_down'] = groups.date.diff(-1).abs()/datetime.timedelta(days=1)
df = df.fillna(method='ffill')
df['date_diff_up'] = groups.date.diff(1).abs()/datetime.timedelta(days=1)
df = df.fillna(method='bfill')
df['data_chunk_mark'] = df.apply(lambda df: 0 if df['date_diff_up'] < interval_date else 1, axis=1)
groups = df.groupby('id')
df['new_id'] = groups['data_chunk_mark'].cumsum().astype(int) + 1
df['new_id'] = df['id'].astype(str) + '-' + df['new_id'].astype(str)
new_groups = df.groupby('new_id')
# df['first_date'] = new_groups.date.transform('min')
# df['last_date'] = new_groups.date.transform('max')
df['count_sum'] = new_groups['count'].transform('sum')
print(df)
Output:
id count date date_diff_down date_diff_up \
1 2 1 2017-01-05 09:32:00 7.752083 7.752083
2 2 1 2016-12-28 15:29:00 0.008333 7.752083
0 2 3 2016-12-28 15:17:00 0.008333 0.008333
7 3 1 2016-12-22 09:32:00 4.000694 4.000694
6 3 1 2016-12-18 09:31:00 3.999306 4.000694
5 3 1 2016-12-14 09:32:00 3.917361 3.999306
4 3 1 2016-12-10 11:31:00 6.722917 3.917361
3 3 4 2016-12-03 18:10:00 6.722917 6.722917
11 4 1 2016-12-13 12:06:00 3.107639 3.107639
10 4 1 2016-12-10 09:31:00 8.722222 3.107639
9 4 1 2016-12-01 16:11:00 3.027778 8.722222
8 4 2 2016-11-28 15:31:00 3.027778 3.027778
data_chunk_mark new_id count_sum
1 1 2-2 1
2 1 2-3 4
0 0 2-3 4
7 1 3-2 1
6 1 3-3 3
5 0 3-3 3
4 0 3-3 3
3 1 3-4 4
11 0 4-1 2
10 0 4-1 2
9 1 4-2 3
8 0 4-2 3
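This answer leaves the aggregates as broadcast columns; if you instead want one row per group, a small sketch along the same lines (the summary variable name is illustrative) is:
# one row per new_id: number of rows and summed count
summary = df.groupby('new_id')['count'].agg(['count', 'sum'])
print(summary)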