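I have a DataFrame containing dates, categories, and a column that marks when a one-off event occurred for a category. I want to create a new column holding the time until the event occurs, or some indicator that no event occurs, such as a negative time.
The dataset is large, so I would like something better than brute-forcing it with a loop. Surely pandas has a better way to do this!
In short, if I create a dataset like this: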
import pandas as pd
#create example dataset
data = {'categories':['a','b','c']*4,'dates':[i for i in range(4) for j in range(3)],'event':[0]*3*4}
#add a couple of events
data['event'][4] = 1
data['event'][9] = 1
df = pd.DataFrame(data)
how can I best get output that looks like the following?
    categories  dates  event  time_until
0            a      0      0           3
1            b      0      0           1
2            c      0      0          -1
3            a      1      0           2
4            b      1      1           0
5            c      1      0          -1
6            a      2      0           1
7            b      2      0          -1
8            c      2      0          -1
9            a      3      1           0
10           b      3      0          -1
11           c      3      0          -1
Thanks for your help!
Answer 0 (score: 1)
Use groupby:
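def f(s):
    # Renumber the rows of this category 0, 1, 2, ... so positions act as time steps
    s = s.reset_index(drop=True)
    # Rows where the event occurred
    one = s[s.eq(1)]
    if one.empty:
        # No event in this category: flag every row with -1
        return -1
    # Signed distance from each row to the first event
    return one.index[0] - s.index

df['time_until'] = df.groupby('categories').event.transform(f)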
    categories  dates  event  time_until
0            a      0      0           3
1            b      0      0           1
2            c      0      0          -1
3            a      1      0           2
4            b      1      1           0
5            c      1      0          -1
6            a      2      0           1
7            b      2      0          -1
8            c      2      0          -1
9            a      3      1           0
10           b      3      0          -2
11           c      3      0          -1
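Note that this also computes the distance for rows after the event, so the value keeps decreasing past -1. For example, these events give the following result: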
event = [0, 0, 0, 1, 0, 0]
until = [3, 2, 1, 0, -1, -2]
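As a quick check of that behavior, here is a minimal sketch that reproduces the `until` list for a single sequence with NumPy. The helper name `time_until` is made up for this illustration, and it assumes the first event is the one of interest.

```python
import numpy as np

def time_until(event):
    """Signed distance from each position to the first event (hypothetical helper)."""
    event = np.asarray(event)
    hits = np.flatnonzero(event == 1)
    if hits.size == 0:
        # No event at all: flag every position with -1
        return np.full(event.shape, -1)
    # Positive before the event, 0 at the event, negative after it
    return hits[0] - np.arange(event.size)

until = time_until([0, 0, 0, 1, 0, 0])
print(until.tolist())  # [3, 2, 1, 0, -1, -2]
```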
If you need every negative value to be -1, just clamp the column at the end:
df['time_until'] = df.time_until.where(df.time_until >= -1, -1)
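Since the question mentions a large dataset, the per-group Python function and the final clamp can probably be folded into a few vectorized steps. The sketch below is an alternative rather than this answer's own code: it swaps in `cumcount`, `transform('first')` and `clip`, and it assumes the rows of each category are already ordered by date with one row per date. The names `pos` and `event_pos` are only illustrative.

```python
import pandas as pd

# Example data from the question: events for 'b' at date 1 and 'a' at date 3
data = {'categories': ['a', 'b', 'c'] * 4,
        'dates': [i for i in range(4) for j in range(3)],
        'event': [0] * 12}
data['event'][4] = 1
data['event'][9] = 1
df = pd.DataFrame(data)

# Position of each row within its category (0, 1, 2, ...), used as the time step
pos = df.groupby('categories').cumcount()

# Position of the first event in each category; NaN where the category has none
event_pos = pos.where(df['event'].eq(1)).groupby(df['categories']).transform('first')

# Distance to the event; rows with no event or past the event both end up as -1
df['time_until'] = (event_pos - pos).fillna(-1).clip(lower=-1).astype(int)
print(df)
```

Using row positions instead of the `dates` values keeps the sketch short; if dates can have gaps, subtracting the date values themselves would be the safer choice.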
Answer 1 (score: 0)
An alternative solution:
df.sort_values(by=['categories', 'dates'], ascending=[True, False], inplace=True)
df['tmp'] = df.groupby('categories')['event'].transform('cumsum')
df['time_until'] = df.groupby('categories')['tmp'].transform('cumsum') - 1
df.drop(columns='tmp', inplace=True)
df.sort_values(by=['dates', 'categories'], ascending=[True, True], inplace=True)
Output:
    categories  dates  event  time_until
0            a      0      0           3
1            b      0      0           1
2            c      0      0          -1
3            a      1      0           2
4            b      1      1           0
5            c      1      0          -1
6            a      2      0           1
7            b      2      0          -1
8            c      2      0          -1
9            a      3      1           0
10           b      3      0          -1
11           c      3      0          -1
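To see why the double cumulative sum works, the sketch below traces a single category (b from the example) in reverse date order. It assumes at most one event per category, and the variable names are illustrative only.

```python
import pandas as pd

# Category 'b' from the example, ordered from the latest date (3) to the earliest (0)
event_reversed = pd.Series([0, 0, 1, 0], index=[3, 2, 1, 0], name='event')

# First cumsum: 0 for dates after the event, 1 at the event and for every earlier date
seen = event_reversed.cumsum()

# Second cumsum counts steps back from the event; subtracting 1 puts 0 on the
# event itself and leaves -1 on the rows after it
time_until = seen.cumsum() - 1

print(time_until.sort_index().tolist())  # [1, 0, -1, -1]
```

With more than one event in a category the first cumsum would exceed 1 and the counts would no longer be distances, which is why the single-event assumption matters here.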
Answer 2 (score: -1)
Try something like this:
import pandas as pd
import numpy as np
data = {'categories': ['a', 'b', 'c'] * 4,
        'dates': [i for i in range(4) for j in range(3)],
        'event': [0, 1, 0] * 4}
df = pd.DataFrame(data)
print(df)
# One way
df.loc[df.event == 0, 'Newevents'] = 'Cancelled'
df.loc[df.event != 0, 'Newevents'] = 'Scheduled'
# Another way
conditions = [
    (df['categories'] == "a"),
    (df['categories'] == "b"),
    (df['categories'] == "c")]
choices = ['None', 'Completed', 'Scheduled']
df['NewCategories'] = np.select(conditions, choices, default='black')
print(df)
Output:
    categories  dates  event
0            a      0      0
1            b      0      1
2            c      0      0
3            a      1      0
4            b      1      1
5            c      1      0
6            a      2      0
7            b      2      1
8            c      2      0
9            a      3      0
10           b      3      1
11           c      3      0
    categories  dates  event  Newevents  NewCategories
0            a      0      0  Cancelled           None
1            b      0      1  Scheduled      Completed
2            c      0      0  Cancelled      Scheduled
3            a      1      0  Cancelled           None
4            b      1      1  Scheduled      Completed
5            c      1      0  Cancelled      Scheduled
6            a      2      0  Cancelled           None
7            b      2      1  Scheduled      Completed
8            c      2      0  Cancelled      Scheduled
9            a      3      0  Cancelled           None
10           b      3      1  Scheduled      Completed
11           c      3      0  Cancelled      Scheduled