考虑以下数据框:
df = pd.read_json("""{"week":{"0":1,"1":1,"2":1,"3":1,"4":1,"5":1,"6":2,"7":2,"8":2,"9":2,"10":2,"11":2,"12":3,"13":3,"14":3,"15":3,"16":3,"17":3},"extra_hours":{"0":"01:00:00","1":"00:00:00","2":"01:00:00","3":"01:00:00","4":"00:00:00","5":"01:00:00","6":"01:00:00","7":"01:00:00","8":"01:00:00","9":"01:00:00","10":"00:00:00","11":"01:00:00","12":"01:00:00","13":"02:00:00","14":"01:00:00","15":"02:00:00","16":"00:00:00","17":"00:00:00"},"extra_hours_over":{"0":null,"1":null,"2":null,"3":null,"4":null,"5":null,"6":null,"7":null,"8":null,"9":null,"10":null,"11":null,"12":null,"13":null,"14":null,"15":null,"16":null,"17":null}}""")
df.tail(6)
week extra_hours extra_hours_over
12 3 01:00:00 NaN
13 3 02:00:00 NaN
14 3 01:00:00 NaN
15 3 02:00:00 NaN
16 3 00:00:00 NaN
17 3 00:00:00 NaN
现在,每周extra_hours
的最大数量为4h,这意味着我必须从extra_hour
列中减去30分钟的块,并填充extra_hour_over
列,这样一周,extra_hour
的总和最多为4小时。
因此,在给出示例数据帧的情况下,可能的解决方案(第3周的 )如下:
week extra_hours extra_hours_over
12 3 01:00:00 00:00:00
13 3 01:30:00 00:30:00
14 3 00:30:00 00:30:00
15 3 01:00:00 01:00:00
16 3 00:00:00 00:00:00
17 3 00:00:00 00:00:00
我需要每周总计extra_hours
,检查它经过4h的天数,然后随机减去半小时的数据块。
实现这一目标的最简单/最直接的方法是什么?
答案 0 :(得分:1)
尝试尝试一下您要问的问题。这个想法很简单,尽管代码相当冗长:
1)创建一些帮助变量(分钟,分钟,一周的总计)
2)循环浏览仅包含总和> 240分钟的临时数据集。
3)在循环中,使用random.choice
选择一个从中删除30分钟的时间。
4)将更改应用于分钟和额外的分钟
代码:
df = pd.read_json("""{"week":{"0":1,"1":1,"2":1,"3":1,"4":1,"5":1,"6":2,"7":2,"8":2,"9":2,"10":2,"11":2,"12":3,"13":3,"14":3,"15":3,"16":3,"17":3},"extra_hours":{"0":"01:00:00","1":"00:00:00","2":"01:00:00","3":"01:00:00","4":"00:00:00","5":"01:00:00","6":"01:00:00","7":"01:00:00","8":"01:00:00","9":"01:00:00","10":"00:00:00","11":"01:00:00","12":"01:00:00","13":"02:00:00","14":"01:00:00","15":"02:00:00","16":"00:00:00","17":"00:00:00"},"extra_hours_over":{"0":null,"1":null,"2":null,"3":null,"4":null,"5":null,"6":null,"7":null,"8":null,"9":null,"10":null,"11":null,"12":null,"13":null,"14":null,"15":null,"16":null,"17":null}}""")
df['minutes'] = pd.DatetimeIndex(df['extra_hours']).hour * 60 + pd.DatetimeIndex(df['extra_hours']).minute
df['extra_minutes'] = 0
df['tot_time'] = df.groupby('week')['minutes'].transform('sum')
while not df[df['tot_time'] > 240].empty:
mask = df[(df['minutes']>=30)&(df['tot_time']>240)].groupby('week').apply(lambda x: np.random.choice(x.index)).values
df.loc[mask,'minutes'] -= 30
df.loc[mask,'extra_minutes'] += 30
df['tot_time'] = df.groupby('week')['minutes'].transform('sum')
df['extra_hours_over'] = df['extra_minutes'].apply(lambda x: pd.Timedelta(minutes=x))
df['extra_hours'] = df['minutes'].apply(lambda x: pd.Timedelta(minutes=x))
df.drop(['minutes','extra_minutes'], axis=1).tail(6)
Out[1]:
week extra_hours extra_hours_over tot_time
12 3 00:30:00 00:30:00 240
13 3 01:30:00 00:30:00 240
14 3 00:30:00 00:30:00 240
15 3 01:30:00 00:30:00 240
16 3 00:00:00 00:00:00 240
17 3 00:00:00 00:00:00 240
注意:因为我使用的是np.random.choice
,所以同一观察值可以被选择两次,这会使该观察值改变30分钟以上。