Pandas - compute data from one column into another column

Time: 2019-08-20 21:23:54

Tags: python pandas

Consider the following dataframe:

from io import StringIO

import pandas as pd

# StringIO keeps read_json working on pandas versions that no longer accept a raw JSON string
df = pd.read_json(StringIO("""{"week":{"0":1,"1":1,"2":1,"3":1,"4":1,"5":1,"6":2,"7":2,"8":2,"9":2,"10":2,"11":2,"12":3,"13":3,"14":3,"15":3,"16":3,"17":3},"extra_hours":{"0":"01:00:00","1":"00:00:00","2":"01:00:00","3":"01:00:00","4":"00:00:00","5":"01:00:00","6":"01:00:00","7":"01:00:00","8":"01:00:00","9":"01:00:00","10":"00:00:00","11":"01:00:00","12":"01:00:00","13":"02:00:00","14":"01:00:00","15":"02:00:00","16":"00:00:00","17":"00:00:00"},"extra_hours_over":{"0":null,"1":null,"2":null,"3":null,"4":null,"5":null,"6":null,"7":null,"8":null,"9":null,"10":null,"11":null,"12":null,"13":null,"14":null,"15":null,"16":null,"17":null}}"""))
df.tail(6)

    week extra_hours  extra_hours_over
12     3    01:00:00               NaN
13     3    02:00:00               NaN
14     3    01:00:00               NaN
15     3    02:00:00               NaN
16     3    00:00:00               NaN
17     3    00:00:00               NaN

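Now, the maximum amount of extra_hours per week is 4h. This means I have to subtract 30-minute blocks from the extra_hours column and add them to the extra_hours_over column, so that for each week the sum of extra_hours is at most 4 hours.

So, given the example dataframe, a possible solution (for week 3) would be: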

    week  extra_hours  extra_hours_over
12     3     01:00:00          00:00:00
13     3     01:30:00          00:30:00
14     3     00:30:00          00:30:00
15     3     01:00:00          01:00:00
16     3     00:00:00          00:00:00
17     3     00:00:00          00:00:00

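I need to total extra_hours per week, find the weeks where the total exceeds 4h, and then randomly subtract half-hour blocks from the rows of those weeks.

What is the simplest / most direct way to achieve this?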

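1 Answer:

Answer 0: (score: 1)

Here is an attempt at what you are asking for. The idea is simple, although the code is fairly verbose:

1) Create some helper variables (minutes, extra minutes, weekly total).

2) Loop over a temporary dataset that contains only the rows of weeks whose total is > 240 minutes.

3) Inside the loop, use np.random.choice to pick one row per week from which to remove 30 minutes.

4) Apply the change to the minutes and the extra minutes.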
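
Steps 1 and 2 can be looked at in isolation. The sketch below is only an illustration: it converts the extra_hours strings to minutes with pd.to_timedelta (a different route from the pd.DatetimeIndex one used in the code further down) and counts how many 30-minute blocks each week is over the cap. The data is copied from the question; the names CAP_MINUTES, BLOCK_MINUTES, weekly_total and blocks_over are made up for this example.

```python
import pandas as pd

CAP_MINUTES = 240    # 4h weekly cap from the question
BLOCK_MINUTES = 30   # size of the chunk moved in each step

# Same week / extra_hours values as the question's dataframe
df = pd.DataFrame({
    "week": [1] * 6 + [2] * 6 + [3] * 6,
    "extra_hours": [
        "01:00:00", "00:00:00", "01:00:00", "01:00:00", "00:00:00", "01:00:00",
        "01:00:00", "01:00:00", "01:00:00", "01:00:00", "00:00:00", "01:00:00",
        "01:00:00", "02:00:00", "01:00:00", "02:00:00", "00:00:00", "00:00:00",
    ],
})

# Step 1: helper column holding each duration as whole minutes
df["minutes"] = (
    pd.to_timedelta(df["extra_hours"]).dt.total_seconds() // 60
).astype(int)

# Step 2: weekly totals and the number of 30-minute blocks above the cap
weekly_total = df.groupby("week")["minutes"].sum()
blocks_over = (weekly_total - CAP_MINUTES).clip(lower=0) // BLOCK_MINUTES

print(weekly_total.to_dict())
print(blocks_over.to_dict())
```

With the question's data the weekly totals are 240, 300 and 360 minutes, so week 1 is already within the cap, week 2 is two blocks over and week 3 is four blocks over.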

Code:

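from io import StringIO

import numpy as np
import pandas as pd

# StringIO keeps read_json working on pandas versions that no longer accept a raw JSON string
df = pd.read_json(StringIO("""{"week":{"0":1,"1":1,"2":1,"3":1,"4":1,"5":1,"6":2,"7":2,"8":2,"9":2,"10":2,"11":2,"12":3,"13":3,"14":3,"15":3,"16":3,"17":3},"extra_hours":{"0":"01:00:00","1":"00:00:00","2":"01:00:00","3":"01:00:00","4":"00:00:00","5":"01:00:00","6":"01:00:00","7":"01:00:00","8":"01:00:00","9":"01:00:00","10":"00:00:00","11":"01:00:00","12":"01:00:00","13":"02:00:00","14":"01:00:00","15":"02:00:00","16":"00:00:00","17":"00:00:00"},"extra_hours_over":{"0":null,"1":null,"2":null,"3":null,"4":null,"5":null,"6":null,"7":null,"8":null,"9":null,"10":null,"11":null,"12":null,"13":null,"14":null,"15":null,"16":null,"17":null}}"""))

# Helper columns: extra hours as whole minutes, and the minutes moved to extra_hours_over
df['minutes'] = pd.DatetimeIndex(df['extra_hours']).hour * 60 + pd.DatetimeIndex(df['extra_hours']).minute
df['extra_minutes'] = 0

# Weekly total, repeated on every row of that week
df['tot_time'] = df.groupby('week')['minutes'].transform('sum')

# Repeat while at least one week is still above 240 minutes (4h)
while not df[df['tot_time'] > 240].empty:
    # For each week over the cap, pick one random row that still has at least 30 minutes
    picked = df[(df['minutes'] >= 30) & (df['tot_time'] > 240)].groupby('week').apply(lambda x: np.random.choice(x.index)).values
    # Move one 30-minute block from minutes to extra_minutes on the picked rows
    df.loc[picked, 'minutes'] -= 30
    df.loc[picked, 'extra_minutes'] += 30

    # Refresh the weekly totals before the next check
    df['tot_time'] = df.groupby('week')['minutes'].transform('sum')

# Convert the helper columns back to durations and drop them from the displayed result
df['extra_hours_over'] = df['extra_minutes'].apply(lambda x: pd.Timedelta(minutes=x))
df['extra_hours'] = df['minutes'].apply(lambda x: pd.Timedelta(minutes=x))
df.drop(['minutes','extra_minutes'], axis=1).tail(6)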

Out[1]:
    week    extra_hours     extra_hours_over    tot_time
12  3       00:30:00        00:30:00             240
13  3       01:30:00        00:30:00             240
14  3       00:30:00        00:30:00             240
15  3       01:30:00        00:30:00             240
16  3       00:00:00        00:00:00             240
17  3       00:00:00        00:00:00             240
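
Since the rows are picked at random, the exact values differ from run to run. What should hold on every run is that no time is lost (extra_hours plus extra_hours_over equals the original extra_hours on each row) and that every week ends up at or below 4h. A minimal sketch of such a check follows; check_weekly_cap is a hypothetical helper, not something pandas provides, and it is demonstrated on the week 3 example from the question rather than on a random run.

```python
import pandas as pd


def check_weekly_cap(before, after, cap=pd.Timedelta(hours=4)):
    """Return True if `after` is a valid capped version of `before`.

    Assumes both frames share the same index and that `after` has the
    columns week, extra_hours and extra_hours_over.
    """
    kept = pd.to_timedelta(after["extra_hours"])
    moved = pd.to_timedelta(after["extra_hours_over"])
    original = pd.to_timedelta(before["extra_hours"])

    # Nothing is lost: kept time plus moved time equals the original time
    conserved = ((kept + moved) == original).all()
    # Every week is at or below the cap
    capped = (kept.groupby(after["week"]).sum() <= cap).all()
    return bool(conserved and capped)


# Week 3 of the question: original values and the proposed solution
before = pd.DataFrame({
    "week": [3] * 6,
    "extra_hours": ["01:00:00", "02:00:00", "01:00:00",
                    "02:00:00", "00:00:00", "00:00:00"],
}, index=range(12, 18))
after = pd.DataFrame({
    "week": [3] * 6,
    "extra_hours": ["01:00:00", "01:30:00", "00:30:00",
                    "01:00:00", "00:00:00", "00:00:00"],
    "extra_hours_over": ["00:00:00", "00:30:00", "00:30:00",
                         "01:00:00", "00:00:00", "00:00:00"],
}, index=range(12, 18))

print(check_weekly_cap(before, after))
```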

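Note: because I am using np.random.choice, the same observation can be picked again on a later loop iteration, so a single observation may end up changed by more than 30 minutes.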
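
If the repeated picks matter, one possible variation (not the approach used in the code above) is to drop the loop and sample the blocks directly: expand every row into its 30-minute blocks, then draw exactly the number of excess blocks per week without replacement. A row can still lose several blocks, but never more than it holds. The sketch below assumes a minutes column containing non-negative multiples of 30; cap_weekly_minutes and its parameters are names invented for this example.

```python
import numpy as np
import pandas as pd


def cap_weekly_minutes(df, cap=240, block=30, seed=None):
    """Move random `block`-minute chunks from minutes to extra_minutes
    so that each week's total of minutes is at most `cap`.

    Assumes `minutes` holds non-negative multiples of `block`.
    """
    rng = np.random.default_rng(seed)
    out = df.copy()
    out["extra_minutes"] = 0

    for _, grp in df.groupby("week"):
        excess_blocks = max(int(grp["minutes"].sum()) - cap, 0) // block
        if excess_blocks == 0:
            continue
        # One entry per block, labelled with the index of the row that owns it
        block_owner = np.repeat(
            grp.index.to_numpy(), (grp["minutes"] // block).to_numpy()
        )
        # Draw the excess blocks without replacement
        picked = rng.choice(block_owner, size=excess_blocks, replace=False)
        # Minutes taken from each affected row
        taken = pd.Series(picked).value_counts() * block
        out.loc[taken.index, "minutes"] -= taken
        out.loc[taken.index, "extra_minutes"] += taken

    return out


# Week 3 of the question, already expressed in minutes
df = pd.DataFrame(
    {"week": [3] * 6, "minutes": [60, 120, 60, 120, 0, 0]},
    index=range(12, 18),
)
capped = cap_weekly_minutes(df, seed=0)
print(capped)
```

Passing a seed makes a run reproducible. Which rows lose time still depends on the random draw, but the weekly total of minutes always ends at 240 for this input.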