Pandas: calculating based on multiple rows and conditions

Time: 2020-05-20 12:25:50

Tags: python pandas

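I'm not comfortable with pandas. I need to calculate the time spent by each person at each location, and drop the rows whose Date has no matching pair. My data looks like this: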

Unit    Name    Location    Date    Time
0  K1  Somebody1    LOC1  2020-05-12  07:00
1  K1  Somebody1    LOC1  2020-05-12  20:10
2  K1  Somebody1    LOC1  2020-05-13  06:00
3  K1  Somebody1    LOC1  2020-05-13  20:00
4  K1  Somebody1    LOC1  2020-05-14  06:37
5  K1  Somebody1    LOC2  2020-05-15  07:00
6  K1  Somebody1    LOC2  2020-05-15  20:10
7  K1  Somebody1    LOC2  2020-05-16  06:00
8  K1  Somebody1    LOC2  2020-05-16  20:00
9  K1  Somebody1    LOC2  2020-05-17  06:37
10  K1  Somebody2    LOC2  2020-05-13  07:00
11  K1  Somebody2    LOC2  2020-05-14  10:10
12  K1  Somebody2    LOC2  2020-05-14  16:50
13  K1  Somebody2    LOC2  2020-05-15  05:36
14  K1  Somebody3    LOC1  2020-05-13  07:00
15  K1  Somebody3    LOC1  2020-05-14  10:10
16  K1  Somebody3    LOC1  2020-05-14  16:50
17  K1  Somebody3    LOC1  2020-05-15  05:36

The only thing I came up with was converting the time to a datetime object:

df['Time'] = df['Time'].apply(lambda x: datetime.strptime(x,'%H:%M').time())

I tried pivot tables, groupby, and for loops, and I'm out of ideas. I'd like the output to look like this:

LOC1
      Somebody1  2020-05-12  13h 10m
                 2020-05-13  14h 00m
TOTAL                        27h 10m
      Somebody2  date        hours
                 date        hours
TOTAL                        sum for somebody2
      Somebody3  date        hours
                 date        hours
TOTAL                        sum for somebody3

LOC2
      Somebody1  date        hours
                 date        hours
TOTAL                        sum for somebody1
      Somebody2  date        hours   
                 date        hours
TOTAL                        sum for somebody2

or something similar.

2 Answers:

Answer 0: (score: 1)

IIUC, use groupby and then combine_first:

import numpy as np
import pandas as pd
df['datetime'] = pd.to_datetime(df['Date'] + ' ' +  df['Time'])

df1 = df.groupby(['Name','Location', df['datetime'].dt.normalize()])\
                                  .agg(start=('datetime','first'),
                                   end=('datetime','last'))

df1['timespent'] = (df1['end'] - df1['start']) / np.timedelta64(1,'h')

# create total row.
m = df1.unstack(['Name','Location'])['timespent'].sum().unstack()
m = m.assign(TOTAL=m.sum(1)).stack().to_frame('timespent')



final = df1.drop(['start','end'],axis=1).combine_first(m)

#if you want to remove single entry days
final[final['timespent'] > 0]

                               timespent
Name      Location datetime             
Somebody1 LOC1     2020-05-12  13.166667
                   2020-05-13  14.000000
          TOTAL    NaT         27.166667
Somebody2 LOC2     2020-05-14   6.666667
          TOTAL    NaT          6.666667
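
To get closer to the "13h 10m" layout asked for in the question, the same grouping idea can be sketched as below. This is only an illustration on a subset of the sample data (Somebody1 at LOC1); the helper fmt_hm and the column names spent, spent_hm and n are made up for this example and are not part of pandas.

```python
import pandas as pd

# A small subset of the question's data (Somebody1 at LOC1 only).
df = pd.DataFrame({
    "Name": ["Somebody1"] * 5,
    "Location": ["LOC1"] * 5,
    "Date": ["2020-05-12", "2020-05-12", "2020-05-13", "2020-05-13", "2020-05-14"],
    "Time": ["07:00", "20:10", "06:00", "20:00", "06:37"],
})
df["datetime"] = pd.to_datetime(df["Date"] + " " + df["Time"])

# Earliest and latest timestamp per location, person and calendar day,
# plus the number of entries on that day.
daily = (
    df.groupby(["Location", "Name", "Date"])["datetime"]
      .agg(start="min", end="max", n="count")
)

# Keep only days that have at least two entries (a start/end pair).
daily = daily[daily["n"] >= 2].copy()
daily["spent"] = daily["end"] - daily["start"]


def fmt_hm(td):
    """Format a Timedelta as e.g. '13h 10m' (hypothetical helper)."""
    minutes = int(td.total_seconds() // 60)
    return f"{minutes // 60}h {minutes % 60:02d}m"


daily["spent_hm"] = daily["spent"].map(fmt_hm)

# Per-person totals at each location, formatted the same way.
totals = daily.groupby(["Location", "Name"])["spent"].sum().map(fmt_hm)

print(daily["spent_hm"])
print(totals)
```

Filtering on the entry count rather than on timespent > 0 is a design choice: it drops the unpaired 2020-05-14 row explicitly, and would still keep a day whose two entries happen to share the same time.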

Answer 1: (score: 0)

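You can start with grep to collect the times for every pair of rows, and then compute the time difference. For example, put the people's names in a file (one per line) and use grep: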

# list-names holds one name per line; print the Time column (field 6) for each name
while read -r name; do grep "$name" a.csv | awk '{print $6}'; done < list-names

where a.csv is:

0  K1  Somebody1    LOC1  2020-05-12  17:00
1  K1  Somebody1    LOC1  2020-05-12  20:10

In addition, to get the difference in hours, do the following:

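awk '
    { split($6, t, ":"); mins = t[1] * 60 + t[2] }   # convert HH:MM to minutes
    NR == 1 { old = mins; next }                     # remember the first time
    { print (mins - old) / 60; old = mins }          # difference in hours
' a.csv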
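
Since the question is tagged python, the same difference can also be sketched in Python with only the standard library. The function name hours_between is made up for this example, and it assumes both times fall on the same day.

```python
from datetime import datetime


def hours_between(start, end):
    """Return the difference between two 'HH:MM' strings in hours.

    Hypothetical helper; assumes both times are on the same day.
    """
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds() / 3600


# The two sample rows from a.csv: 17:00 and 20:10 (3 hours 10 minutes apart).
print(hours_between("17:00", "20:10"))
```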