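I am working with a very large Excel file, which I have read into a DataFrame. I cleaned and filtered the DataFrame so that only the data I want remains. The new DataFrame looks like this: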
Date DirectionName FlagText
0 2018-02-02 00:00:03.000 South Friday
1 2018-02-02 00:00:22.010 South Friday
2 2018-02-02 00:00:22.020 South Friday
3 2018-02-02 00:00:36.040 South Friday
4 2018-02-02 00:00:49.070 South Friday
... ... ... ...
445632 2018-02-23 23:59:28.070 South Friday
445633 2018-02-23 23:59:29.000 South Friday
445634 2018-02-23 23:59:33.090 South Friday
445635 2018-02-23 23:59:45.070 South Friday
445636 2018-02-23 23:59:50.080 South Friday
I then group by day and hour so that I can see how many rows fall within each hour. I am using this code to do that:
pd.set_option('display.max_rows', None)
df['Day/Hour'] = df['Date'].apply(lambda x: "%d/%d" % (x.day, x.hour))
df.groupby(['Day/Hour', 'DirectionName']).size()
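The output shows the four dates 02/02/18, 09/02/18, 16/02/18 and 23/02/18, with all 24 hours for each date. It counts how many "South" rows fall within each hour. For example, on 16 February between 00:00 and 01:00 it counted 219 South rows.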
Day/Hour DirectionName
16/0 South 219
16/1 South 163
16/10 South 1594
16/11 South 1775
16/12 South 2026
16/13 South 2111
16/14 South 2400
16/15 South 2588
16/16 South 2927
16/17 South 2690
16/18 South 2071
16/19 South 1513
16/2 South 87
16/20 South 1025
16/21 South 798
16/22 South 831
16/23 South 590
16/3 South 125
16/4 South 117
16/5 South 290
16/6 South 802
16/7 South 1760
16/8 South 1964
16/9 South 1592
2/0 South 250
2/1 South 137
2/10 South 1493
2/11 South 1716
2/12 South 1970
2/13 South 2081
2/14 South 2363
2/15 South 2583
2/16 South 2746
2/17 South 2647
2/18 South 2107
2/19 South 1521
2/2 South 92
2/20 South 1047
2/21 South 851
2/22 South 813
2/23 South 557
2/3 South 92
2/4 South 110
2/5 South 272
2/6 South 832
2/7 South 1972
2/8 South 2106
2/9 South 1695
23/0 South 214
23/1 South 123
23/10 South 1592
23/11 South 1767
23/12 South 2030
23/13 South 2046
23/14 South 2387
23/15 South 2616
23/16 South 2796
23/17 South 2581
23/18 South 1979
23/19 South 1490
23/2 South 95
23/20 South 1056
23/21 South 858
23/22 South 783
23/23 South 563
23/3 South 83
23/4 South 134
23/5 South 265
23/6 South 803
23/7 South 1859
23/8 South 2089
23/9 South 1670
9/0 South 222
9/1 South 114
9/10 South 1505
9/11 South 1688
9/12 South 1888
9/13 South 2052
9/14 South 2366
9/15 South 2656
9/16 South 2906
9/17 South 2728
9/18 South 2043
9/19 South 1488
9/2 South 87
9/20 South 1097
9/21 South 840
9/22 South 711
9/23 South 628
9/3 South 88
9/4 South 134
9/5 South 293
9/6 South 890
9/7 South 1941
9/8 South 2095
9/9 South 1639
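I am confused about how to manipulate this new DataFrame. I want to be able to add up the counts for the same hour across the different days.
For example: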
2/0 South 250 +
9/0 South 222 +
16/0 South 219 +
23/0 South 214 +
= 250 + 222 + 219 + 214 = 905.
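I need the total before I can find the average: 905 / 4 = 226.25. With this average I will know that, on Fridays in February 2018, an average of 226.25 rows were recorded between 12 AM and 1 AM.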
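To make the target concrete, here is a minimal sketch of that calculation for the midnight hour only, using the four counts copied from the output above. The `midnight_counts` name is just for illustration; it is not part of my real code.

```python
import pandas as pd

# Counts for the 00:00-01:00 hour on each of the four Fridays,
# copied by hand from the grouped output above.
midnight_counts = pd.Series({"2/0": 250, "9/0": 222, "16/0": 219, "23/0": 214})

total = midnight_counts.sum()      # sum over the four Fridays
average = midnight_counts.mean()   # total divided by the number of Fridays
print(total, average)
```

What I am after is a way to get this for every hour without copying numbers by hand.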
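I hope I have explained this well enough; I know it is a little confusing. I have avoided posting the original dataset because I think it is unnecessary. Any help would be much appreciated.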
Answer 0 (score: 0)
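As I understand it, you cannot combine the values that share the same Hour. I think this is because you stored the hour together with the day in a single column. My solution: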
# Assumes df['Date'] is already a datetime64 column.
df['Day'] = df['Date'].dt.day    # integer day of the month
df['Hour'] = df['Date'].dt.hour  # integer hour, so counts['Hour'] == 0 can match
counts = df.groupby(['Day', 'Hour', 'DirectionName'])['FlagText'].count()
counts = counts.reset_index()
Because the day and the hour are now in separate columns, you can aggregate everything that happened in hour zero:
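# astype(int) keeps the comparison valid even if Hour was stored as a string.
counts_zero_hour = counts[counts['Hour'].astype(int) == 0]['FlagText']
result = counts_zero_hour.sum() / counts_zero_hour.shape[0]  # mean count per day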
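If you want the average for every hour at once rather than only hour zero, a second groupby over the per-day counts is one way to do it. The following is a rough sketch, not tested against your data: the tiny `df` is a made-up stand-in for your real DataFrame, and the `Count` and `hourly_mean` names are my own.

```python
import pandas as pd

# Hypothetical miniature stand-in for the real data: a few timestamps
# on two Fridays, all with direction "South".
df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2018-02-02 00:00:03", "2018-02-02 00:00:22", "2018-02-02 01:10:00",
        "2018-02-09 00:05:00", "2018-02-09 01:15:00", "2018-02-09 01:45:00",
    ]),
    "DirectionName": ["South"] * 6,
    "FlagText": ["Friday"] * 6,
})

# Keep day and hour as separate integer columns.
df["Day"] = df["Date"].dt.day
df["Hour"] = df["Date"].dt.hour

# Number of rows per (day, hour, direction).
counts = (
    df.groupby(["Day", "Hour", "DirectionName"])
      .size()
      .reset_index(name="Count")
)

# Average those per-day counts for each hour of the day.
hourly_mean = counts.groupby(["Hour", "DirectionName"])["Count"].mean()
print(hourly_mean)
```

One caveat: a day with no rows at all in a given hour does not appear in `counts`, so the mean is taken only over the days that have data for that hour. With your output every day has all 24 hours, so this should not matter here.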