Pandas - confused about how to manipulate a DataFrame

Time: 2020-11-06 16:58:55

Tags: python pandas dataframe datetime

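I'm working with a large Excel file that I have read into a DataFrame. I clean and filter the DataFrame so that only the data I want remains. The new DataFrame looks like this: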

Date    DirectionName   FlagText
0   2018-02-02 00:00:03.000 South   Friday
1   2018-02-02 00:00:22.010 South   Friday
2   2018-02-02 00:00:22.020 South   Friday
3   2018-02-02 00:00:36.040 South   Friday
4   2018-02-02 00:00:49.070 South   Friday
... ... ... ...
445632  2018-02-23 23:59:28.070 South   Friday
445633  2018-02-23 23:59:29.000 South   Friday
445634  2018-02-23 23:59:33.090 South   Friday
445635  2018-02-23 23:59:45.070 South   Friday
445636  2018-02-23 23:59:50.080 South   Friday

I then group by day and hour so that I can see how many records fall within each hour. I'm using this code to do that:

pd.set_option('display.max_rows', None)
df['Day/Hour'] = df['Date'].apply(lambda x: "%d/%d" % (x.day, x.hour))
df.groupby(['Day/Hour', 'DirectionName']).size()

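The output shows the four dates 02/02/18, 09/02/18, 16/02/18 and 23/02/18, with the 24 hourly ranges for each date. It counts how many "South" records fall within each hour. For example, on 16 February between 00:00 and 01:00 it counted 219 South records.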

Day/Hour  DirectionName
16/0      South             219
16/1      South             163
16/10     South            1594
16/11     South            1775
16/12     South            2026
16/13     South            2111
16/14     South            2400
16/15     South            2588
16/16     South            2927
16/17     South            2690
16/18     South            2071
16/19     South            1513
16/2      South              87
16/20     South            1025
16/21     South             798
16/22     South             831
16/23     South             590
16/3      South             125
16/4      South             117
16/5      South             290
16/6      South             802
16/7      South            1760
16/8      South            1964
16/9      South            1592
2/0       South             250
2/1       South             137
2/10      South            1493
2/11      South            1716
2/12      South            1970
2/13      South            2081
2/14      South            2363
2/15      South            2583
2/16      South            2746
2/17      South            2647
2/18      South            2107
2/19      South            1521
2/2       South              92
2/20      South            1047
2/21      South             851
2/22      South             813
2/23      South             557
2/3       South              92
2/4       South             110
2/5       South             272
2/6       South             832
2/7       South            1972
2/8       South            2106
2/9       South            1695
23/0      South             214
23/1      South             123
23/10     South            1592
23/11     South            1767
23/12     South            2030
23/13     South            2046
23/14     South            2387
23/15     South            2616
23/16     South            2796
23/17     South            2581
23/18     South            1979
23/19     South            1490
23/2      South              95
23/20     South            1056
23/21     South             858
23/22     South             783
23/23     South             563
23/3      South              83
23/4      South             134
23/5      South             265
23/6      South             803
23/7      South            1859
23/8      South            2089
23/9      South            1670
9/0       South             222
9/1       South             114
9/10      South            1505
9/11      South            1688
9/12      South            1888
9/13      South            2052
9/14      South            2366
9/15      South            2656
9/16      South            2906
9/17      South            2728
9/18      South            2043
9/19      South            1488
9/2       South              87
9/20      South            1097
9/21      South             840
9/22      South             711
9/23      South             628
9/3       South              88
9/4       South             134
9/5       South             293
9/6       South             890
9/7       South            1941
9/8       South            2095
9/9       South            1639

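I'm confused about how to manipulate this new DataFrame. I'd like to be able to sum the counts that belong to the same hour across the different days.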

For example:

2/0       South             250 +
9/0       South             222 +
16/0      South             219 + 
23/0      South             214
= 250 + 222 + 219 + 214 = 905

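I need to find the total before I can find the average: 905 / 4 = 226.25. With this average I will know that, on Fridays in February 2018, an average of 226.25 records were logged between 12 AM and 1 AM.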
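The arithmetic above can be checked with a few lines of pandas. This is only a minimal sketch: it uses the four hour-0 counts quoted above, and the variable names (`hour0`, `total`, `average`) are made up for illustration.

```python
import pandas as pd

# Hour-0 counts for the four Fridays, copied from the groupby output above.
# The index is the day of the month.
hour0 = pd.Series({2: 250, 9: 222, 16: 219, 23: 214}, name="count")

total = hour0.sum()     # sum of the four counts
average = hour0.mean()  # total divided by the number of days

print(total, average)   # 905 226.25
```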

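I hope I have explained this well enough; I know it's a bit confusing. I've avoided adding the original dataset because I don't think it's necessary. Any help would be greatly appreciated.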

1 Answer:

Answer 0: (score: 0)

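As I understand it, you are unable to combine the values that share the same hour. I think this is because you stored the hour together with the day in a single column. My solution: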

# Keep day and hour as integers (not strings) so that a later comparison such as == 0 matches.
# Assumes 'Date' has a datetime64 dtype.
df['Day'] = df['Date'].dt.day
df['Hour'] = df['Date'].dt.hour

counts = df.groupby(['Day', 'Hour', 'DirectionName'])['FlagText'].count()
counts = counts.reset_index()

Because the day and the hour are now in separate columns, you can summarize everything that happened in hour zero:

counts_zero_hour = counts[counts['Hour'] == 0]['FlagText']
result = counts_zero_hour.sum()/counts_zero_hour.shape[0]
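The same idea can be extended to every hour at once by grouping the per-day counts on the hour. The following is a minimal sketch rather than a definitive solution: the small frame is made-up sample data standing in for the real one, `Date` is assumed to be a datetime64 column, and the names `n` and `hourly_mean` are illustrative.

```python
import pandas as pd

# Made-up sample standing in for the real data: two Fridays, a few rows each.
df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2018-02-02 00:00:03", "2018-02-02 00:10:00", "2018-02-02 01:05:00",
        "2018-02-09 00:20:00", "2018-02-09 01:15:00", "2018-02-09 01:45:00",
        "2018-02-09 01:50:00",
    ]),
    "DirectionName": "South",
})

# Integer day and hour columns taken from the timestamp.
df["Day"] = df["Date"].dt.day
df["Hour"] = df["Date"].dt.hour

# Number of rows per (day, hour, direction).
counts = (
    df.groupby(["Day", "Hour", "DirectionName"])
      .size()
      .reset_index(name="n")
)

# Average of the per-day counts for each hour and direction.
hourly_mean = counts.groupby(["Hour", "DirectionName"])["n"].mean()
print(hourly_mean)
```

Note that a day with no rows at all in a given hour does not appear in `counts`, so it is left out of that hour's average instead of being counted as zero.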