Python & pandas: counting time data in 2-hour increments

Time: 2017-04-26 20:13:30

Tags: python pandas

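I'm trying to group a bunch of time-series data into 2-hour blocks. I'm very new to this, so please bear with me. Based on earlier research, I think I can use pandas for this.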

I have a dataset that looks like this (mytime):

['15:23', '14:41', '13:54', '07:13', '20:21', '13:15', '14:48', '12:06', '08:37', '06:32', '07:04', '14:20', '16:28',     
'06:49', '08:39', '09:15', '08:54', '05:37', '14:43', '06:20', '11:25', '11:05', '09:28', '14:05', '14:24', '15:30', 
'13:28', '16:55', '09:29', '17:44', '07:24', '09:37', '06:47', '14:35', '10:55', '22:29', '06:24', '09:25', '06:45', 
'23:49', '19:34', '01:31', '14:22', '13:58', '09:08', '05:11', '08:09', '08:52', '02:50', '12:51', '17:33', '07:07', 
'08:11', '10:06', '23:48', '22:27', '11:15', '15:09', '16:45', '20:42', '12:12', '07:08', '16:13', '20:40', '17:26', 
'18:57', '15:07', '09:19', '09:10', '09:17', '09:26', '14:18', '06:31', '14:13', '14:01', '08:57', '21:34']

I want to take this dataset and basically see output like this:

0-2: 4
2-4: 7
4-6: 3
6-8: 3
8-10: 2
10-12: 5
12-14: 14
....etc

Here is a subset of my code:

import csv
import datetime
from collections import Counter
import pandas as pd
import numpy as np

mycount = Counter()
mytime = []
with open('temp_dates.csv') as csvfile2:
    readCSV2 = csv.reader(csvfile2, delimiter=',')
    incoming = []
    for row in readCSV2:
         readin = row[0]
         time = row[1]
         year, month, day = (int(x) for x in readin.split('-'))
         ans = datetime.date(year, month, day)
         wkday = ans.strftime("%A")
         incoming.append([wkday,time])
         mycount[wkday] += 1
         mytime.append(time)
    with open('new_dates2.csv', 'w') as out_file:
        writer = csv.writer(out_file)
        writer.writerows(incoming)
csvfile2.close()

for key,value in sorted(mycount.iteritems()):
    daylist = key, value
    print(daylist)

#print(mytime)
df = pd.DataFrame()
#print(df)
df.groupby([df['mytime'],pd.TimeGrouper(freq='2H')])

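I'm guessing my first problem is that the data isn't in a format that TimeGrouper can understand? Second, I'm probably missing something that tells the DataFrame what to look at? Any help would be appreciated.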

As requested, here is a snippet of the original source CSV file (we are only discussing column 2, which populates 'mytime').

Sunday,14:35
Sunday,10:55
Friday,22:29
Friday,06:24
Thursday,09:25
Wednesday,06:45

3 answers:

Answer 0 (score: 1)

UPDATE:

In [96]: mytime = ['15:23', '14:41', '13:54', '07:13', '20:21', '13:15', '14:48', '12:06', '08:37', '06:32', '07:04', '14:20', '16:28',
    ...:
    ...: '06:49', '08:39', '09:15', '08:54', '05:37', '14:43', '06:20', '11:25', '11:05', '09:28', '14:05', '14:24', '15:30',
    ...: '13:28', '16:55', '09:29', '17:44', '07:24', '09:37', '06:47', '14:35', '10:55', '22:29', '06:24', '09:25', '06:45',
    ...: '23:49', '19:34', '01:31', '14:22', '13:58', '09:08', '05:11', '08:09', '08:52', '02:50', '12:51', '17:33', '07:07',
    ...: '08:11', '10:06', '23:48', '22:27', '11:15', '15:09', '16:45', '20:42', '12:12', '07:08', '16:13', '20:40', '17:26',
    ...: '18:57', '15:07', '09:19', '09:10', '09:17', '09:26', '14:18', '06:31', '14:13', '14:01', '08:57', '21:34']

In [97]: s = pd.to_datetime(mytime).to_series()

In [98]: s
Out[98]:
2017-04-26 15:23:00   2017-04-26 15:23:00
2017-04-26 14:41:00   2017-04-26 14:41:00
2017-04-26 13:54:00   2017-04-26 13:54:00
2017-04-26 07:13:00   2017-04-26 07:13:00
2017-04-26 20:21:00   2017-04-26 20:21:00
2017-04-26 13:15:00   2017-04-26 13:15:00
2017-04-26 14:48:00   2017-04-26 14:48:00
2017-04-26 12:06:00   2017-04-26 12:06:00
2017-04-26 08:37:00   2017-04-26 08:37:00
2017-04-26 06:32:00   2017-04-26 06:32:00
                              ...
2017-04-26 09:19:00   2017-04-26 09:19:00
2017-04-26 09:10:00   2017-04-26 09:10:00
2017-04-26 09:17:00   2017-04-26 09:17:00
2017-04-26 09:26:00   2017-04-26 09:26:00
2017-04-26 14:18:00   2017-04-26 14:18:00
2017-04-26 06:31:00   2017-04-26 06:31:00
2017-04-26 14:13:00   2017-04-26 14:13:00
2017-04-26 14:01:00   2017-04-26 14:01:00
2017-04-26 08:57:00   2017-04-26 08:57:00
2017-04-26 21:34:00   2017-04-26 21:34:00
dtype: datetime64[ns]

In [106]: s.groupby(pd.cut(s.dt.hour,
     ...:                  bins=np.arange(26, step=2),
     ...:                  right=False,
     ...:                  include_lowest=True)) \
     ...:  .size()
     ...:
Out[106]:
[0, 2)       1
[2, 4)       1
[4, 6)       2
[6, 8)      12
[8, 10)     17
[10, 12)     5
[12, 14)     7
[14, 16)    15
[16, 18)     7
[18, 20)     2
[20, 22)     4
[22, 24)     4
dtype: int64
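
If the goal is the "0-2", "2-4" style labels from the question, one possible variation is to pass `labels` to `pd.cut`. This is only a sketch: the short `sample` list (taken from the CSV snippet in the question) and the names `edges`, `labels`, `blocks` and `counts` are illustrative and not part of the session above.

```python
import numpy as np
import pandas as pd

# Illustrative sample: the six times from the CSV snippet in the question.
sample = ['14:35', '10:55', '22:29', '06:24', '09:25', '06:45']

# Parse "HH:MM" strings and keep only the hour of day.
hours = pd.to_datetime(pd.Series(sample), format='%H:%M').dt.hour

# Bin edges 0, 2, ..., 24 and one "lo-hi" label per 2-hour block.
edges = np.arange(0, 26, 2)
labels = [f"{lo}-{lo + 2}" for lo in edges[:-1]]

# right=False makes each block left-closed, e.g. hour 6 falls in "6-8".
blocks = pd.cut(hours, bins=edges, right=False, labels=labels)

# observed=False keeps blocks that contain no rows.
counts = hours.groupby(blocks, observed=False).size()
print(counts)
```

With `observed=False` every 2-hour block is listed, including those with a count of 0.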

For the CSV snippet from the question:

df = pd.read_csv('/path/to/file.csv', parse_dates=[1], names=['date','time'])

In [55]: df
Out[55]:
        date                time
0     Sunday 2017-04-26 14:35:00
1     Sunday 2017-04-26 10:55:00
2     Friday 2017-04-26 22:29:00
3     Friday 2017-04-26 06:24:00
4   Thursday 2017-04-26 09:25:00
5  Wednesday 2017-04-26 06:45:00

In [59]: df.groupby(pd.cut(df.time.dt.hour, bins=np.arange(26, step=2), right=False, include_lowest=True)).size()
Out[59]:
time
[0, 2)      0
[2, 4)      0
[4, 6)      0
[6, 8)      2
[8, 10)     1
[10, 12)    1
[12, 14)    0
[14, 16)    1
[16, 18)    0
[18, 20)    0
[20, 22)    0
[22, 24)    1
dtype: int64
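
As for the `TimeGrouper` attempt in the question: time-based grouping of that kind needs real datetimes in the index, not strings in a column. Below is a rough sketch of that route using `resample`, assuming the times are parsed with an explicit `%H:%M` format (which places them all on the placeholder date 1900-01-01). The names `sample`, `events` and `counts` are made up for illustration.

```python
import pandas as pd

# Illustrative sample: the six times from the CSV snippet in the question.
sample = ['14:35', '10:55', '22:29', '06:24', '09:25', '06:45']

# Parsing a list gives a DatetimeIndex; the date part is a placeholder.
idx = pd.to_datetime(sample, format='%H:%M')

# One row per observation, indexed by its timestamp and sorted by time.
events = pd.Series(1, index=idx).sort_index()

# Sum the ones within each 2-hour window (windows are left-closed).
counts = events.resample(pd.Timedelta(hours=2)).sum()
print(counts)
```

Unlike the `pd.cut` version, this only covers the span between the earliest and latest block present in the data, so blocks before the first or after the last observation are not listed.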

Answer 1 (score: 0)

This is what I've got so far. I'm still working on the sorting, as you'll see from the output:

data = ['15:23', '14:41', '13:54', '07:13', '20:21', '13:15', '14:48', '12:06', '08:37', '06:32', '07:04', '14:20', '16:28',     
'06:49', '08:39', '09:15', '08:54', '05:37', '14:43', '06:20', '11:25', '11:05', '09:28', '14:05', '14:24', '15:30', 
'13:28', '16:55', '09:29', '17:44', '07:24', '09:37', '06:47', '14:35', '10:55', '22:29', '06:24', '09:25', '06:45', 
'23:49', '19:34', '01:31', '14:22', '13:58', '09:08', '05:11', '08:09', '08:52', '02:50', '12:51', '17:33', '07:07', 
'08:11', '10:06', '23:48', '22:27', '11:15', '15:09', '16:45', '20:42', '12:12', '07:08', '16:13', '20:40', '17:26', 
'18:57', '15:07', '09:19', '09:10', '09:17', '09:26', '14:18', '06:31', '14:13', '14:01', '08:57', '21:34']


import pandas as pd

df = pd.DataFrame({'mytime': data})

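# Floor each time to the start of its 2-hour block, keeping only the time of day
df['mytime'] = pd.to_datetime(df['mytime']).dt.floor('2H').dt.time
# Build a string label such as '6-8' for each block
df['hour'] = df.mytime.apply(lambda x: str(x.hour) + '-' + str(x.hour + 2))
# Count rows per label; string labels sort lexicographically ('10-12' before '2-4')
df = df.groupby('hour').size()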
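
One way the sorting might be handled is to count by the integer start hour of each block first, sort on that, and only then build the string labels. A minimal sketch under that assumption follows. The short `sample` list and the names `start` and `counts` are illustrative, and integer division replaces `dt.floor` here.

```python
import pandas as pd

# Illustrative sample: the six times from the CSV snippet in the question.
sample = ['14:35', '10:55', '22:29', '06:24', '09:25', '06:45']

times = pd.to_datetime(pd.Series(sample), format='%H:%M')

# Start hour of the 2-hour block each time belongs to (0, 2, ..., 22).
start = times.dt.hour // 2 * 2

# Count per block and sort numerically while the index is still integers.
counts = start.value_counts().sort_index()

# Only now convert the sorted integer index into "lo-hi" labels.
counts.index = [f"{h}-{h + 2}" for h in counts.index]
print(counts)
```

Because the labels are created after sorting, "6-8" comes before "10-12". Blocks with no observations do not appear in this version.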

Answer 2 (score: 0)

Here is an approach using numpy's histogram function:

import numpy as np
data = ['15:23', '14:41', '13:54', '07:13', '20:21', '13:15', '14:48', '12:06', '08:37', '06:32', '07:04', '14:20', '16:28','06:49', '08:39', '09:15','08:54', '05:37', '14:43', '06:20', '11:25', '11:05', '09:28', '14:05','14:24', '15:30', '13:28', '16:55', '09:29', '17:44', '07:24', '09:37','06:47', '14:35', '10:55', '22:29', '06:24', '09:25', '06:45', '23:49','19:34', '01:31', '14:22', '13:58', '09:08', '05:11', '08:09', '08:52','02:50', '12:51', '17:33', '07:07', '08:11', '10:06', '23:48', '22:27','11:15', '15:09', '16:45', '20:42', '12:12', '07:08', '16:13', '20:40','17:26', '18:57', '15:07', '09:19', '09:10', '09:17', '09:26', '14:18', '06:31', '14:13', '14:01', '08:57', '21:34']
time = [int(h) + int(m)/60 for h, m in (y.split(':') for y in data)]
bins = list(range(0, 26, 2))
counts, bins = np.histogram(time, bins)
dict(zip(bins, counts))

Result:

{0: 1,
 2: 1,
 4: 2,
 6: 12,
 8: 17,
 10: 5,
 12: 7,
 14: 15,
 16: 7,
 18: 2,
 20: 4,
 22: 4}
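
To get closer to the "0-2: n" layout asked for in the question, the bin edges could be paired up into labels. A small sketch, again on an illustrative `sample` list (the six times from the question's CSV snippet), with `edges` and `table` as made-up names:

```python
import numpy as np

# Illustrative sample: the six times from the CSV snippet in the question.
sample = ['14:35', '10:55', '22:29', '06:24', '09:25', '06:45']

# Convert "HH:MM" to fractional hours, e.g. '06:45' -> 6.75.
hours = [int(h) + int(m) / 60 for h, m in (t.split(':') for t in sample)]

# Edges 0, 2, ..., 24 give twelve 2-hour bins.
edges = np.arange(0, 26, 2)
counts, _ = np.histogram(hours, bins=edges)

# Pair each bin's lower and upper edge into a "lo-hi" label.
table = {f"{lo}-{hi}": int(n)
         for lo, hi, n in zip(edges[:-1], edges[1:], counts)}

for label, n in table.items():
    print(f"{label}: {n}")
```

The bins are left-closed except for the last one, which `np.histogram` closes on both sides. That makes no difference here because no time of day reaches 24.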