I fetch documents from MongoDB that look like this:
{
"amount": 1200,
"date_closed": "2012-07-02 17:00:00"
},
{
"amount": 0,
"date_closed": "2012-08-03 16:00:00"
},
{
"amount": 0,
"date_closed": "2012-08-04 20:00:00"
},
{
"amount": 0,
"date_closed": "2012-08-04 22:00:00"
}
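I get a timestamp from the user as a parameter named user_time. Its value is 1343287040, which corresponds to datetime.datetime(2012, 7, 26, 11, 47, 20).
Here is my solution for filling the gaps.
First I build a date in the format YYYY-mm-dd 00:00:00 with the following code: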
import datetime
hourly_date = datetime.datetime.fromtimestamp(user_time).strftime('%Y-%m-%d 00:00:00')
user_time is the start date. Next I generate hourly records from user_time until today. The following code produces the hourly date range in the format I want:
date_range = pandas.date_range(start=hourly_date, end=datetime.datetime.today(), freq='H')
date_range = date_range.values.astype('<M8[h]').astype(str)
hourly_date = []
for i_hourly in date_range:
    tmp_date = pandas.to_datetime(str(i_hourly)).strftime('%Y-%m-%d %H:00:00')
    hourly_date.append(tmp_date)
After creating this hourly template range from user_time until today, I compare it with the date_closed field of the documents returned from MongoDB:
records_len = len(records)
for i_hourly in hourly_date:
    i = 0
    for record in records:
        i += 1
        if i_hourly in record['date_closed']:
            break  # break from innermost loop
        elif records_len == i and i_hourly not in record['date_closed']:
            records.append({"amount": 0, "date_closed": i_hourly})
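records contains many documents, from 2012 until today. What I am trying to do is find the dates and hours that are missing from the returned documents. If an hour is missing, it needs to be added to the records to fill the gap; otherwise I break out of the innermost loop.
This code takes about 57 seconds, which is far too long. Is there a better way to fill the gaps at hourly resolution?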
Edit:
amount date_closed
0 21800 2015-07-21 10:00:00
1 5450 2015-07-05 04:00:00
2 571160 2015-06-22 12:00:00
3 65400 2015-06-15 12:00:00
4 10900 2015-06-15 09:00:00
5 109000 2015-06-14 07:00:00
6 109000 2015-06-14 04:00:00
7 1193550 2015-06-11 06:00:00
8 10900 2015-06-11 05:00:00
9 21800 2015-06-09 10:00:00
10 10900 2015-05-31 05:00:00
11 0 2015-05-30 09:00:00
12 114450 2015-05-19 13:00:00
13 261600 2015-05-19 08:00:00
14 108000 2015-05-11 08:00:00
15 2180 2015-05-11 07:00:00
16 344870 2015-05-05 13:00:00
17 70850 2015-05-05 12:00:00
18 5450 2015-05-05 05:00:00
19 109000 2015-05-03 12:00:00
20 327000 2015-05-03 11:00:00
21 310650 2015-04-30 05:00:00
22 38150 2015-04-28 13:00:00
23 26160 2015-04-27 07:00:00
24 109000 2015-04-22 12:00:00
25 97200 2015-03-09 08:00:00
26 21800 2015-07-11 05:00:00
27 26160 2015-05-20 05:00:00
28 37800 2015-03-03 07:00:00
29 130800 2015-06-29 06:00:00
.. ... ...
161 2180 2015-05-25 09:00:00
162 26160 2015-05-09 11:00:00
163 108000 2015-03-03 11:00:00
164 3337200 2014-09-13 05:00:00
165 5249880 2014-09-10 05:00:00
166 712800 2014-08-10 09:00:00
167 151200 2015-02-23 06:00:00
168 48600 2014-08-10 11:00:00
169 6540 2015-04-19 10:00:00
170 172800 2014-09-01 09:00:00
171 1370520 2014-10-15 09:00:00
172 421200 2014-07-26 09:00:00
173 86400 2015-03-01 12:00:00
174 118800 2015-02-21 12:00:00
175 97200 2014-09-17 07:00:00
176 54500 2015-04-23 07:00:00
177 1185840 2014-09-09 06:00:00
178 119016 2015-02-18 09:00:00
179 32400 2014-11-05 08:00:00
180 345600 2014-08-09 10:00:00
181 151200 2015-02-18 12:00:00
182 168480 2014-10-09 06:00:00
183 5668920 2014-10-04 21:00:00
184 669600 2014-08-06 12:00:00
185 194400 2014-08-02 07:00:00
186 313920 2015-06-23 08:00:00
187 6540 2015-05-04 09:00:00
188 669600 2014-07-23 10:00:00
189 64800 2015-01-22 06:00:00
190 669600 2014-08-25 04:00:00
[191 rows x 2 columns]
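It shows that I only have 191 records, which are the ones returned from Mongo! I expect to see an hourly list of about 121,000 records, of which 191 come from the data above.
I think the problem is that the two lists are not being merged together.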
Answer 0 (score: 1)
You can first set the date_closed column as the index, and then .reindex against hourly_date_rng to fill in the missing records.
Here is an example.
import json
import pandas as pd
json_data = [
{
"amount": 0,
"date_closed": "2012-08-03 16:00:00"
},
{
"amount": 0,
"date_closed": "2012-08-04 20:00:00"
},
{
"amount": 0,
"date_closed": "2012-08-04 22:00:00"
}
]
df = pd.read_json(json.dumps(json_data), orient='records')
df
amount date_closed
0 0 2012-08-03 16:00:00
1 0 2012-08-04 20:00:00
2 0 2012-08-04 22:00:00
hourly_date_rng looks like this:
hourly_date_rng = pd.date_range(start='2012-08-04 12:00:00', end='2012-08-4 23:00:00', freq='H')
hourly_date_rng.name = 'date_closed'
hourly_date_rng
DatetimeIndex(['2012-08-04 12:00:00', '2012-08-04 13:00:00',
'2012-08-04 14:00:00', '2012-08-04 15:00:00',
'2012-08-04 16:00:00', '2012-08-04 17:00:00',
'2012-08-04 18:00:00', '2012-08-04 19:00:00',
'2012-08-04 20:00:00', '2012-08-04 21:00:00',
'2012-08-04 22:00:00', '2012-08-04 23:00:00'],
dtype='datetime64[ns]', name='date_closed', freq='H', tz=None)
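For the real data the range would start at midnight of the user_time day instead of a hard-coded string. A minimal sketch, assuming the timestamp is interpreted as UTC (the question's own code uses datetime.fromtimestamp, which applies the local timezone) and using a fixed end only to keep the example reproducible:

```python
import pandas as pd

user_time = 1343287040  # epoch seconds, the value given in the question

# Assumption: read the timestamp as UTC, then truncate it to 00:00:00.
start = pd.to_datetime(user_time, unit='s').normalize()

# Fixed end for a reproducible example; pd.Timestamp.now() would be the
# "until today" equivalent.
end = pd.Timestamp('2012-07-27 00:00:00')

# 'h' is the hourly frequency alias (older pandas releases also spelled it 'H').
hourly_date_rng = pd.date_range(start=start, end=end, freq='h', name='date_closed')
```

Because the range is built directly as a DatetimeIndex, there is no need to format each hour as a string and parse it back.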
Align the index and fill the gaps:
# make the column datetime object instead of string
df['date_closed'] = pd.to_datetime(df['date_closed'])
# align the index using .reindex
df.set_index('date_closed').reindex(hourly_date_rng).fillna(0).reset_index()
date_closed amount
0 2012-08-04 12:00:00 0
1 2012-08-04 13:00:00 0
2 2012-08-04 14:00:00 0
3 2012-08-04 15:00:00 0
4 2012-08-04 16:00:00 0
5 2012-08-04 17:00:00 0
6 2012-08-04 18:00:00 0
7 2012-08-04 19:00:00 0
8 2012-08-04 20:00:00 0
9 2012-08-04 21:00:00 0
10 2012-08-04 22:00:00 0
11 2012-08-04 23:00:00 0
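The steps above could be wrapped in a helper, roughly as sketched here. The function name fill_hourly_gaps and its parameters are hypothetical, and summing records that fall in the same hour is my assumption about what you want. Some aggregation is needed in that case anyway, because .reindex refuses an index with duplicate labels.

```python
import pandas as pd

def fill_hourly_gaps(records, start, end):
    """Return one row per hour in [start, end]; hours without a record get amount 0."""
    df = pd.DataFrame(records)
    df['date_closed'] = pd.to_datetime(df['date_closed'])
    # Assumption: records sharing an hour are summed, which also gives
    # .reindex the unique index it requires.
    hourly = df.groupby('date_closed')['amount'].sum()
    rng = pd.date_range(start=start, end=end, freq='h', name='date_closed')
    # fill_value=0 skips the NaN step, so amount keeps its integer dtype.
    return hourly.reindex(rng, fill_value=0).reset_index()

# Made-up sample records, two of them in the same hour.
records = [
    {"amount": 1200, "date_closed": "2012-08-04 20:00:00"},
    {"amount": 300, "date_closed": "2012-08-04 20:00:00"},
    {"amount": 50, "date_closed": "2012-08-04 22:00:00"},
]
filled = fill_hourly_gaps(records, '2012-08-04 19:00:00', '2012-08-04 23:00:00')
```

Passing fill_value=0 to .reindex instead of calling .fillna(0) afterwards is a small design choice: the amounts never pass through NaN, so they are not converted to floats.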
Convert the result back to JSON:
result = df.set_index('date_closed').reindex(hourly_date_rng).fillna(0).reset_index()
# convert the date_closed column to strings first so the JSON holds readable dates
result['date_closed'] = result['date_closed'].dt.strftime('%Y-%m-%d %H:%M:%S')
# to json function
json_result = result.to_json(orient='records')
# print out the data with pretty print
from pprint import pprint
pprint(json.loads(json_result))
[{'amount': 0.0, 'date_closed': '2012-08-04 12:00:00'},
{'amount': 0.0, 'date_closed': '2012-08-04 13:00:00'},
{'amount': 0.0, 'date_closed': '2012-08-04 14:00:00'},
{'amount': 0.0, 'date_closed': '2012-08-04 15:00:00'},
{'amount': 0.0, 'date_closed': '2012-08-04 16:00:00'},
{'amount': 0.0, 'date_closed': '2012-08-04 17:00:00'},
{'amount': 0.0, 'date_closed': '2012-08-04 18:00:00'},
{'amount': 0.0, 'date_closed': '2012-08-04 19:00:00'},
{'amount': 0.0, 'date_closed': '2012-08-04 20:00:00'},
{'amount': 0.0, 'date_closed': '2012-08-04 21:00:00'},
{'amount': 0.0, 'date_closed': '2012-08-04 22:00:00'},
{'amount': 0.0, 'date_closed': '2012-08-04 23:00:00'}]
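If you would rather stay with plain lists of dicts, the slow part of the original loop is the scan over records for every hour. Below is a sketch of a different technique (a dictionary lookup, not the pandas approach above). The function name is hypothetical, and it assumes that every date_closed string is already on the hour in exactly the '%Y-%m-%d %H:00:00' format and that there is at most one record per hour.

```python
from datetime import datetime, timedelta

FMT = '%Y-%m-%d %H:00:00'

def fill_hourly_gaps_plain(records, start, end):
    # Index the existing records by hour string so each lookup is O(1).
    # Assumption: one record per hour (a later duplicate would overwrite).
    by_hour = {r['date_closed']: r for r in records}
    filled = []
    current = start.replace(minute=0, second=0, microsecond=0)
    while current <= end:
        key = current.strftime(FMT)
        # Reuse the stored record if the hour exists, else add a zero row.
        filled.append(by_hour.get(key, {"amount": 0, "date_closed": key}))
        current += timedelta(hours=1)
    return filled

# Made-up sample records.
sample = [
    {"amount": 5, "date_closed": "2012-08-04 20:00:00"},
    {"amount": 7, "date_closed": "2012-08-04 22:00:00"},
]
plain = fill_hourly_gaps_plain(sample, datetime(2012, 8, 4, 20), datetime(2012, 8, 4, 23))
```

This does one pass over the records and one pass over the hours, instead of rescanning the whole records list for each hour.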