I have a DataFrame of counts in 10-minute time intervals. How can I set count = 0 for intervals that are missing?
DF1
import pandas as pd
import numpy as np
df = pd.DataFrame({ 'City' : np.random.choice(['PHOENIX','ATLANTA','CHICAGO', 'MIAMI', 'DENVER'], 10000),
'Day': np.random.choice(['Monday','Tuesday','Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'], 10000),
'Time': np.random.randint(1, 86400, size=10000),
'COUNT': np.random.randint(1, 700, size=10000)})
df['Time'] = pd.to_datetime(df['Time'], unit='s').dt.round('10min').dt.strftime('%H:%M:%S')
print(df)
COUNT City Day Time
0 441 PHOENIX Thursday 10:20:00
1 641 ATLANTA Monday 14:30:00
2 661 PHOENIX Saturday 03:50:00
3 570 MIAMI Tuesday 21:00:00
4 222 CHICAGO Friday 15:00:00
DF2 - My approach is to create every 10-minute bin in a day (6 * 24 = 144 entries) and then use "not in"
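# Stop at 86400 (exclusive): 86400 s formats as '00:00:00' again and would add a 145th, duplicate bin.
df2 = pd.DataFrame({'TIME_BIN': np.arange(0, 86400, 600), })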
df2['TIME_BIN'] = pd.to_datetime(df2['TIME_BIN'], unit='s').dt.round('10min').dt.strftime('%H:%M:%S')
TIME_BIN
0 00:00:00
1 00:10:00
2 00:20:00
3 00:30:00
4 00:40:00
5 00:50:00
6 01:00:00
7 01:10:00
8 01:20:00
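For each City and Day in DF1, how can I check whether a time slot from DF2 is missing and, if so, set count = 0 for it? I basically just need to fill in all the missing time slots in DF1.
Attempt: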
for each_city in df.City.unique():
    for each_day in df.Day.unique():
        df['Time'] = df.apply(lambda row: df2['TIME_BIN'] if row['Time'] not in (df2['TIME_BIN'].tolist()) else None)
Answer 0 (score: 1)
I think you need reindex with a MultiIndex created by MultiIndex.from_product:
np.random.seed(123)
df = pd.DataFrame({ 'City' : np.random.choice(['PHOENIX','ATLANTA','CHICAGO', 'MIAMI', 'DENVER'], 10000),
'Day': np.random.choice(['Monday','Tuesday','Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'], 10000),
'Time': np.random.randint(1, 86400, size=10000),
'COUNT': np.random.randint(1, 700, size=10000)})
df['Time'] = pd.to_datetime(df['Time'], unit='s').dt.round('10min').dt.strftime('%H:%M:%S')
df = df.drop_duplicates(['City','Day','Time'])
#print(df)
# 144 bins per day; stopping at 86400 (exclusive) avoids a second '00:00:00' entry in the index.
times = (pd.to_datetime(pd.Series(np.arange(0, 86400, 600)), unit='s')
           .dt.round('10min')
           .dt.strftime('%H:%M:%S'))
mux = pd.MultiIndex.from_product([df['City'].unique(),
df['Day'].unique(),
times],names=['City','Day','Time'])
df = (df.set_index(['City','Day','Time'])
.reindex(mux, fill_value=0)
.reset_index())
print (df.head(20))
City Day Time COUNT
0 CHICAGO Wednesday 00:00:00 66
1 CHICAGO Wednesday 00:10:00 205
2 CHICAGO Wednesday 00:20:00 260
3 CHICAGO Wednesday 00:30:00 127
4 CHICAGO Wednesday 00:40:00 594
5 CHICAGO Wednesday 00:50:00 683
6 CHICAGO Wednesday 01:00:00 203
7 CHICAGO Wednesday 01:10:00 0
8 CHICAGO Wednesday 01:20:00 372
9 CHICAGO Wednesday 01:30:00 109
10 CHICAGO Wednesday 01:40:00 32
11 CHICAGO Wednesday 01:50:00 184
12 CHICAGO Wednesday 02:00:00 630
13 CHICAGO Wednesday 02:10:00 108
14 CHICAGO Wednesday 02:20:00 35
15 CHICAGO Wednesday 02:30:00 604
16 CHICAGO Wednesday 02:40:00 500
17 CHICAGO Wednesday 02:50:00 367
18 CHICAGO Wednesday 03:00:00 118
19 CHICAGO Wednesday 03:10:00 546
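One caveat with the reindex approach: reindex needs a unique index, which is why the code above calls drop_duplicates and therefore keeps only the first COUNT for each City/Day/Time. A minimal sketch on toy data, assuming you would rather sum duplicate rows than discard them (the frame `small`, its values, and the use of pd.date_range to build the bins are illustrative choices, not part of the answer above):

```python
import pandas as pd

# Toy data (hypothetical values): two cities, one day, sparse 10-minute bins.
small = pd.DataFrame({
    'City': ['MIAMI', 'MIAMI', 'DENVER'],
    'Day': ['Monday', 'Monday', 'Monday'],
    'Time': ['00:00:00', '00:00:00', '00:20:00'],
    'COUNT': [5, 7, 3],
})

# reindex needs unique keys, so collapse duplicate City/Day/Time rows first.
# Summing is an assumption; drop_duplicates would keep the first row instead.
agg = small.groupby(['City', 'Day', 'Time'])['COUNT'].sum()

# 144 bins, 00:00:00 through 23:50:00 (the date itself is arbitrary).
bins = pd.date_range('2000-01-01', periods=144, freq='10min').strftime('%H:%M:%S')

# Every City x Day x Time combination, including ones absent from the data.
full_index = pd.MultiIndex.from_product(
    [small['City'].unique(), small['Day'].unique(), bins],
    names=['City', 'Day', 'Time'],
)

# Missing combinations are created with COUNT = 0.
filled = agg.reindex(full_index, fill_value=0).reset_index()
print(filled.head())
```

Because the target index is a full product, every city gets every day, even days on which that city has no rows at all.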
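Answer 1 (score: 1)
One way is to convert the columns to categories and use groupby to compute the Cartesian product.
In fact, given that your data is largely categorical, this is a good idea anyway and would yield memory benefits for a large number of Time-City-Day combinations.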
for col in ['Time', 'City', 'Day']:
    df[col] = df[col].astype('category')
bin_cats = sorted(set(pd.Series(pd.to_datetime(np.arange(0, 86401, 600), unit='s'))\
.dt.round('10min').dt.strftime('%H:%M:%S')))
df['Time'] = df['Time'].cat.set_categories(bin_cats, ordered=True)
# observed=False keeps category combinations that have no rows; newer pandas versions may drop them by default.
res = df.groupby(['Time', 'City', 'Day'], as_index=False, observed=False)['COUNT'].sum()
res['COUNT'] = res['COUNT'].fillna(0).astype(int)
# Time City Day COUNT
# 0 00:00:00 ATLANTA Friday 521
# 1 00:00:00 ATLANTA Monday 767
# 2 00:00:00 ATLANTA Saturday 474
# 3 00:00:00 ATLANTA Sunday 1126
# 4 00:00:00 ATLANTA Thursday 157
# 5 00:00:00 ATLANTA Tuesday 720
# 6 00:00:00 ATLANTA Wednesday 0
# 7 00:00:00 CHICAGO Friday 1114
# 8 00:00:00 CHICAGO Monday 813
# 9 00:00:00 CHICAGO Saturday 137
# 10 00:00:00 CHICAGO Sunday 134
# 11 00:00:00 CHICAGO Thursday 0
# 12 00:00:00 CHICAGO Tuesday 168
# ..........
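To see why the categorical groupby fills the gaps, here is a minimal sketch on toy data. The frame `toy`, its values, and the three-bin category list are made up for illustration; only the technique (grouping on categoricals with observed=False) comes from the answer above.

```python
import pandas as pd

toy = pd.DataFrame({
    'City': ['MIAMI', 'MIAMI', 'DENVER'],
    'Time': ['00:00:00', '00:10:00', '00:00:00'],
    'COUNT': [4, 6, 9],
})

# Declare every allowed value up front, including a bin that never occurs.
toy['City'] = toy['City'].astype('category')
toy['Time'] = pd.Categorical(
    toy['Time'],
    categories=['00:00:00', '00:10:00', '00:20:00'],  # shortened bin list
    ordered=True,
)

# observed=False emits one group per category combination, even with no rows;
# the sum of an empty group is 0, which is exactly the fill value wanted here.
res = toy.groupby(['City', 'Time'], observed=False)['COUNT'].sum().reset_index()
print(res)
```

The result has 2 cities x 3 bins = 6 rows, with 0 for the combinations that never appear in `toy`.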
Answer 2 (score: 0)
You can try:
df.groupby(['City','Day']).apply(lambda x : x.set_index('Time').reindex(df2.TIME_BIN.unique()).fillna({'COUNT':0}).ffill())
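Two things to watch with this one-liner: reindex raises an error if a group contains the same Time twice, and ffill cannot fill City and Day when a group's first bins are the missing ones. A hedged variation, not the answerer's code, reindexes only the COUNT column so the group keys come back from the groupby index. The frame `toy` and the shortened `bins` index are hypothetical:

```python
import pandas as pd

toy = pd.DataFrame({
    'City': ['MIAMI', 'MIAMI', 'DENVER'],
    'Day': ['Monday', 'Monday', 'Friday'],
    'Time': ['00:10:00', '00:20:00', '00:00:00'],
    'COUNT': [4, 6, 9],
})

# Shortened bin list for illustration; the real one would hold 144 entries.
bins = pd.Index(['00:00:00', '00:10:00', '00:20:00'], name='Time')

# Reindex COUNT within each (City, Day) group. The group keys are restored
# from the groupby index, so no forward fill is needed.
filled = (toy.set_index('Time')
             .groupby(['City', 'Day'])['COUNT']
             .apply(lambda s: s.reindex(bins, fill_value=0))
             .reset_index())
print(filled)
```

Unlike a full MultiIndex.from_product, this only fills City/Day pairs that already occur in the data (here MIAMI/Monday and DENVER/Friday), and it still assumes Time is unique within each group.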