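Check whether a Pandas column value is not in a list

Asked: 2018-03-16 14:16:58

Tags: python pandas

I have a DataFrame of counts recorded in 10-minute time intervals. How can I set count = 0 for any time interval that is missing?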

DF1

import pandas as pd
import numpy as np

df = pd.DataFrame({ 'City' : np.random.choice(['PHOENIX','ATLANTA','CHICAGO', 'MIAMI', 'DENVER'], 10000),
                    'Day': np.random.choice(['Monday','Tuesday','Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'], 10000),
                    'Time': np.random.randint(1, 86400, size=10000),
                    'COUNT': np.random.randint(1, 700, size=10000)})

df['Time'] = pd.to_datetime(df['Time'], unit='s').dt.round('10min').dt.strftime('%H:%M:%S')
print(df)

      COUNT     City        Day      Time
0       441  PHOENIX   Thursday  10:20:00
1       641  ATLANTA     Monday  14:30:00
2       661  PHOENIX   Saturday  03:50:00
3       570    MIAMI    Tuesday  21:00:00
4       222  CHICAGO     Friday  15:00:00

DF2 - My approach is to create all the 10-minute slots in one day (6 * 24 = 144 entries) and then use "not in".

# stop at 86400 (exclusive) so there are exactly 144 bins and 00:00:00 appears only once
df2 = pd.DataFrame({'TIME_BIN': np.arange(0, 86400, 600), })
df2['TIME_BIN'] = pd.to_datetime(df2['TIME_BIN'], unit='s').dt.round('10min').dt.strftime('%H:%M:%S')

     TIME_BIN
0    00:00:00
1    00:10:00
2    00:20:00
3    00:30:00
4    00:40:00
5    00:50:00
6    01:00:00
7    01:10:00
8    01:20:00
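
The bins can also be generated directly as time labels. This is a minimal sketch, assuming pd.date_range is an acceptable substitute for the np.arange approach; the name time_bins and the date 2018-01-01 are illustrative only, since the date part is discarded.

```python
import pandas as pd

# One label per 10-minute slot, from 00:00:00 up to and including 23:50:00.
time_bins = pd.date_range(
    "2018-01-01 00:00:00", "2018-01-01 23:50:00", freq="10min"
).strftime("%H:%M:%S")

print(len(time_bins))  # 6 slots per hour * 24 hours
print(time_bins[:3].tolist())
```

Because the range stops at 23:50:00, no label is repeated, which matters if the bins are later used as a reindex target.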

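For each city and day in DF1, how can I check whether a time slot from DF2 is missing and, if so, add it with count = 0? I basically just need to fill in all the missing time slots in DF1.

Attempt: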

for each_city in df.City.unique():
    for each_day in df.Day.unique():
        df['Time'] = df.apply(lambda row: df2['TIME_BIN'] if row['Time'] not in (df2['TIME_BIN'].tolist()) else None)
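
The "not in" test in the attempt above can be written as a vectorised membership check. The sketch below assumes a made-up two-row group standing in for one (City, Day) slice of DF1; group, all_bins and missing are hypothetical names.

```python
import pandas as pd

# Tiny stand-in for one (City, Day) group of DF1; the values are invented.
group = pd.DataFrame({"Time": ["00:00:00", "00:20:00"], "COUNT": [5, 7]})
all_bins = pd.Series(
    ["00:00:00", "00:10:00", "00:20:00", "00:30:00"], name="TIME_BIN"
)

# Series.isin gives a boolean mask; ~ inverts it to select bins NOT in the group.
missing = all_bins[~all_bins.isin(group["Time"])]
print(missing.tolist())  # ['00:10:00', '00:30:00']
```

This only identifies the missing slots; adding them back with a zero count still requires a reindex or a merge.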

3 Answers:

Answer 0: (score: 1)

I think you need reindex with a MultiIndex built by MultiIndex.from_product:

np.random.seed(123)
df = pd.DataFrame({ 'City' : np.random.choice(['PHOENIX','ATLANTA','CHICAGO', 'MIAMI', 'DENVER'], 10000),
                    'Day': np.random.choice(['Monday','Tuesday','Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'], 10000),
                    'Time': np.random.randint(1, 86400, size=10000),
                    'COUNT': np.random.randint(1, 700, size=10000)})

df['Time'] = pd.to_datetime(df['Time'], unit='s').dt.round('10min').dt.strftime('%H:%M:%S')
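# reindex needs a unique index, so keep only the first row per (City, Day, Time)
df = df.drop_duplicates(['City','Day','Time'])
#print(df)
# 144 bins: stop at 86400 (exclusive) so 00:00:00 is not generated twice
times = (pd.to_datetime(pd.Series(np.arange(0, 86400, 600)), unit='s')
           .dt.round('10min')
           .dt.strftime('%H:%M:%S'))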
mux = pd.MultiIndex.from_product([df['City'].unique(),
                                  df['Day'].unique(), 
                                  times],names=['City','Day','Time'])
df = (df.set_index(['City','Day','Time'])
        .reindex(mux, fill_value=0)
        .reset_index())

print (df.head(20))
       City        Day      Time  COUNT
0   CHICAGO  Wednesday  00:00:00     66
1   CHICAGO  Wednesday  00:10:00    205
2   CHICAGO  Wednesday  00:20:00    260
3   CHICAGO  Wednesday  00:30:00    127
4   CHICAGO  Wednesday  00:40:00    594
5   CHICAGO  Wednesday  00:50:00    683
6   CHICAGO  Wednesday  01:00:00    203
7   CHICAGO  Wednesday  01:10:00      0
8   CHICAGO  Wednesday  01:20:00    372
9   CHICAGO  Wednesday  01:30:00    109
10  CHICAGO  Wednesday  01:40:00     32
11  CHICAGO  Wednesday  01:50:00    184
12  CHICAGO  Wednesday  02:00:00    630
13  CHICAGO  Wednesday  02:10:00    108
14  CHICAGO  Wednesday  02:20:00     35
15  CHICAGO  Wednesday  02:30:00    604
16  CHICAGO  Wednesday  02:40:00    500
17  CHICAGO  Wednesday  02:50:00    367
18  CHICAGO  Wednesday  03:00:00    118
19  CHICAGO  Wednesday  03:10:00    546
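
The same idea on a miniature, made-up dataset may make the mechanics easier to follow. This sketch sums duplicate keys instead of dropping them, which is an assumption about what is wanted rather than part of the answer above; small, bins, summed and full_index are illustrative names.

```python
import pandas as pd

# Invented miniature of DF1 with one duplicated (City, Day, Time) key.
small = pd.DataFrame({
    "City": ["MIAMI", "MIAMI", "DENVER"],
    "Day": ["Monday", "Monday", "Monday"],
    "Time": ["00:00:00", "00:00:00", "00:10:00"],
    "COUNT": [3, 4, 9],
})
bins = ["00:00:00", "00:10:00"]  # only two slots, to keep the output short

# Sum duplicates so the index is unique (an alternative to drop_duplicates).
summed = small.groupby(["City", "Day", "Time"])["COUNT"].sum()

# Build every City x Day x Time combination, then fill absent ones with 0.
full_index = pd.MultiIndex.from_product(
    [small["City"].unique(), small["Day"].unique(), bins],
    names=["City", "Day", "Time"],
)
filled = summed.reindex(full_index, fill_value=0).reset_index()
print(filled)
```

The result has one row per combination (2 cities x 1 day x 2 slots), with COUNT set to 0 wherever the original data had no row.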

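Answer 1: (score: 1)

One approach is to convert the columns to categoricals and use groupby to compute the Cartesian product.

In fact, given that your data is largely categorical, this is a good idea anyway and should reduce memory use when there are many Time-City-Day combinations.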

for col in ['Time', 'City', 'Day']:
    df[col] = df[col].astype('category')

bin_cats = sorted(set(pd.Series(pd.to_datetime(np.arange(0, 86401, 600), unit='s'))\
                                .dt.round('10min').dt.strftime('%H:%M:%S')))

df['Time'] = df['Time'].cat.set_categories(bin_cats, ordered=True)

# observed=False keeps category combinations that have no rows (their sum is 0)
res = df.groupby(['Time', 'City', 'Day'], as_index=False, observed=False)['COUNT'].sum()
res['COUNT'] = res['COUNT'].fillna(0).astype(int)

#           Time     City        Day  COUNT
# 0     00:00:00  ATLANTA     Friday    521
# 1     00:00:00  ATLANTA     Monday    767
# 2     00:00:00  ATLANTA   Saturday    474
# 3     00:00:00  ATLANTA     Sunday   1126
# 4     00:00:00  ATLANTA   Thursday    157
# 5     00:00:00  ATLANTA    Tuesday    720
# 6     00:00:00  ATLANTA  Wednesday      0
# 7     00:00:00  CHICAGO     Friday   1114
# 8     00:00:00  CHICAGO     Monday    813
# 9     00:00:00  CHICAGO   Saturday    137
# 10    00:00:00  CHICAGO     Sunday    134
# 11    00:00:00  CHICAGO   Thursday      0
# 12    00:00:00  CHICAGO    Tuesday    168
# ..........
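
A scaled-down sketch of the categorical approach, on invented data, shows where the zero rows come from. It assumes a pandas version in which groupby accepts the observed parameter; small, bins and res are illustrative names.

```python
import pandas as pd

small = pd.DataFrame({
    "City": ["MIAMI", "DENVER"],
    "Day": ["Monday", "Monday"],
    "Time": ["00:00:00", "00:10:00"],
    "COUNT": [3, 9],
})
bins = ["00:00:00", "00:10:00", "00:20:00"]

for col in ["City", "Day"]:
    small[col] = small[col].astype("category")
# Declare all bins as categories, including the one that never occurs.
small["Time"] = pd.Categorical(small["Time"], categories=bins, ordered=True)

# observed=False requests every category combination, including empty ones.
res = (
    small.groupby(["Time", "City", "Day"], observed=False)["COUNT"]
         .sum()
         .reset_index()
)
print(res)
```

With 3 time categories, 2 cities and 1 day, the result has 6 rows, and the combinations without data carry a COUNT of 0.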

Answer 2: (score: 0)

You could also try a per-group reindex:

df.groupby(['City','Day']).apply(lambda x : x.set_index('Time').reindex(df2.TIME_BIN.unique()).fillna({'COUNT':0}).ffill())
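
The per-group reindex can also be spelled out as an explicit loop. This is a sketch on invented data, not the answer's exact method: it assumes Time is unique within each (City, Day) group, and it re-attaches City and Day from the group key instead of forward-filling them. The names small, bins, pieces and filled are illustrative.

```python
import pandas as pd

small = pd.DataFrame({
    "City": ["MIAMI", "MIAMI", "DENVER"],
    "Day": ["Monday", "Monday", "Monday"],
    "Time": ["00:10:00", "00:20:00", "00:00:00"],
    "COUNT": [3, 4, 9],
})
bins = pd.Index(["00:00:00", "00:10:00", "00:20:00"], name="Time")

pieces = []
for (city, day), grp in small.groupby(["City", "Day"]):
    # Reindex this group's counts onto the full set of bins; gaps become 0.
    counts = grp.set_index("Time")["COUNT"].reindex(bins, fill_value=0)
    # Put the group key back as ordinary columns.
    pieces.append(counts.reset_index().assign(City=city, Day=day))

filled = pd.concat(pieces, ignore_index=True)
print(filled)
```

Taking City and Day from the group key avoids the case where the first bins of a group are missing and there is no earlier row to forward-fill from.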