Can I create dummy variables in pandas using a date index?

Time: 2017-08-21 16:10:15

Tags: python pandas indexing dummy-variable

I have been searching for a way to create dummy variables in pandas from a date index, but couldn't find anything.

I have a df indexed by date:
                        dew    temp   
date
2010-01-02 00:00:00      129.0  -16     
2010-01-02 01:00:00      148.0  -15     
2010-01-02 02:00:00      159.0  -11     
2010-01-02 03:00:00      181.0   -7      
2010-01-02 04:00:00      138.0   -7   
...  

I know I can set date as a column using
df.reset_index(level=0, inplace=True)

and then create the dummy with something like this:

df['main_hours'] = np.where((df['date'] >= '2010-01-02 03:00:00') & (df['date'] <= '2010-01-02 05:00:00'), 1, 0)

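However, I would like to create the dummy variable directly from the date index, without turning date into a column. Does pandas have a way to do this? Any suggestions would be appreciated.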

3 Answers:

Answer 0 (score: 2)

IIUC:

df['main_hours'] = \
    np.where((df.index  >= '2010-01-02 03:00:00') & (df.index <= '2010-01-02 05:00:00'),
             1,
             0)

or:

In [8]: df['main_hours'] = \
            ((df.index >= '2010-01-02 03:00:00') & 
             (df.index <= '2010-01-02 05:00:00')).astype(int)

In [9]: df
Out[9]:
                       dew  temp  main_hours
date
2010-01-02 00:00:00  129.0   -16           0
2010-01-02 01:00:00  148.0   -15           0
2010-01-02 02:00:00  159.0   -11           0
2010-01-02 03:00:00  181.0    -7           1
2010-01-02 04:00:00  138.0    -7           1

Timing for a 50,000-row DF:

In [19]: df = pd.concat([df.reset_index()] * 10**4, ignore_index=True).set_index('date')

In [20]: pd.options.display.max_rows = 10

In [21]: df
Out[21]:
                       dew  temp
date
2010-01-02 00:00:00  129.0   -16
2010-01-02 01:00:00  148.0   -15
2010-01-02 02:00:00  159.0   -11
2010-01-02 03:00:00  181.0    -7
2010-01-02 04:00:00  138.0    -7
...                    ...   ...
2010-01-02 00:00:00  129.0   -16
2010-01-02 01:00:00  148.0   -15
2010-01-02 02:00:00  159.0   -11
2010-01-02 03:00:00  181.0    -7
2010-01-02 04:00:00  138.0    -7

[50000 rows x 2 columns]

In [22]: %timeit ((df.index  >= '2010-01-02 03:00:00') & (df.index <= '2010-01-02 05:00:00')).astype(int)
1.58 ms ± 125 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [23]: %timeit np.where((df.index  >= '2010-01-02 03:00:00') & (df.index <= '2010-01-02 05:00:00'), 1, 0)
1.52 ms ± 28.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [24]: df.shape
Out[24]: (50000, 2)
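
For anyone who wants to run the index-based approach end to end, here is a minimal, self-contained sketch. It rebuilds the five sample rows from the question; the `start`, `end`, and `main_hours_slice` names are only illustrative, and the label-slice variant at the end is an alternative that assumes the index is sorted.

```python
import pandas as pd

# Rebuild the sample frame from the question, indexed by timestamps.
idx = pd.DatetimeIndex(
    [
        "2010-01-02 00:00:00",
        "2010-01-02 01:00:00",
        "2010-01-02 02:00:00",
        "2010-01-02 03:00:00",
        "2010-01-02 04:00:00",
    ],
    name="date",
)
df = pd.DataFrame(
    {"dew": [129.0, 148.0, 159.0, 181.0, 138.0], "temp": [-16, -15, -11, -7, -7]},
    index=idx,
)

# Window bounds (illustrative names, not from the original post).
start = pd.Timestamp("2010-01-02 03:00:00")
end = pd.Timestamp("2010-01-02 05:00:00")

# Compare the index directly; the result is a boolean array cast to 0/1.
df["main_hours"] = ((df.index >= start) & (df.index <= end)).astype(int)

# Alternative, assuming a sorted index: label slices include both endpoints.
df["main_hours_slice"] = 0
df.loc[start:end, "main_hours_slice"] = 1

print(df)
```

Both columns should flag only the 03:00 and 04:00 rows, since 05:00 is not present in the sample data.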

Answer 1 (score: 2)

Or use between:

pd.Series(df.index).between('2010-01-02 03:00:00',  '2010-01-02 05:00:00', inclusive=True).astype(int)

Out[1567]: 
0    0
1    0
2    0
3    1
4    1
Name: date, dtype: int32
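
Two practical notes on this approach, sketched below under a few assumptions. First, recent pandas releases expect a string such as "both" for `inclusive` rather than `True`; leaving the argument out keeps the default, which includes both endpoints. Second, `pd.Series(df.index)` carries a fresh 0..n-1 index, so assigning it straight into df would align on those integer labels rather than on the dates; converting to a plain array sidesteps that. The sample frame and the `in_window` name are illustrative.

```python
import pandas as pd

# Sample frame shaped like the one in the question.
idx = pd.DatetimeIndex(
    [
        "2010-01-02 00:00:00",
        "2010-01-02 01:00:00",
        "2010-01-02 02:00:00",
        "2010-01-02 03:00:00",
        "2010-01-02 04:00:00",
    ],
    name="date",
)
df = pd.DataFrame({"dew": [129.0, 148.0, 159.0, 181.0, 138.0]}, index=idx)

# Default `inclusive` covers both endpoints, so it is omitted here.
in_window = pd.Series(df.index).between(
    pd.Timestamp("2010-01-02 03:00:00"),
    pd.Timestamp("2010-01-02 05:00:00"),
)

# in_window is indexed 0..4, not by date, so assign its values positionally.
df["main_hours"] = in_window.astype(int).to_numpy()

print(df)
```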

Answer 2 (score: 1)

df = df.assign(main_hours=0)
df.loc[df.between_time(start_time='3:00', end_time='5:00').index, 'main_hours'] = 1
>>> df
                     dew  temp  main_hours
2010-01-02 00:00:00  129   -16           0
2010-01-02 01:00:00  148   -15           0
2010-01-02 02:00:00  159   -11           0
2010-01-02 03:00:00  181    -7           1
2010-01-02 04:00:00  138    -7           1
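
One thing to keep in mind: between_time filters on the time of day only and ignores the date, so on data spanning several days it flags the 03:00 to 05:00 window on every day. The sketch below uses made-up rows over two days (not from the question) to contrast it with a comparison against full timestamps; the column names are illustrative.

```python
import pandas as pd

# Hypothetical data covering two different days.
idx = pd.DatetimeIndex(
    [
        "2010-01-02 02:00:00",
        "2010-01-02 04:00:00",
        "2010-01-03 04:00:00",
        "2010-01-03 06:00:00",
    ],
    name="date",
)
df = pd.DataFrame({"temp": [-11, -7, -3, 0]}, index=idx)

# Time-of-day filter: matches 04:00 on both days.
df["by_time_of_day"] = 0
df.loc[df.between_time("3:00", "5:00").index, "by_time_of_day"] = 1

# Full-timestamp comparison: matches only the window on 2010-01-02.
start = pd.Timestamp("2010-01-02 03:00:00")
end = pd.Timestamp("2010-01-02 05:00:00")
df["by_timestamp"] = ((df.index >= start) & (df.index <= end)).astype(int)

print(df)
```

Which behaviour is wanted depends on whether "main hours" means a recurring daily window or one specific interval.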