子集数据帧熊猫时间序列

时间:2014-04-28 16:57:40

标签: python pandas time-series subset

**根据提供的答案更新了代码** 实施的解决方案不是对原始数据帧进行子集化。



    In [1]: thresh_eval.head()

    Out[1]:
        WDIR    WSPD    GDR     GST     GTIME
    TX_DTTM                     
    2010-01-01 05:50:00     235     10.9    238     13.4    540
    2010-01-02 00:20:00     329     10.6    NaN     NaN     NaN
    2010-01-02 00:30:00     329     10.8    NaN     NaN     NaN
    2010-01-02 00:40:00     329     12.1    NaN     NaN     NaN
    2010-01-02 00:50:00     332     12.2    330     14.8    46

    In [2]: len(thresh_eval)

    Out[2]: 5503

    In [3]: unique(thresh_eval.index.date)

    Out[3]:

    array([datetime.date(2010, 1, 1), datetime.date(2010, 1, 2),
           datetime.date(2010, 1, 3), datetime.date(2010, 1, 4),
           datetime.date(2010, 1, 6), datetime.date(2010, 1, 8),
           datetime.date(2010, 1, 9), datetime.date(2010, 1, 12),
           datetime.date(2010, 1, 16), datetime.date(2010, 1, 17),
           datetime.date(2010, 1, 18), datetime.date(2010, 1, 21),
           datetime.date(2010, 1, 22), datetime.date(2010, 1, 23),
           datetime.date(2010, 1, 24), datetime.date(2010, 1, 25),
           datetime.date(2010, 1, 26), datetime.date(2010, 1, 27),
           datetime.date(2010, 1, 29), datetime.date(2010, 1, 30),
           datetime.date(2010, 1, 31), datetime.date(2010, 2, 1),
           datetime.date(2010, 2, 2), datetime.date(2010, 2, 3),
           datetime.date(2010, 2, 4), datetime.date(2010, 2, 5),
           datetime.date(2010, 2, 6), datetime.date(2010, 2, 7),
           datetime.date(2010, 2, 9), datetime.date(2010, 2, 10),
           datetime.date(2010, 2, 11), datetime.date(2010, 2, 12),
           datetime.date(2010, 2, 13), datetime.date(2010, 2, 14),
           datetime.date(2010, 2, 15), datetime.date(2010, 2, 16),
           datetime.date(2010, 2, 17), datetime.date(2010, 2, 18),
           datetime.date(2010, 2, 22), datetime.date(2010, 2, 25),
           datetime.date(2010, 2, 26), datetime.date(2010, 2, 27),
           datetime.date(2010, 2, 28), datetime.date(2010, 3, 2),
           datetime.date(2010, 3, 3), datetime.date(2010, 3, 12),
           datetime.date(2010, 3, 13), datetime.date(2010, 3, 14),
           datetime.date(2010, 3, 15), datetime.date(2010, 3, 18),
           datetime.date(2010, 3, 21), datetime.date(2010, 3, 22),
           datetime.date(2010, 3, 23), datetime.date(2010, 3, 26),
           datetime.date(2010, 3, 27), datetime.date(2010, 3, 28),
           datetime.date(2010, 3, 29), datetime.date(2010, 3, 30),
           datetime.date(2010, 4, 9), datetime.date(2010, 4, 17),
           datetime.date(2010, 4, 18), datetime.date(2010, 4, 25),
           datetime.date(2010, 4, 26), datetime.date(2010, 4, 27),
           datetime.date(2010, 4, 28), datetime.date(2010, 5, 3),
           datetime.date(2010, 5, 8), datetime.date(2010, 5, 9),
           datetime.date(2010, 5, 17), datetime.date(2010, 5, 24),
           datetime.date(2010, 5, 25), datetime.date(2010, 5, 26),
           datetime.date(2010, 6, 2), datetime.date(2010, 6, 3),
           datetime.date(2010, 6, 6), datetime.date(2010, 6, 7),
           datetime.date(2010, 6, 16), datetime.date(2010, 6, 28),
           datetime.date(2010, 7, 2), datetime.date(2010, 7, 3),
           datetime.date(2010, 7, 10), datetime.date(2010, 7, 16),
           datetime.date(2010, 7, 22), datetime.date(2010, 7, 26),
           datetime.date(2010, 7, 28), datetime.date(2010, 7, 30),
           datetime.date(2010, 8, 1), datetime.date(2010, 8, 7),
           datetime.date(2010, 8, 23), datetime.date(2010, 8, 24),
           datetime.date(2010, 9, 2), datetime.date(2010, 9, 12),
           datetime.date(2010, 9, 27), datetime.date(2010, 9, 29),
           datetime.date(2010, 9, 30), datetime.date(2010, 10, 2),
           datetime.date(2010, 10, 3), datetime.date(2010, 10, 15),
           datetime.date(2010, 10, 16), datetime.date(2010, 10, 25),
           datetime.date(2010, 10, 26), datetime.date(2010, 10, 27),
           datetime.date(2010, 10, 29), datetime.date(2010, 11, 2),
           datetime.date(2010, 11, 3), datetime.date(2010, 11, 4),
           datetime.date(2010, 11, 5), datetime.date(2010, 11, 6),
           datetime.date(2010, 11, 7), datetime.date(2010, 11, 9),
           datetime.date(2010, 11, 12), datetime.date(2010, 11, 16),
           datetime.date(2010, 11, 17), datetime.date(2010, 11, 26),
           datetime.date(2010, 11, 27), datetime.date(2010, 11, 28),
           datetime.date(2010, 11, 29), datetime.date(2010, 11, 30),
           datetime.date(2010, 12, 1), datetime.date(2010, 12, 2),
           datetime.date(2010, 12, 4), datetime.date(2010, 12, 5),
           datetime.date(2010, 12, 6), datetime.date(2010, 12, 7),
           datetime.date(2010, 12, 11), datetime.date(2010, 12, 12),
           datetime.date(2010, 12, 13), datetime.date(2010, 12, 14),
           datetime.date(2010, 12, 16), datetime.date(2010, 12, 17),
           datetime.date(2010, 12, 18), datetime.date(2010, 12, 19),
           datetime.date(2010, 12, 20), datetime.date(2010, 12, 22),
           datetime.date(2010, 12, 23), datetime.date(2010, 12, 24),
           datetime.date(2010, 12, 26), datetime.date(2010, 12, 27),
           datetime.date(2010, 12, 28)], dtype=object)

    In [4]: ais.head()

    Out[4]:
        MMSI    LAT     LON     COURSE_OVER_GROUND  NAV_STATUS  POS_ACCURACY    RATE_OF_TURN    SPEED_OVER_GROUND   HEADING
    TX_DTTM                                     
    2010-01-01 00:00:19     12345678    32.834746   -79.929589  1820    0   0   128     71  NaN
    2010-01-01 00:00:29     12345678    32.834384   -79.929602  1832    0   0   128     71  NaN
    2010-01-01 00:00:40     12345678    32.834058   -79.929619  1836    0   0   128     70  NaN
    2010-01-01 00:00:50     12345678    32.833703   -79.929647  1847    0   0   128     70  NaN
    2010-01-01 00:01:00     12345678    32.833386   -79.929689  1897    0   0   128     69  NaN

    In [5]: unique(ais.index.date)

    Out[5]:

    array([datetime.date(2010, 1, 1), datetime.date(2010, 1, 4),
           datetime.date(2010, 1, 5), datetime.date(2010, 1, 6),
           datetime.date(2010, 1, 7), datetime.date(2010, 1, 8),
           datetime.date(2010, 1, 9), datetime.date(2010, 1, 10),
           datetime.date(2010, 1, 11), datetime.date(2010, 1, 12),
           datetime.date(2010, 1, 13), datetime.date(2010, 1, 14),
           datetime.date(2010, 1, 15), datetime.date(2010, 1, 16),
           datetime.date(2010, 1, 17), datetime.date(2010, 1, 18),
           datetime.date(2010, 1, 19), datetime.date(2010, 1, 20),
           datetime.date(2010, 1, 21), datetime.date(2010, 1, 22),
           datetime.date(2010, 1, 23), datetime.date(2010, 1, 24),
           datetime.date(2010, 1, 25), datetime.date(2010, 1, 26),
           datetime.date(2010, 1, 27), datetime.date(2010, 1, 28),
           datetime.date(2010, 1, 29), datetime.date(2010, 1, 30),
           datetime.date(2010, 1, 31), datetime.date(2010, 2, 1)], dtype=object)

    In [6]: len(ais)

    Out[6]: 2750499

    In [7]: ais[Index(ais.index.date).isin(Index(thresh_eval.index.date))]

    Out[7]:
        MMSI    LAT     LON     COURSE_OVER_GROUND  NAV_STATUS  POS_ACCURACY    RATE_OF_TURN    SPEED_OVER_GROUND   HEADING
    TX_DTTM                                     
    2010-01-01 00:00:19     12345678    32.834746   -79.929589  1820    0   0   128     71  NaN
    2010-01-01 00:00:29     12345678    32.834384   -79.929602  1832    0   0   128     71  NaN
    2010-01-01 00:00:40     12345678    32.834058   -79.929619  1836    0   0   128     70  NaN
    2010-01-01 00:00:50     12345678    32.833703   -79.929647  1847    0   0   128     70  NaN
    2010-01-01 00:01:00     12345678    32.833386   -79.929689  1897    0   0   128     69  NaN
    2010-01-01 00:01:06     12345678    32.833106   -79.929757  1934    0   0   128     69  NaN
    2010-01-01 00:01:16     12345678    32.832830   -79.929850  1978    0   0   128     69  NaN
    2010-01-01 00:01:26     12345678    32.832495   -79.929990  2010    0   0   128     69  NaN

    In [8]: len(ais)

    Out[8]: 2750499

    In [9]: unique(ais.index.date)

    Out[9]:

    array([datetime.date(2010, 1, 1), datetime.date(2010, 1, 4),
           datetime.date(2010, 1, 5), datetime.date(2010, 1, 6),
           datetime.date(2010, 1, 7), datetime.date(2010, 1, 8),
           datetime.date(2010, 1, 9), datetime.date(2010, 1, 10),
           datetime.date(2010, 1, 11), datetime.date(2010, 1, 12),
           datetime.date(2010, 1, 13), datetime.date(2010, 1, 14),
           datetime.date(2010, 1, 15), datetime.date(2010, 1, 16),
           datetime.date(2010, 1, 17), datetime.date(2010, 1, 18),
           datetime.date(2010, 1, 19), datetime.date(2010, 1, 20),
           datetime.date(2010, 1, 21), datetime.date(2010, 1, 22),
           datetime.date(2010, 1, 23), datetime.date(2010, 1, 24),
           datetime.date(2010, 1, 25), datetime.date(2010, 1, 26),
           datetime.date(2010, 1, 27), datetime.date(2010, 1, 28),
           datetime.date(2010, 1, 29), datetime.date(2010, 1, 30),
           datetime.date(2010, 1, 31), datetime.date(2010, 2, 1)], dtype=object)

**原始问题:** 我正在尝试基于其日期时间索引与另一个数据帧的日期时间索引之间的比较来对数据帧进行子集化。 df1是用作过滤器的下采样时间序列的数据帧。 df2是要过滤的记录的数据帧,具有更高的时间分辨率,并且每个日期的多个记录出现在df1中:

In [1]: df1
    Out[1]:
                 WSPD        cd
    date                           
    2010-07-10  11.325645  0.000019
    2010-08-23  12.258462  0.000019
    2010-11-09  10.771429  0.000019
    2010-11-12  10.650000  0.000019
    2010-11-16  11.939535  0.000019
    ...

    In [2]: df2
    Out[2]:
                             ID   Latitude  Longitude  Course  RateOfTurn  
    TimeStamp                                                                  
    2010-06-26 22:36:11  311425000  32.832500 -79.929000       3           0   
    2010-06-26 22:36:21  311425000  32.832845 -79.929037       3           0   
    2010-06-26 22:36:32  311425000  32.833333 -79.929000       3           0   
    2010-06-26 22:36:42  311425000  32.833666 -79.929000       3           0 
    2010-07-10 07:37:21  548723000  32.832333 -79.929000     1.0           0   
    2010-07-10 07:37:31  548723000  32.832666 -79.929000     1.0           0   
    2010-07-10 07:37:40  548723000  32.833000 -79.929000     2.0           0   
    2010-07-10 07:37:51  548723000  32.833333 -79.929000     1.0           0   
    2010-07-10 07:38:04  548723000  32.833666 -79.929000     0.0           0   
    2010-08-23 09:29:48  311425000  32.832590 -79.928985     0.0           0   
    2010-08-23 09:30:00  311425000  32.833053 -79.928970     1.0           0   
    2010-08-23 09:30:10  311425000  32.833443 -79.928957     1.0           0   
    2010-08-23 09:30:18  311425000  32.833746 -79.928944     2.0           0   
    ...

    In [3]: list = []
            for i,item in enumerate(df2.index.date): 
                if item in df1.index.date:
                    list.append(item)

    In [4]: list
    out[4]: [datetime.date(2010, 8, 23),
     datetime.date(2010, 8, 23),
     datetime.date(2010, 8, 23),
     datetime.date(2010, 8, 23),
     datetime.date(2010, 7, 10),
     datetime.date(2010, 7, 10),
     datetime.date(2010, 7, 10),
     datetime.date(2010, 7, 10),
     datetime.date(2010, 7, 10)]

我正在丢失索引之外的内容。我真的很喜欢来自df2的记录子集,包括所有数据,其日期时间与日间频率的df1相匹配,如:


    2010-07-10 07:37:21  548723000  32.832333 -79.929000     1.0           0   
    2010-07-10 07:37:31  548723000  32.832666 -79.929000     1.0           0   
    2010-07-10 07:37:40  548723000  32.833000 -79.929000     2.0           0   
    2010-07-10 07:37:51  548723000  32.833333 -79.929000     1.0           0   
    2010-07-10 07:38:04  548723000  32.833666 -79.929000     0.0           0   
    2010-08-23 09:29:48  311425000  32.832590 -79.928985     0.0           0   
    2010-08-23 09:30:00  311425000  32.833053 -79.928970     1.0           0   
    2010-08-23 09:30:10  311425000  32.833443 -79.928957     1.0           0   
    2010-08-23 09:30:18  311425000  32.833746 -79.928944     2.0           0   

任何帮助将不胜感激!

1 个答案:

答案 0 :(得分:2)

使用isin方法:

In [33]: import datetime

In [34]: import pandas as pd

In [35]: from pandas import DataFrame, Index

In [36]: from numpy.random import randn, unique, array

In [37]: df1 = DataFrame({'lat': randn(48), 'long': randn(48)}, index=pd.date_range('2013-01-02',periods=4
8,freq='H'))

In [38]: df2 = DataFrame({'lat': randn(72), 'long': randn(72)}, index=pd.date_range('2013-01-02',periods=7
2,freq='H'))

In [39]: df1.head()
Out[39]:
                        lat    long
2013-01-02 00:00:00  0.7310  0.3083
2013-01-02 01:00:00  1.8540  0.7355
2013-01-02 02:00:00  0.3097 -0.1834
2013-01-02 03:00:00  0.8455  0.8350
2013-01-02 04:00:00  0.4017  0.0559

[5 rows x 2 columns]

In [40]: df2.head()
Out[40]:
                        lat    long
2013-01-02 00:00:00  1.4248  0.2289
2013-01-02 01:00:00 -0.5055  0.1072
2013-01-02 02:00:00 -1.8265 -1.0651
2013-01-02 03:00:00  0.5888  0.3992
2013-01-02 04:00:00 -1.5210  0.0710

[5 rows x 2 columns]

In [41]: df2[Index(df2.index.date).isin(Index(df1.index.date))]
Out[41]:
                        lat    long
2013-01-02 00:00:00  1.4248  0.2289
2013-01-02 01:00:00 -0.5055  0.1072
2013-01-02 02:00:00 -1.8265 -1.0651
2013-01-02 03:00:00  0.5888  0.3992
2013-01-02 04:00:00 -1.5210  0.0710
2013-01-02 05:00:00  0.8382 -1.5569
2013-01-02 06:00:00 -0.7878  0.9253
2013-01-02 07:00:00 -0.1686 -1.0128
2013-01-02 08:00:00 -0.2481 -0.4247
2013-01-02 09:00:00  0.0794 -0.1947
2013-01-02 10:00:00 -0.5046 -0.1535
2013-01-02 11:00:00  0.0696 -1.5125
2013-01-02 12:00:00  1.1984 -0.1880
2013-01-02 13:00:00  0.8251 -0.2588
2013-01-02 14:00:00  1.5858 -1.2998
2013-01-02 15:00:00  0.2727 -0.3030
2013-01-02 16:00:00  0.9459 -0.8018
2013-01-02 17:00:00 -1.5055 -1.1344
2013-01-02 18:00:00  0.3970  0.7449
2013-01-02 19:00:00 -1.0256  0.2245
2013-01-02 20:00:00  0.8322  0.6473
2013-01-02 21:00:00  0.2759  1.4096
2013-01-02 22:00:00 -0.5167  1.5676
2013-01-02 23:00:00  0.4620  0.4936
2013-01-03 00:00:00  1.4400  0.5696
                        ...     ...

[48 rows x 2 columns]

您可以通过比较

来检查结果是否仅包含它们在日间频率上重叠的日期索引
In [42]: unique(df2[Index(df2.index.date).isin(Index(df1.index.date))].index.date)
Out[42]: array([datetime.date(2013, 1, 2), datetime.date(2013, 1, 3)], dtype=object)

In [43]: unique(df1.index.date)
Out[43]: array([datetime.date(2013, 1, 2), datetime.date(2013, 1, 3)], dtype=object)