Using the loc function to find differences between dates in pandas

Date: 2017-12-28 15:44:25

Tags: python pandas dataframe data-analysis

I have this DataFrame:

                     open      high       low     close      volume
TimeStamp                                                              
2017-12-22 13:15:00  12935.00  13200.00  12508.71  12514.91  244.728611
2017-12-22 13:30:00  12514.91  12999.99  12508.71  12666.34  150.457869
2017-12-22 13:45:00  12666.33  12899.97  12094.00  12094.00  198.680014
2017-12-22 14:00:00  12094.01  12354.99  11150.00  11150.00  256.812634
2017-12-22 14:15:00  11150.01  12510.00  10400.00  12276.33  262.217127

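I want to know whether each row is 15 minutes apart from the next one, so I built a new column that holds a shifted copy of the first column (the timestamp index):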

                         open      high       low     close      volume  \
TimeStamp                                                                 
2017-12-20 13:30:00  17503.98  17600.00  17100.57  17119.89  312.773644   
2017-12-20 13:45:00  17119.89  17372.98  17049.00  17170.00  322.953671   
2017-12-20 14:00:00  17170.00  17573.00  17170.00  17395.74  236.085829   
2017-12-20 14:15:00  17395.74  17398.00  17200.01  17280.00  220.467382   
2017-12-20 14:30:00  17280.00  17313.94  17150.00  17256.05  222.760598   

                                new_time  
TimeStamp                                 
2017-12-20 13:30:00  2017-12-20 13:45:00  
2017-12-20 13:45:00  2017-12-20 14:00:00  
2017-12-20 14:00:00  2017-12-20 14:15:00  
2017-12-20 14:15:00  2017-12-20 14:30:00  
2017-12-20 14:30:00  2017-12-20 14:45:00  

Now I want to find every row that does not follow the 15-minute rule, so I ran:

dfh.loc[(dfh['new_time'].to_pydatetime()-dfh.index.to_pydatetime())>datetime.timedelta(0, 900)]

I get this error:

Traceback (most recent call last):
  File "<pyshell#252>", line 1, in <module>
    dfh.loc[(dfh['new_time'].to_pydatetime()-dfh.index.to_pydatetime())>datetime.timedelta(0, 900)]
  File "C:\Users\Araujo\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\core\generic.py", line 3614, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'Series' object has no attribute 'to_pydatetime'

Is there a way to do this?

Edit:

Shift only works with a periodic index. Is there any way to do this with a non-periodic one?

2 answers:

Answer 0 (score: 1)

This works:

import pandas as pd
import numpy as np
import datetime as dt

data = [            
['2017-12-22 13:15:00',  12935.00,  13200.00,  12508.71,  12514.91,  244.728611],
['2017-12-22 13:30:00',  12514.91,  12999.99,  12508.71,  12666.34,  150.457869],
['2017-12-22 13:45:00',  12666.33,  12899.97,  12094.00,  12094.00,  198.680014],
['2017-12-22 14:00:00',  12094.01,  12354.99,  11150.00,  11150.00,  256.812634],
['2017-12-22 14:15:00',  11150.01,  12510.00,  10400.00,  12276.33,  262.217127]
]

df = pd.DataFrame(data, columns = ['Timestamp', 'open', 'high', 'low', 'close', 'volume'])

df['Timestamp'] = pd.to_datetime(df['Timestamp'])

df['plus_15'] = df['Timestamp'].shift(1) + dt.timedelta(minutes = 15)

df['valid_time'] = np.where((df['Timestamp'] == df['plus_15']) | (df.index == 0), 1, 0)

print(df[['Timestamp', 'valid_time']])

#output
            Timestamp  valid_time
0 2017-12-22 13:15:00           1
1 2017-12-22 13:30:00           1
2 2017-12-22 13:45:00           1
3 2017-12-22 14:00:00           1
4 2017-12-22 14:15:00           1

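So: create a new column, plus_15, that takes the previous row's timestamp and adds 15 minutes. Then create another column, valid_time, that compares the Timestamp column with the plus_15 column and marks the row 1 when they are equal and 0 when they are not. The first row has no previous timestamp, so the df.index == 0 condition marks it as valid.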
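The question asks for the rows that break the rule rather than a 0/1 flag, so here is a minimal sketch of that variant. It is not the answerer's code: it swaps shift-and-add for Series.diff(), and the sample data (with the 13:45 bar removed to create a gap) and the names gap, step and violations are made up for illustration. Series.diff() works by position, so it should not need a regular frequency on the timestamps.

```python
import pandas as pd

# Hypothetical sample: the 13:45 bar is missing, so one gap is 30 minutes.
df = pd.DataFrame({
    "Timestamp": pd.to_datetime([
        "2017-12-22 13:15:00",
        "2017-12-22 13:30:00",
        "2017-12-22 14:00:00",
        "2017-12-22 14:15:00",
    ]),
    "close": [12514.91, 12666.34, 11150.00, 12276.33],
})

step = pd.Timedelta(minutes=15)

# Elapsed time since the previous row; the first row has no predecessor (NaT).
df["gap"] = df["Timestamp"].diff()

# A row breaks the rule when its gap exists and is not exactly 15 minutes.
violations = df[df["gap"].notna() & (df["gap"] != step)]
print(violations)
```

With this sample only the 14:00 row is selected, because it comes 30 minutes after the row before it. If the timestamps live in the index, as in the question, df.index.to_series().diff() should give the same gaps.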

Answer 1 (score: 0)

Could we do it this way?

import io

import numpy as np
import pandas as pd

data = '''\
TimeStamp            open      high       low     close      volume
2017-12-22T13:15:00  12935.00  13200.00  12508.71  12514.91  244.728611
2017-12-22T13:30:00  12514.91  12999.99  12508.71  12666.34  150.457869
2017-12-22T13:45:00  12666.33  12899.97  12094.00  12094.00  198.680014
2017-12-22T14:00:00  12094.01  12354.99  11150.00  11150.00  256.812634
2017-12-22T14:15:00  11150.01  12510.00  10400.00  12276.33  262.217127'''

# io.StringIO replaces pd.compat.StringIO, which newer pandas versions no longer provide
df = pd.read_csv(io.StringIO(data),
                 sep=r'\s+', parse_dates=['TimeStamp'], index_col=['TimeStamp'])

# timestamp of the next row; the last row has no successor, so it gets NaT
df['new_time'] = df.index[1:].tolist() + [pd.NaT]
# df['new_time'] = np.roll(df.index, -1)  # if last is not first+15min

# use boolean indexing to filter away unwanted rows
df[[(dt2-dt1)/np.timedelta64(1, 's') == 900 
    for dt1,dt2 in zip(df.index.values,df.new_time.values)]]
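The same look-ahead comparison can probably be written without the Python-level loop. The sketch below is an alternative, not the answerer's code: it subtracts the index from itself offset by one position, and the sample data (one bar removed) and the names step, gap_to_next, keep, regular and irregular are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical sample indexed by timestamp; the 13:45 bar is missing.
idx = pd.to_datetime([
    "2017-12-22 13:15:00",
    "2017-12-22 13:30:00",
    "2017-12-22 14:00:00",
    "2017-12-22 14:15:00",
])
df = pd.DataFrame({"close": [12514.91, 12666.34, 11150.00, 12276.33]}, index=idx)
df.index.name = "TimeStamp"

step = pd.Timedelta(minutes=15)

# Time from each row to the next one (one element shorter than the index).
gap_to_next = df.index[1:] - df.index[:-1]

# The last row has no successor, so it is treated as not matching.
keep = np.append(gap_to_next == step, False)

regular = df[keep]      # rows followed exactly 15 minutes later by another row
irregular = df[~keep]   # rows that break the rule, plus the final row
print(irregular)
```

Treating the final row as a non-match mirrors the NaT placeholder in the answer above; whether that row should count as valid depends on the use case.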