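What is the best way to keep only the first row of each 60-second bin of data in pandas? That is, for every row that occurs at increasing time t, I want to delete all rows that occur up to t+60 seconds.
I know I can probably use some combination of groupby().first(), but the code examples I have seen (e.g. with pandas.Grouper(freq='60s')) discard the original datetimes and replace them with 60-second offsets from midnight rather than my original datetimes.
For example, the following: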
time value
0 2016-05-11 13:00:10.841015028 0.215978
1 2016-05-11 13:02:05.760595780 0.155666
2 2016-05-11 13:02:05.760903860 0.155666
3 2016-05-11 13:02:18.325613076 0.157788
4 2016-05-11 13:02:18.486519052 0.157788
5 2016-05-11 13:02:20.243748548 0.157788
6 2016-05-11 13:02:20.533101692 0.157788
7 2016-05-11 13:02:20.646061652 0.157788
8 2016-05-11 13:02:21.121409820 0.157788
9 2016-05-11 13:04:24.660609068 0.211649
10 2016-05-11 13:04:24.660845612 0.211649
11 2016-05-11 13:04:24.660957596 0.211649
12 2016-05-11 13:04:24.661378132 0.211649
13 2016-05-11 13:04:24.661450628 0.211649
14 2016-05-11 13:04:24.661607044 0.211649
should become this:
time value
0 2016-05-11 13:00:10.841015028 0.215978
1 2016-05-11 13:02:05.760595780 0.155666
3 2016-05-11 13:04:24.660609068 0.211649
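To make the Grouper behaviour described in the question concrete, here is a minimal sketch on four of the rows above. The variable names are illustrative; the point is only that the resulting labels are bin edges aligned to whole minutes, not the original timestamps.

```python
import pandas as pd

# Four rows taken from the example data in the question.
df = pd.DataFrame({
    "time": pd.to_datetime([
        "2016-05-11 13:00:10.841015028",
        "2016-05-11 13:02:05.760595780",
        "2016-05-11 13:02:18.325613076",
        "2016-05-11 13:04:24.660609068",
    ]),
    "value": [0.215978, 0.155666, 0.157788, 0.211649],
})

# Grouping with a 60 s Grouper labels each group by its bin edge.
binned = df.groupby(pd.Grouper(key="time", freq="60s")).first()

# The index now holds bin labels such as 13:00:00, not 13:00:10.841015028.
print(binned.index[0])
```

None of the original timestamps appears in the result's index, which is the loss of information the question wants to avoid.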
Answer 0 (score: 3)
UPDATE: thanks to @piRSquared, who noticed that my previous solution was incorrect. Here is another attempt:
Data:
In [8]: df = pd.DataFrame(dict(time=pd.date_range('2001-01-01', periods=20, freq='9S'), value=np.random.rand(20)))
In [9]: df
Out[9]:
time value
0 2001-01-01 00:00:00 0.440696
1 2001-01-01 00:00:09 0.135540
2 2001-01-01 00:00:18 0.008243
3 2001-01-01 00:00:27 0.389259
4 2001-01-01 00:00:36 0.128253
5 2001-01-01 00:00:45 0.566704
6 2001-01-01 00:00:54 0.386797
7 2001-01-01 00:01:03 0.426411
8 2001-01-01 00:01:12 0.438114
9 2001-01-01 00:01:21 0.918711
10 2001-01-01 00:01:30 0.715565
11 2001-01-01 00:01:39 0.422044
12 2001-01-01 00:01:48 0.199396
13 2001-01-01 00:01:57 0.827872
14 2001-01-01 00:02:06 0.986887
15 2001-01-01 00:02:15 0.305749
16 2001-01-01 00:02:24 0.030092
17 2001-01-01 00:02:33 0.338214
18 2001-01-01 00:02:42 0.773635
19 2001-01-01 00:02:51 0.816478
Solution:
In [10]: df.groupby((df.time - df.loc[0, 'time']).dt.total_seconds() // 60, as_index=False).first()
Out[10]:
time value
0 2001-01-01 00:00:00 0.440696
1 2001-01-01 00:01:03 0.426411
2 2001-01-01 00:02:06 0.986887
Explanation:
In [17]: (df.time - df.loc[0, 'time']).dt.total_seconds()
Out[17]:
0 0.0
1 9.0
2 18.0
3 27.0
4 36.0
5 45.0
6 54.0
7 63.0
8 72.0
9 81.0
10 90.0
11 99.0
12 108.0
13 117.0
14 126.0
15 135.0
16 144.0
17 153.0
18 162.0
19 171.0
Name: time, dtype: float64
In [18]: (df.time - df.loc[0, 'time']).dt.total_seconds() // 60
Out[18]:
0 -0.0
1 0.0
2 0.0
3 0.0
4 0.0
5 0.0
6 0.0
7 1.0
8 1.0
9 1.0
10 1.0
11 1.0
12 1.0
13 1.0
14 2.0
15 2.0
16 2.0
17 2.0
18 2.0
19 2.0
Name: time, dtype: float64
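Note that this groups rows into fixed 60-second bins measured from the first timestamp, which may not always match dropping every row within 60 seconds of the last kept row. A small sketch on four rows from the question's data illustrates the difference (the name `bins` and the key name `bin` are illustrative, and the key is renamed only to keep it distinct from the `time` column):

```python
import pandas as pd

df = pd.DataFrame({
    "time": pd.to_datetime([
        "2016-05-11 13:00:10.841015028",
        "2016-05-11 13:02:05.760595780",
        "2016-05-11 13:02:18.325613076",
        "2016-05-11 13:04:24.660609068",
    ]),
    "value": [0.215978, 0.155666, 0.157788, 0.211649],
})

# Seconds elapsed since the first row, floored into 60 s bins.
elapsed = (df["time"] - df["time"].iloc[0]).dt.total_seconds()
bins = (elapsed // 60).rename("bin")  # "bin" is an illustrative name

fixed = df.groupby(bins).first()

# Rows 1 and 2 are under 13 s apart but fall on either side of the
# 120 s boundary, so they land in different bins and both survive.
print(bins.tolist())
print(len(fixed))
```

Under the rule stated in the question, the 13:02:18 row would be dropped because it lies within 60 seconds of the 13:02:05 row.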
OLD incorrect answer:
In [102]: df[df.time.diff().fillna(pd.Timedelta('60S')) >= pd.Timedelta('60S')]
Out[102]:
time value
0 2016-05-11 13:00:10.841015028 0.215978
1 2016-05-11 13:02:05.760595780 0.155666
9 2016-05-11 13:04:24.660609068 0.211649
Explanation:
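A minimal sketch of where the diff-based filter appears to fall short, assuming evenly spaced 9-second data like the test frame used earlier in this answer: every consecutive gap is under 60 seconds, so only the first row passes the filter.

```python
import pandas as pd

# Same shape as the 9-second test data (values omitted).
df = pd.DataFrame({"time": pd.date_range("2001-01-01", periods=20, freq="9s")})

# Gap to the previous row; the first row gets a 60 s gap so it is kept.
gap = df["time"].diff().fillna(pd.Timedelta("60s"))
kept = df[gap >= pd.Timedelta("60s")]

# Every later gap is 9 s, so only row 0 survives, although rows 7 and 14
# each fall more than 60 s after the previous row that should be kept.
print(kept.index.tolist())
```

The filter compares each row with its immediate predecessor, not with the last row that was kept, which is why it only works when the data arrive in well-separated bursts.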
Answer 1 (score: 2)
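import numpy as np
import pandas as pd

def td60(ta):
    # 6e10 ns = 60 s, assuming ta holds datetime64[ns] values
    d = np.timedelta64(int(6e10))
    tp = ta + d  # tp[k] is the end of the 60 s window opened by row k
    j = 0
    yield j  # the first row is always kept
    for i, tx in enumerate(ta):
        if tx > tp[j]:
            yield i
            j = i

def pir(df):
    slc = list(td60(df.time.values))
    return pd.DataFrame(df.values[slc], df.index[slc])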
Example usage
pir(df)
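For readers who want to run the scan on the question's data directly, here is a self-contained sketch of the same idea. The function name `keep_first_per_window` and its parameters are hypothetical, and it returns `df.iloc[...]` rather than rebuilding the frame, a variation that keeps column names and dtypes.

```python
import pandas as pd

def keep_first_per_window(df, col="time", window="60s"):
    """Keep a row only if it is more than `window` after the last kept row.

    Hypothetical helper; assumes df[col] is sorted in increasing order.
    """
    if len(df) == 0:
        return df
    times = df[col].to_numpy()
    width = pd.Timedelta(window).to_timedelta64()
    keep = [0]  # the first row is always kept
    for i in range(1, len(times)):
        if times[i] > times[keep[-1]] + width:
            keep.append(i)
    return df.iloc[keep]

# Six rows taken from the example data in the question.
df = pd.DataFrame({
    "time": pd.to_datetime([
        "2016-05-11 13:00:10.841015028",
        "2016-05-11 13:02:05.760595780",
        "2016-05-11 13:02:18.325613076",
        "2016-05-11 13:02:21.121409820",
        "2016-05-11 13:04:24.660609068",
        "2016-05-11 13:04:24.660845612",
    ]),
    "value": [0.215978, 0.155666, 0.157788, 0.157788, 0.211649, 0.211649],
})

result = keep_first_per_window(df)
print(result)
```

The original timestamps and index labels are preserved, because rows are selected rather than aggregated.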
pop_n, smp_n = 1000000, 500000
np.random.seed([3,1415])
tidx = pd.date_range('2016-09-08', periods=pop_n, freq='5s')
tidx = np.random.choice(tidx, smp_n, False)
tidx = pd.to_datetime(tidx).sort_values()
df = pd.DataFrame(dict(time=tidx, value=np.random.rand(smp_n)))
Cythonize
In Jupyter
%load_ext Cython
%%cython
import numpy as np
import pandas as pd
def td60(ta):
    d = np.timedelta64(int(6e10))
    tp = ta + d
    j = 0
    yield j
    for i, tx in enumerate(ta):
        if tx > tp[j]:
            yield i
            j = i

def pir(df):
    slc = list(td60(df.time.values))
    return pd.DataFrame(df.values[slc], df.index[slc])
After Cythonizing
Not much difference
from io import StringIO  # on Python 2: from StringIO import StringIO
import pandas as pd
text = """time,value
2016-05-11 13:00:10.841015028,0.215978
2016-05-11 13:02:05.760595780,0.155666
2016-05-11 13:02:05.760903860,0.155666
2016-05-11 13:02:18.325613076,0.157788
2016-05-11 13:02:18.486519052,0.157788
2016-05-11 13:02:20.243748548,0.157788
2016-05-11 13:02:20.533101692,0.157788
2016-05-11 13:02:20.646061652,0.157788
2016-05-11 13:02:21.121409820,0.157788
2016-05-11 13:04:24.660609068,0.211649
2016-05-11 13:04:24.660845612,0.211649
2016-05-11 13:04:24.660957596,0.211649
2016-05-11 13:04:24.661378132,0.211649
2016-05-11 13:04:24.661450628,0.211649
2016-05-11 13:04:24.661607044,0.211649"""
df = pd.read_csv(StringIO(text), parse_dates=[0])