I have a large dataframe indexed by timestamp, and I want to assign rows to groups based on time ranges. For example, in the data below, rows are grouped together when they fall within 1 millisecond of the first entry in the group.
groupid
1999-12-31 23:59:59.000107 1
1999-12-31 23:59:59.000385 1
1999-12-31 23:59:59.000404 1
1999-12-31 23:59:59.000704 1
1999-12-31 23:59:59.001281 2
1999-12-31 23:59:59.002211 2
1999-12-31 23:59:59.002367 3
I have working code that does this by iterating over the rows and using the current row to slice the dataframe:
from datetime import datetime, timedelta

import numpy as np
import pandas as pd

# 1000 random timestamps inside the last second of 1999
dts = sorted([datetime(1999, 12, 31, 23, 59, 59, x)
              for x in np.random.randint(1, 999999, 1000)])
df = pd.DataFrame({'groupid': None}, index=dts)
print(df.head(20))

groupid = 1
for dt, row in df.iterrows():
    if df.loc[row.name].groupid:   # row already assigned to a group
        continue
    end = dt + timedelta(milliseconds=1)
    group = df.loc[dt:end]         # all rows within 1 ms of this one
    df.loc[group.index, 'groupid'] = groupid
    groupid += 1
print(df.head(20))
However, as is typical of iterrows, this is slow on a large dataframe. I have tried applying a function and using groupby, without success. Is switching to itertuples the best performance gain I can get (I'm about to try it)? Can anyone offer advice?
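For reference, a minimal sketch of the itertuples variant mentioned above (an assumption about what that rewrite would look like; it is the same Python-level loop with a cheaper row iterator, so at best a constant-factor speedup):

# Sketch of the itertuples variant; itertuples yields namedtuples whose
# index value is exposed as the .Index attribute.
groupid = 1
for row in df.itertuples():
    dt = row.Index
    if df.loc[dt, 'groupid']:      # already assigned
        continue
    end = dt + timedelta(milliseconds=1)
    df.loc[df.loc[dt:end].index, 'groupid'] = groupid
    groupid += 1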
Answer 0 (score: 3)
OK, I think the following is what you want. It constructs a timedelta from your index by subtracting the first value from every value. We then access the microseconds component, divide by 1000, and cast the Series dtype to int:
In [86]:
df['groupid'] = ((df.index.to_series() - df.index[0]).dt.microseconds / 1000).astype(np.int32)
df
Out[86]:
groupid
1999-12-31 23:59:59.000133 0
1999-12-31 23:59:59.000584 0
1999-12-31 23:59:59.003544 3
1999-12-31 23:59:59.009193 9
1999-12-31 23:59:59.010220 10
1999-12-31 23:59:59.010632 10
1999-12-31 23:59:59.010716 10
1999-12-31 23:59:59.011387 11
1999-12-31 23:59:59.011837 11
1999-12-31 23:59:59.013277 13
1999-12-31 23:59:59.013305 13
1999-12-31 23:59:59.014754 14
1999-12-31 23:59:59.016015 15
1999-12-31 23:59:59.016067 15
1999-12-31 23:59:59.017788 17
1999-12-31 23:59:59.018236 18
1999-12-31 23:59:59.021281 21
1999-12-31 23:59:59.021772 21
1999-12-31 23:59:59.021927 21
1999-12-31 23:59:59.022200 22
1999-12-31 23:59:59.023104 22
1999-12-31 23:59:59.023375 23
1999-12-31 23:59:59.023688 23
1999-12-31 23:59:59.023726 23
1999-12-31 23:59:59.025397 25
1999-12-31 23:59:59.026407 26
1999-12-31 23:59:59.026480 26
1999-12-31 23:59:59.027825 27
1999-12-31 23:59:59.028793 28
1999-12-31 23:59:59.030716 30
... ...
1999-12-31 23:59:59.975432 975
1999-12-31 23:59:59.976699 976
1999-12-31 23:59:59.977177 977
1999-12-31 23:59:59.979475 979
1999-12-31 23:59:59.980282 980
1999-12-31 23:59:59.980672 980
1999-12-31 23:59:59.983202 983
1999-12-31 23:59:59.984214 984
1999-12-31 23:59:59.984674 984
1999-12-31 23:59:59.984933 984
1999-12-31 23:59:59.985664 985
1999-12-31 23:59:59.985779 985
1999-12-31 23:59:59.988812 988
1999-12-31 23:59:59.989324 989
1999-12-31 23:59:59.990386 990
1999-12-31 23:59:59.990485 990
1999-12-31 23:59:59.990969 990
1999-12-31 23:59:59.991255 991
1999-12-31 23:59:59.991739 991
1999-12-31 23:59:59.993979 993
1999-12-31 23:59:59.994705 994
1999-12-31 23:59:59.994874 994
1999-12-31 23:59:59.995397 995
1999-12-31 23:59:59.995753 995
1999-12-31 23:59:59.995863 995
1999-12-31 23:59:59.996574 996
1999-12-31 23:59:59.998139 998
1999-12-31 23:59:59.998533 998
1999-12-31 23:59:59.998778 998
1999-12-31 23:59:59.999915 999
Thanks to @Jeff for pointing out the cleaner method:
In [96]:
df['groupid'] = (df.index-df.index[0]).astype('timedelta64[ms]')
df
Out[96]:
groupid
1999-12-31 23:59:59.000884 0
1999-12-31 23:59:59.001175 0
1999-12-31 23:59:59.001262 0
1999-12-31 23:59:59.001540 0
1999-12-31 23:59:59.001769 0
1999-12-31 23:59:59.002478 1
1999-12-31 23:59:59.005001 4
1999-12-31 23:59:59.005497 4
1999-12-31 23:59:59.006908 6
1999-12-31 23:59:59.008860 7
1999-12-31 23:59:59.009257 8
1999-12-31 23:59:59.010012 9
1999-12-31 23:59:59.011451 10
1999-12-31 23:59:59.013177 12
1999-12-31 23:59:59.014138 13
1999-12-31 23:59:59.015795 14
1999-12-31 23:59:59.015865 14
1999-12-31 23:59:59.016069 15
1999-12-31 23:59:59.016666 15
1999-12-31 23:59:59.016718 15
1999-12-31 23:59:59.019058 18
1999-12-31 23:59:59.019675 18
1999-12-31 23:59:59.020747 19
1999-12-31 23:59:59.021856 20
1999-12-31 23:59:59.022959 22
1999-12-31 23:59:59.023812 22
1999-12-31 23:59:59.023938 23
1999-12-31 23:59:59.024122 23
1999-12-31 23:59:59.025332 24
1999-12-31 23:59:59.025397 24
... ...
1999-12-31 23:59:59.959725 958
1999-12-31 23:59:59.959742 958
1999-12-31 23:59:59.959892 959
1999-12-31 23:59:59.960345 959
1999-12-31 23:59:59.960800 959
1999-12-31 23:59:59.961054 960
1999-12-31 23:59:59.962749 961
1999-12-31 23:59:59.965681 964
1999-12-31 23:59:59.966409 965
1999-12-31 23:59:59.966558 965
1999-12-31 23:59:59.967357 966
1999-12-31 23:59:59.967842 966
1999-12-31 23:59:59.970465 969
1999-12-31 23:59:59.974022 973
1999-12-31 23:59:59.974734 973
1999-12-31 23:59:59.975879 974
1999-12-31 23:59:59.978291 977
1999-12-31 23:59:59.980483 979
1999-12-31 23:59:59.980868 979
1999-12-31 23:59:59.981417 980
1999-12-31 23:59:59.984208 983
1999-12-31 23:59:59.984639 983
1999-12-31 23:59:59.985533 984
1999-12-31 23:59:59.986785 985
1999-12-31 23:59:59.987502 986
1999-12-31 23:59:59.987914 987
1999-12-31 23:59:59.988406 987
1999-12-31 23:59:59.989436 988
1999-12-31 23:59:59.994449 993
1999-12-31 23:59:59.996657 995
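Note that both snippets label each row with its millisecond offset from the first row rather than with consecutive group numbers starting at 1 as in the question. If consecutive ids are wanted, a possible follow-up (a sketch, not part of the original answer) is to factorize the offsets:

# Hypothetical follow-up: map each distinct millisecond offset to a
# consecutive integer id (pd.factorize numbers values in order of
# first appearance), then shift to start at 1.
ms = (df.index - df.index[0]).astype('timedelta64[ms]')
df['groupid'] = pd.factorize(ms)[0] + 1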
Answer 1 (score: 1)
This is just like a resample operation.

Create the data:
In [39]: pd.set_option('max_rows',12)
In [40]: np.random.seed(11111)
In [41]: dts = sorted([datetime(1999, 12, 31, 23, 59, 59, x) for
x in np.random.randint(1, 999999, 1000)])
In [42]: df = pd.DataFrame({'groupid': np.random.randn(len(dts))}, dts)
So simply grouping by millisecond gives you the groups directly. You can iterate over it, as it is a generator.
In [43]: list(df.groupby(pd.Grouper(freq='ms')))[0:3]
Out[43]:
[(Timestamp('1999-12-31 23:59:59', offset='L'),
groupid
1999-12-31 23:59:59.000789 -1.369503
1999-12-31 23:59:59.000814 0.776049),
(Timestamp('1999-12-31 23:59:59.001000', offset='L'),
groupid
1999-12-31 23:59:59.001041 -0.374915
1999-12-31 23:59:59.001062 -1.470845),
(Timestamp('1999-12-31 23:59:59.002000', offset='L'),
groupid
1999-12-31 23:59:59.002355 -0.240954)]
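To turn this grouper into a groupid column like the one in the question, one sketch (assuming a pandas version with GroupBy.ngroup, available since 0.20.2) would be:

# Number each observed millisecond bucket consecutively and broadcast the
# number back onto the rows; this overwrites the sample 'groupid' column.
df['groupid'] = df.groupby(pd.Grouper(freq='ms')).ngroup() + 1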
Resampling may be even simpler. You can pass a custom function via how.
In [44]: df.resample('ms',how='sum')
Out[44]:
groupid
1999-12-31 23:59:59.000 -0.593454
1999-12-31 23:59:59.001 -1.845759
1999-12-31 23:59:59.002 -0.240954
1999-12-31 23:59:59.003 1.291403
1999-12-31 23:59:59.004 NaN
1999-12-31 23:59:59.005 0.291484
... ...
1999-12-31 23:59:59.994 NaN
1999-12-31 23:59:59.995 NaN
1999-12-31 23:59:59.996 NaN
1999-12-31 23:59:59.997 -0.445052
1999-12-31 23:59:59.998 NaN
1999-12-31 23:59:59.999 -0.895305
[1000 rows x 1 columns]
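For example, a custom aggregation might look like the sketch below; the how= keyword matches the pandas version used above, while newer versions spell it df.resample('ms').apply(custom) instead. The function name custom is purely illustrative.

# Illustrative custom aggregation: largest absolute value per millisecond
# bucket (empty buckets come out as NaN).
def custom(x):
    return x.abs().max()

df.resample('ms', how=custom)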