This is my DataFrame:
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3, 4, 6, 4, 3, 2, 7]})
buckets = [(0,3),(3,5),(5,9)]
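I also have the histogram buckets shown above. Now I want to assign each row of the DataFrame to a bucket index, so I want a new column containing the following: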
df['buckets_index'] = [0,0,0,1,2,1,0,0,2]
Of course I could do this with a loop, but the DataFrame is fairly large (2.5 million rows), so I need it to be fast.
Any ideas?
Answer 0 (score: 2)
If you only want the index, you can use pd.cut with labels=False:
buckets = [0,3,5,9]
df['bucket'] = pd.cut(df['A'], bins=buckets)
df['bucket_idx'] = pd.cut(df['A'], bins=buckets, labels=False)
Resulting output:
   A  bucket  bucket_idx
0  1  (0, 3]           0
1  2  (0, 3]           0
2  3  (0, 3]           0
3  4  (3, 5]           1
4  6  (5, 9]           2
5  4  (3, 5]           1
6  3  (0, 3]           0
7  2  (0, 3]           0
8  7  (5, 9]           2
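One thing worth checking before running this on 2.5 million rows is how pd.cut treats values on or outside the bin edges. The sketch below is only an illustration: the three-value Series of edge cases is made up for the demo and is not part of the question's data. It relies on the documented defaults right=True and include_lowest=False.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 6, 4, 3, 2, 7]})
buckets = [0, 3, 5, 9]  # edges, giving (0, 3], (3, 5], (5, 9]

# Same call as in the answer: integer bucket positions instead of Interval labels.
idx = pd.cut(df['A'], bins=buckets, labels=False)
print(idx.tolist())  # [0, 0, 0, 1, 2, 1, 0, 0, 2]

# Hypothetical edge cases (not in the original data): 0 sits on the open
# left edge, 10 is above the last edge, 3 sits on a closed right edge.
edge_cases = pd.Series([0, 10, 3])
outside = pd.cut(edge_cases, bins=buckets, labels=False)
print(outside.isna().tolist())  # [True, True, False]

# include_lowest=True closes the first interval on the left, so 0 lands in bucket 0.
with_lowest = pd.cut(edge_cases, bins=buckets, labels=False, include_lowest=True)
print(with_lowest.isna().tolist())  # [False, True, False]
```

Values that fall outside every bin come back as NaN, which also turns the result into a float column, so it may be worth checking for NaN before casting the index to an integer type.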
Answer 1 (score: 1)
You can use np.searchsorted -
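import numpy as np
# buckets is the original list of (low, high) tuples; column 1 holds the right edges
df['buckets_index'] = np.asarray(buckets)[:,1].searchsorted(df.A.values)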
Runtime test -
In [522]: df = pd.DataFrame({'A': np.random.randint(1,8,(10000))})
In [523]: buckets = [0,3,5,9]
In [524]: %timeit pd.cut(df['A'], bins=buckets, labels=False)
1000 loops, best of 3: 460 µs per loop # @root's soln
In [525]: buckets = [(0,3),(3,5),(5,9)]
In [526]: %timeit np.asarray(buckets)[:,1].searchsorted(df.A.values)
10000 loops, best of 3: 166 µs per loop
Out-of-bounds case: if some values can exceed the upper edge of the last bucket, we need to clip the result, like so -
np.asarray(buckets)[:,1].searchsorted(df.A.values).clip(max=len(buckets)-1)
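To make the searchsorted approach concrete, here is a minimal self-contained sketch using the question's data. The out-of-range values in `extra` are invented for illustration. Note one difference from pd.cut: values at or below the first bucket's lower edge are silently placed in bucket 0 rather than flagged as NaN, so this assumes the data is already known to lie within the bucket range.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 6, 4, 3, 2, 7]})
buckets = [(0, 3), (3, 5), (5, 9)]

# Right edges of the buckets: array([3, 5, 9]).
right_edges = np.asarray(buckets)[:, 1]

# The default side='left' returns the first position i with value <= right_edges[i],
# which matches right-closed intervals such as (0, 3].
idx = right_edges.searchsorted(df['A'].values)
print(idx.tolist())  # [0, 0, 0, 1, 2, 1, 0, 0, 2]

# Hypothetical out-of-range values (not in the original data).
extra = np.array([-5, 0, 9, 12])
raw = right_edges.searchsorted(extra)
print(raw.tolist())  # [0, 0, 2, 3]  (3 is past the last bucket)

# Clipping maps anything above the last edge into the last bucket.
clipped = raw.clip(max=len(buckets) - 1)
print(clipped.tolist())  # [0, 0, 2, 2]
```

Whether clipping is the right policy depends on the data: it hides out-of-range values instead of marking them, so an explicit range check beforehand may be preferable if such values indicate a problem.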