Question

I have a pandas DataFrame in which one column holds real-valued data between 0 and 50. The values are not uniformly distributed.

I can get the distribution with:

hist, bins = np.histogram(df["speed_array"])

What I want to do is replace each value with the index of the bin it belongs to.

This works:

for i in range(len(df["speed_array"])):
    df["speed_array"].iloc[i] = np.searchsorted(bins, df["speed_array"].iloc[i])

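However, it is very slow (50 minutes) for a DataFrame with more than 4 million rows. I'm looking for a more efficient approach. Does anyone have a better idea?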

Answer 1

Simply call np.searchsorted once on the entire underlying array -

df["speed_array"] = np.searchsorted(bins, df["speed_array"].values)
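To check that the one-line version gives the same result as the loop, here is a minimal sketch on a small made-up dataset. The sample data (a scaled beta distribution), the seed, and the output column name `bin` are assumptions for illustration only; they are not from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: 1,000 non-uniformly distributed values in [0, 50].
rng = np.random.default_rng(0)
df = pd.DataFrame({"speed_array": rng.beta(2, 5, size=1000) * 50})

# Bin edges from the histogram (10 bins by default, so 11 edges).
hist, bins = np.histogram(df["speed_array"])

values = df["speed_array"].to_numpy()

# Element-by-element lookup, as in the question's loop.
looped = np.array([np.searchsorted(bins, v) for v in values])

# Vectorized lookup: a single call over the whole array.
vectorized = np.searchsorted(bins, values)

# Store the bin indices in a new (hypothetical) column so the raw values survive.
df["bin"] = vectorized
```

Note that with the default side="left", a value equal to a bin edge gets that edge's index, so the column minimum maps to 0 and every other value maps to an index from 1 to len(bins) - 1.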

Runtime test -

In [140]: # 4 million rows with 100 bins
     ...: df = pd.DataFrame(np.random.randint(0,1000,(4000000,1)))
     ...: df.columns = ['speed_array']
     ...: bins = np.sort(np.random.choice(1000, size=100, replace=False))
     ...: 

In [141]: def searchsorted_app(df):
     ...:     df["speed_array"] = np.searchsorted(bins, df["speed_array"].values)
     ...:     

In [142]: %timeit searchsorted_app(df)
10 loops, best of 3: 15.3 ms per loop
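For anyone without IPython, the timing above can be approximated with the standard timeit module. This is only a sketch under assumptions: the random generator, the seed, and the choice to return the result instead of overwriting the column are mine, and the measured time will depend on the machine. Returning the result means every timed run sees the same input, whereas the benchmark above overwrites the column on its first run.

```python
import timeit

import numpy as np
import pandas as pd

# Same shape as the benchmark above: 4 million rows, 100 sorted bin edges.
rng = np.random.default_rng(0)
n_rows = 4_000_000
df = pd.DataFrame({"speed_array": rng.integers(0, 1000, size=n_rows)})
bins = np.sort(rng.choice(1000, size=100, replace=False))


def searchsorted_app(frame):
    # Return the bin indices rather than overwriting the column,
    # so repeated runs are timed on identical input.
    return np.searchsorted(bins, frame["speed_array"].to_numpy())


# Best of 3 single runs, in seconds.
best = min(timeit.repeat(lambda: searchsorted_app(df), number=1, repeat=3))
print(f"best of 3: {best * 1e3:.1f} ms")
```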
