Question

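I want to aggregate points (x, y) by unique y value, taking the mean of x within each group. I also need the count for each unique y value and which points of the original data were aggregated into each group, so that I can return how to reverse the operation without having to describe how the aggregation was done (I want a common interface for different aggregation algorithms).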
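To make the desired outputs concrete, here is a minimal sketch on toy data. The five points and the names `inverse`, `members` and `x_restored` are illustrative assumptions, not part of the real pipeline; "reversing" is shown here as broadcasting each group mean back to its original positions.

```python
import numpy as np

# Hypothetical toy input: five points with three distinct y values.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.5, 0.1, 0.5, 0.9, 0.1])

# inverse[i] is the group that original point i was aggregated into.
y_unique, inverse, counts = np.unique(y, return_inverse=True, return_counts=True)

# Mean of x per unique y: per-group sum divided by per-group count.
x_mean = np.bincount(inverse, weights=x) / counts

# members[g] lists the original indices that went into group g.
members = [np.flatnonzero(inverse == g) for g in range(len(y_unique))]

# Reversing the aggregation needs only the membership lists,
# not any knowledge of how the groups were formed.
x_restored = np.empty_like(x)
for g, idx in enumerate(members):
    x_restored[idx] = x_mean[g]
```

Here `members` plays the role of `m_aggregated` in the implementation below, and `counts` the role of `s_aggregated`.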

I have the following implementation, which works correctly but is too slow:

import numpy as np
import pandas as pd
import time

# generate input:
n_points = 10_000_000  # 10 million points
n_distinct_points = 5000

y_dist = np.random.rand(n_distinct_points)
y_idx = np.random.randint(n_distinct_points, size=n_points)
Y = y_dist[y_idx]
X = np.random.rand(n_points)

start_pt = time.process_time()
start_pc = time.perf_counter()

# make a df to perform group by
df = pd.DataFrame({'x': np.ravel(X), 'y': np.ravel(Y)})
grouped = df.groupby('y')
x_mean = grouped['x'].mean()
x_aggregated = x_mean.to_numpy()           # mean x per unique y
y_aggregated = x_mean.index.to_numpy()     # the unique y values (sorted)
s_aggregated = grouped.size().to_numpy()   # number of points per unique y
m_aggregated = grouped.indices.values()    # original row indices per unique y

duration_pt = time.process_time() - start_pt
duration_pc = time.perf_counter() - start_pc
print("cpu time {:.5f}s, real time {:.5f}s".format(duration_pt, duration_pc))

This takes about 1.3 s on my computer, and computing m_aggregated takes 0.7 s.

Does anyone know how to make this faster?
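One direction that might be worth timing is a NumPy-only variant that skips the DataFrame entirely. The sketch below is an assumption-laden alternative, not a measured improvement: `aggregate_by_y` is a hypothetical helper name, and whether it beats the `groupby` version on 10 million points would need to be benchmarked. It uses `np.unique` for the group labels, `np.bincount` for the means, and a stable `np.argsort` plus `np.split` to recover the member indices.

```python
import numpy as np

def aggregate_by_y(x, y):
    """Hypothetical helper: mean x per unique y, counts, and member indices."""
    # inverse[i] is the index into y_unique of the group that point i belongs to.
    y_unique, inverse, counts = np.unique(
        y, return_inverse=True, return_counts=True
    )
    # Per-group sum of x divided by per-group count gives the mean.
    x_mean = np.bincount(inverse, weights=x, minlength=len(y_unique)) / counts
    # A stable sort of the group labels places each group's original
    # indices contiguously, in ascending order within each group.
    order = np.argsort(inverse, kind="stable")
    # Cut the sorted index array at the group boundaries.
    members = np.split(order, np.cumsum(counts)[:-1])
    return x_mean, y_unique, counts, members

# Small-scale usage with the same input construction as above.
rng = np.random.default_rng(0)
n_points, n_distinct_points = 1000, 20
y_dist = rng.random(n_distinct_points)
Y = y_dist[rng.integers(n_distinct_points, size=n_points)]
X = rng.random(n_points)

x_aggregated, y_aggregated, s_aggregated, m_aggregated = aggregate_by_y(X, Y)
```

The four return values are meant to correspond to `x_aggregated`, `y_aggregated`, `s_aggregated` and `m_aggregated` in the pandas version, with `m_aggregated` returned as a list of index arrays rather than a dict view.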

Improve the performance of Pandas GroupBy to get indices

0 answers: