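I want to aggregate points (x, y) by unique y value and take the mean of x within each group. I also need the count for each unique y value and which points of the original data were aggregated into each group, so that I can return how to invert the operation without describing how the aggregation was done (I want a generic interface for different aggregation algorithms).
I have the following implementation, which works correctly but is too slow: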
import numpy as np
import pandas as pd
import time
# generate input:
n_points = 10000000 # 10 mil pts
n_distinct_points = 5000
y_dist = np.random.rand(n_distinct_points)
y_idx = np.random.randint(n_distinct_points, size=n_points)
Y = y_dist[y_idx]
X = np.random.rand(n_points)
start_pt = time.process_time()
start_pc = time.perf_counter()
# make a df to perform group by
df = pd.DataFrame({'x':np.ravel(X), 'y':np.ravel(Y)})
grouped = df.groupby('y')
x_mean = grouped['x'].mean()
x_aggregated = x_mean.to_numpy()
y_aggregated = x_mean.index.to_numpy()
s_aggregated = grouped.size().to_numpy()
m_aggregated = grouped.indices.values()
duration_pt = time.process_time() - start_pt
duration_pc = time.perf_counter() - start_pc
print("cpu time {:.5f}s, real time {:.5f}s".format(duration_pt, duration_pc))
This takes about 1.3 seconds on my machine, with 0.7 seconds spent computing m_aggregated.
Does anyone know how to make this faster?
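For comparison, here is a minimal NumPy-only sketch of the same aggregation that avoids building a DataFrame. It sorts once with a stable argsort and derives the means, counts, and member indices from that single ordering. The function name `aggregate_by_y` is hypothetical, the sketch assumes y contains no NaN values, and it has not been benchmarked against the pandas version, so any speedup would need to be measured on the real data.

```python
import numpy as np

def aggregate_by_y(x, y):
    """Group points by unique y (hypothetical helper, not a library function).

    Returns (mean x per group, unique y values, group sizes, member indices).
    Assumes y contains no NaN values.
    """
    x = np.asarray(x, dtype=float).ravel()
    y = np.asarray(y).ravel()
    if y.size == 0:
        return np.empty(0), y.copy(), np.empty(0, dtype=int), []
    # A stable sort places equal y values next to each other and keeps
    # the original order of the points inside each group.
    order = np.argsort(y, kind="stable")
    y_sorted = y[order]
    # Position in the sorted array where each run of equal y values begins.
    is_start = np.concatenate(([True], y_sorted[1:] != y_sorted[:-1]))
    starts = np.flatnonzero(is_start)
    counts = np.diff(np.append(starts, y.size))
    # Sum x over each run, then divide by the run length to get the mean.
    sums = np.add.reduceat(x[order], starts)
    # Original indices of the points that went into each group.
    members = np.split(order, starts[1:])
    return sums / counts, y_sorted[starts], counts, members

# Small usage example: aggregate, then broadcast the result back.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.5, 0.1, 0.5, 0.1, 0.9])
x_agg, y_agg, s_agg, m_agg = aggregate_by_y(x, y)

x_back = np.empty_like(x)
for group, idx in enumerate(m_agg):
    x_back[idx] = x_agg[group]
```

The loop at the end shows the inversion using only the member indices, without any knowledge of how the groups were formed, which is the generic interface described at the top of the question.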