Question

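I'm new to Python and pandas, and with my limited experience I came up with an inefficient solution that makes my code too slow.

I have some data corresponding to stock market prices.
It is sampled at random times (nanosecond resolution).
What I am trying to achieve is to convert it into a new dataset with a fixed sampling rate.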

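I am transforming the dataset as follows:

I set time_delta to a static time step of 0.5 seconds
I drop records that correspond to the same nanosecond
I generate timestamps from my start_time to the computed end_time
I iterate over the original dataframe and, at each step, copy the last known record within my time_delta (duplicating it when needed) into the new dataframe.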
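
The steps above can be sketched without a Python-level loop. This is only a minimal sketch on made-up toy data, assuming "last known record" means the last record at or before each fixed timestamp and that TimeStamp holds integer nanoseconds; the function name resample_last_known and its parameters are hypothetical, not part of my script.

```python
import numpy as np
import pandas as pd

def resample_last_known(df, start_ns, step_ns, n_samples, ts_col="TimeStamp"):
    """Return one row per fixed timestamp, holding the last record at or before it."""
    # Keep only the last record for each duplicated timestamp, in time order
    dedup = df.drop_duplicates(subset=ts_col, keep="last").sort_values(ts_col)
    ts = dedup[ts_col].to_numpy(dtype=np.int64)

    # Fixed grid: start, start + step, ... (n_samples points in total)
    grid = start_ns + step_ns * np.arange(n_samples, dtype=np.int64)

    # For each grid point, position of the last record with timestamp <= grid point
    pos = np.searchsorted(ts, grid, side="right") - 1
    # Grid points that precede the first record fall back to the first record
    pos = np.clip(pos, 0, len(ts) - 1)

    # Select all rows in one operation, then stamp them with the fixed grid
    out = dedup.iloc[pos].reset_index(drop=True)
    out[ts_col] = grid
    return out

# Toy data: five ticks at irregular nanosecond timestamps (made-up values)
ticks = pd.DataFrame({
    "TimeStamp": [0, 200_000_000, 200_000_000, 700_000_000, 1_600_000_000],
    "Price": [10.0, 10.5, 10.6, 11.0, 12.0],
})
fixed = resample_last_known(ticks, start_ns=0, step_ns=500_000_000, n_samples=4)
print(fixed)
```

The row selection happens once through iloc with an array of positions, so the cost no longer grows with every appended row.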

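I believe my problem is that I append records to the new dataframe one at a time, but I have not been able to find a way to optimize my code using pandas built-ins.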
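
One pandas built-in that might fit is pd.merge_asof, which joins each row of one sorted frame to the nearest earlier row of another. The following is a sketch on made-up toy data, not a tested replacement for my script; it assumes both TimeStamp columns are sorted and share the same integer dtype, and that duplicate timestamps have already been dropped.

```python
import numpy as np
import pandas as pd

# Toy ticks at irregular nanosecond timestamps (made-up values, already de-duplicated)
ticks = pd.DataFrame({
    "TimeStamp": np.array([0, 200_000_000, 700_000_000, 1_600_000_000], dtype=np.int64),
    "Price": [10.0, 10.6, 11.0, 12.0],
})

# Fixed grid with a 0.5 second step, expressed in nanoseconds
grid = pd.DataFrame({
    "TimeStamp": np.arange(0, 2_000_000_000, 500_000_000, dtype=np.int64)
})

# direction="backward" picks, for each grid row, the last tick with
# TimeStamp <= the grid TimeStamp; both frames must be sorted on the key
fixed = pd.merge_asof(grid, ticks, on="TimeStamp", direction="backward")
print(fixed)
```

Grid rows that precede the first tick would get NaN in the tick columns, so the start of the grid may need separate handling.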

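When run on Google Colab, a single day of data currently takes about 4 minutes (converting roughly 30,000 samples into 57,600).
I also tested it locally, with no improvement.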


import numpy as np
from tqdm import tqdm

# ====================================================================
# Rate Re-Definition
# ====================================================================

SAMPLES_PER_SECOND = 2
dt = 1_000_000_000 // SAMPLES_PER_SECOND  # Time delta in nanoseconds (kept as an integer)
SECONDS_IN_WORK_DAY = 28800  # 60 seconds * 60 minutes * 8 hours
TOTAL_SAMPLES = SECONDS_IN_WORK_DAY * SAMPLES_PER_SECOND
SAMPLING_PERIOD = dt * TOTAL_SAMPLES

start_of_day_timestamp = ceil_to_minute(df['TimeStamp'].iloc[0])
end_of_day_timestamp = start_of_day_timestamp + SAMPLING_PERIOD

fixed_timestamps = np.arange(start_of_day_timestamp,
                             end_of_day_timestamp,
                             dt,
                             dtype=np.uint64
                            )


# ====================================================================
# Drop records corresponding to the same timestamps
# ====================================================================

df1 = df.drop_duplicates(subset='TimeStamp', keep="last")


# ====================================================================
# Construct new dataframe
# ====================================================================

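# Collect the position of the record chosen for each fixed timestamp and build
# the new frame once; appending row by row copies the whole frame on every step
# (and DataFrame.append was removed in pandas 2.0).
timestamps = df1['TimeStamp'].to_numpy()
index_bounds_limit = len(timestamps) - 1
index = 0
selected = [0]  # the first output row is the first record

for i in tqdm(range(1, TOTAL_SAMPLES), desc="Constructing fixed sampling rate records... "):
  # Advance to the first record at or after the current fixed timestamp
  while index < index_bounds_limit and timestamps[index] < fixed_timestamps[i]:
    index += 1
  selected.append(index)

df2 = df1.iloc[selected].reset_index(drop=True)
df2['TimeStamp'] = fixed_timestamps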

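I need to reduce the run time as much as possible (while keeping the code readable and maintainable, without resorting to "hacks").

I would appreciate any help and guidance in the right direction.

Thanks in advance.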

Time optimization of pandas DataFrame reconstruction (random sampling)

0 answers: