How can I speed up the creation of a rolling sum (LTM) in pandas with a large dataset?

Asked: 2019-05-14 21:11:47

Tags: python pandas performance rolling-computation pyarrow

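I want to compute a moving sum of daily sales (rolling twelve months, LTM) for a dataset with 400k rows and 7 columns. My current approach seems to work, but it is slow (it takes between 1 and 2 minutes).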

The columns include: date (daily entries), country, item name (product), customer city, customer number (ID) and customer name.

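Since the other datasets I work with are larger (2+ million rows and more), it would be great if you had suggestions on how to speed up my current code.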

import pandas as pd
import pyarrow.parquet as pq

# import dataset with 300k rows as pandas dataframe
df = pq.read_table('C:/test_cube_300k.parquet').to_pandas(strings_to_categorical=True)

# list for following groupby
list_groupby = [
    "country",
    "item_name",
    "customer_city",
    "customer_number",
    "customer_name"
    ]

# aggregate daily values to a monthly view; resample also adds months that are missing (e.g. January and March have entries but February does not)
df_ltm = df.set_index('date').groupby(list_groupby)["sales"].resample("M").sum()

df_ltm = df_ltm.reset_index()
df_ltm = df_ltm.set_index('date')
df_ltm.sort_index(inplace=True)

# rolling twelve-month sum per group of columns via groupby; the rows are already monthly after the resample above,
# so window=12 rows spans 12 months (rolling() in current pandas versions does not accept a freq argument), min_periods = 12
df_ltm = df_ltm.groupby(list_groupby)['sales'].rolling(window=12, min_periods=12).sum().fillna(0)

df_ltm = df_ltm.reset_index().sort_index()
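
One direction that might be worth timing, sketched on made-up data rather than the real parquet file: because the table is loaded with strings_to_categorical=True, the group keys are categoricals, and in pandas versions where groupby defaults to observed=False the result is built for every combination of categories, including combinations that never occur. The sketch below passes observed=True, then replaces the per-group resample and per-group rolling with a single wide table (one row per month, one column per key combination) and one rolling call. The sample values, the reduced key list and the names monthly, wide, ltm and out are assumptions for illustration only. Note that the result is not identical to the code above: months are filled across the global date range instead of each group's own range, so a group that starts late gets zeros before its first month and partial sums afterwards.

```python
import pandas as pd

# Made-up stand-in for the real data (hypothetical values, only two key columns).
df = pd.DataFrame({
    "date": pd.to_datetime(
        ["2018-01-15", "2018-01-20", "2018-03-05", "2019-01-10", "2018-02-01"]
    ),
    "country": pd.Categorical(["DE", "DE", "DE", "DE", "FR"]),
    "item_name": pd.Categorical(["A", "A", "A", "A", "B"]),
    "sales": [10.0, 5.0, 7.0, 3.0, 4.0],
})
keys = ["country", "item_name"]

# 1) Monthly totals. observed=True keeps only key combinations present in the data
#    instead of the full cartesian product of all categories.
monthly = (
    df.assign(month=df["date"].dt.to_period("M"))
      .groupby(keys + ["month"], observed=True)["sales"]
      .sum()
)

# 2) Wide layout: one row per month, one column per observed key combination.
wide = monthly.unstack(keys, fill_value=0)

# 3) Insert months that are missing everywhere, filled with 0 (global range).
all_months = pd.period_range(wide.index.min(), wide.index.max(), freq="M")
wide = wide.reindex(all_months, fill_value=0)
wide.index.name = "month"

# 4) A single rolling call covers every column: 12 rows = 12 months.
ltm = wide.rolling(window=12, min_periods=12).sum().fillna(0)

# 5) Back to long format: one row per (country, item_name, month).
out = ltm.unstack().rename("sales_ltm").reset_index()
print(out.tail())
```

Whether this is faster on 400k or 2+ million rows depends on how many key combinations exist, since the wide table needs one column per combination; it would have to be measured on the real data.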

0 answers:

No answers.