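I am pulling data from a table in a data warehouse. I have to iterate over all of the rows, and for each row I have to look in the same table and fetch the historical data for the preceding months for the same id. Once I have the historical data, I want to iterate over it and compute an average at each iteration.
The data and a code example can be found here: https://github.com/jordi-crespo/optimise-data-tranformation-in-python/blob/master/stackover/stackoverflow.py
I have to do this in Python. I have considered using pandas, but it takes quite a while to process everything: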

    from datetime import datetime

    import pandas as pd
    from dateutil.relativedelta import relativedelta

    df = pd.read_csv('df.csv')
    df_past = pd.read_csv('df_past.csv')

    def getdataframe(df, date, id):
        # transform date into datetime object
        datetime_object = datetime.strptime(date, '%Y-%m-%d')
        # transform string column to datetime column
        df['date'] = pd.to_datetime(df['date'])
        # date 12 months before the given date
        previous12thmonth = datetime_object - relativedelta(months=12)
        # filter per id
        df = df[df['id'] == id]
        # keep only the rows from the previous 12 months onwards
        filtered_dataframe = df[df['date'] >= pd.Timestamp(previous12thmonth)]
        # fresh 0..n-1 index so the positional slice below lines up with the labels
        return filtered_dataframe.reset_index(drop=True)

    def average12months(df_past):
        for index, past in df_past.iterrows():
            # average of the prices before this row (10 for the first row)
            averageCancelRatePreviousMonths = df_past['price'][:index].mean() if index != 0 else 10
            df_past.loc[index, 'price'] = averageCancelRatePreviousMonths
        average = df_past['price'].mean()
        return average, df_past

    for index, rows in df.iterrows():
        # keep the full df_past intact; work on the filtered history for this row
        history = getdataframe(df_past, rows['date'], rows['id'])
        average, history = average12months(history)

I have read that, to speed things up, you can use vectorization on pandas Series: https://engineering.upside.com/a-beginners-guide-to-optimizing-pandas-code-for-speed-c09ef2c6a4d6
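
To make the idea concrete, here is a minimal sketch of what vectorization could mean for the inner step (the mean of all earlier prices for each row). The `prices` values are made up for illustration, and the sketch leaves the first row as `NaN` rather than using the default of 10 from the code above.

```python
import pandas as pd

# Made-up prices for a single id, purely for illustration.
prices = pd.Series([10.0, 20.0, 30.0, 40.0])

# Loop version: for each position, the mean of all prices before it.
# The first entry is NaN because there is nothing before the first row.
loop_result = [prices[:i].mean() for i in range(len(prices))]

# Vectorized version: expanding mean over the whole Series, shifted by one
# row so that each row only sees the rows that come before it.
vectorized = prices.expanding().mean().shift(1)

print(vectorized)
```

If the default of 10 for the first row matters, `vectorized.fillna(10)` would restore it.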
How can I replace these nested loops with vectorized operations on Series?
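
One possible direction, sketched under assumptions rather than offered as a definitive answer: group by `id` and use a time-based rolling window, so that the per-row filtering and averaging happen inside pandas. The data below is made up, the column name `avg_12m` is hypothetical, and a 365-day window that includes the current row is only an approximation of "the previous 12 months".

```python
import pandas as pd

# Made-up data; the column names id, date and price follow the code above.
df = pd.DataFrame({
    'id': [1, 1, 1, 2, 2],
    'date': ['2019-01-15', '2019-06-15', '2020-03-15', '2019-02-01', '2019-03-01'],
    'price': [10.0, 20.0, 30.0, 5.0, 15.0],
})
df['date'] = pd.to_datetime(df['date'])

# Sort by id and date so the grouped result lines up with the row order.
df = df.sort_values(['id', 'date']).reset_index(drop=True)

# For each id, average the prices in a trailing 365-day window that ends at
# (and includes) each row's date. Assumption: 365 days stands in for 12 months.
rolled = (
    df.set_index('date')
      .groupby('id')['price']
      .rolling('365D')
      .mean()
)

# The result is ordered by id, then date, which matches the sorted frame.
df['avg_12m'] = rolled.to_numpy()  # avg_12m is a hypothetical column name

print(df)
```

This replaces both the outer `iterrows` loop and the per-row filtering with a single grouped operation; whether the window should exclude the current row or use calendar months depends on the real requirement.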