I'm new to Python and managed to put together the code below, which produces the expected results in four separate DataFrames:
import pandas as pd
x2019 = df.Date.between('2015-06-28','2015-07-04') #Transaction Dates we want to analyze
y2019 = df.First_Purchase_Date.between('2014-01-01','2015-07-04') #customer first purchase dates we want to include in the dataset
TABLE_2019_USA_XX = df.loc[x2019 & y2019 & (df['Region'] == 'USA')].groupby(df['FPYear'])[['New Customer', 'Existing Customer', 'revenue']].sum() #with date filters for table
TABLE_2019_USA_XX['TotalCusts'] = TABLE_2019_USA_XX['New Customer'] + TABLE_2019_USA_XX['Existing Customer']
TABLE_2019_CANADA_XX = df.loc[x2019 & y2019 & (df['Region'] == 'Canada')].groupby(df['FPYear'])[['New Customer', 'Existing Customer', 'revenue']].sum() #with date filters for table
TABLE_2019_CANADA_XX['TotalCusts'] = TABLE_2019_CANADA_XX['New Customer'] + TABLE_2019_CANADA_XX['Existing Customer']
x2018 = df.Date.between('2014-07-23','2014-07-28') #Transaction Dates we want to analyze
y2018 = df.First_Purchase_Date.between('2014-01-01','2014-07-30') #customer first purchase dates we want to include in the dataset
TABLE_2018_USA_XX = df.loc[x2018 & y2018 & (df['Region'] == 'USA')].groupby(df['FPYear'])[['New Customer', 'Existing Customer', 'revenue']].sum() #with date filters for table
TABLE_2018_USA_XX['TotalCusts'] = TABLE_2018_USA_XX['New Customer'] + TABLE_2018_USA_XX['Existing Customer']
TABLE_2018_CANADA_XX = df.loc[x2018 & y2018 & (df['Region'] == 'Canada')].groupby(df['FPYear'])[['New Customer', 'Existing Customer', 'revenue']].sum() #with date filters for table
TABLE_2018_CANADA_XX['TotalCusts'] = TABLE_2018_CANADA_XX['New Customer'] + TABLE_2018_CANADA_XX['Existing Customer']
print(TABLE_2018_USA_XX)
print(TABLE_2019_USA_XX)
print(TABLE_2018_CANADA_XX)
print(TABLE_2019_CANADA_XX)
Output:
FPYear  New Customer  Existing Customer  revenue  TotalCusts
2014               0                 23      134          23
2015              12                 32      432          44

FPYear  New Customer  Existing Customer  revenue  TotalCusts
2014             432                421     4315         853
2015            3415                452     2341        3867

FPYear  New Customer  Existing Customer  revenue  TotalCusts
2014              22                432     4312         454
2015              33                345     3415         378

FPYear  New Customer  Existing Customer  revenue  TotalCusts
2014               5                 35     4312          40
2015             432                 32     6131         464
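From the information and feedback I received while building this script, I understand that I should be able to produce the above with a function, but I don't know how. Can anyone suggest how to get started? I'm essentially trying to shorten the script and make it more efficient.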
Answer 0 (score: 1)
IIUC, you have the same columns in each DataFrame and are performing the same operation over and over?
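# Pass the DataFrame objects themselves, not their names as strings
dfs = [TABLE_2019_CANADA_XX, TABLE_2018_CANADA_XX, TABLE_2018_USA_XX, TABLE_2019_USA_XX]
# Use a new name so the original df is not overwritten
combined = pd.concat(dfs)
# Assumes 'FPYear' and 'Region' are available as columns in the combined frame
combined.groupby(['FPYear', 'Region'])[['New Customer', 'Existing Customer', 'revenue']].sum()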
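The idea of grouping by both `FPYear` and `Region` can also be applied directly to the original `df`, so that one pass per date window covers every region. The following is only a minimal sketch: the sample rows are made up for illustration, and the names `in_window`, `first_purchase_ok`, and `summary` are hypothetical. It assumes the date columns are already datetimes.

```python
import pandas as pd

# Hypothetical sample rows standing in for the asker's df
df = pd.DataFrame({
    'Date': pd.to_datetime(['2015-06-29', '2015-07-01', '2015-07-02', '2014-07-24']),
    'First_Purchase_Date': pd.to_datetime(['2014-03-01', '2015-07-01', '2015-07-02', '2014-02-01']),
    'Region': ['USA', 'USA', 'Canada', 'Canada'],
    'FPYear': [2014, 2015, 2015, 2014],
    'New Customer': [0, 1, 1, 0],
    'Existing Customer': [1, 0, 0, 1],
    'revenue': [100, 40, 50, 70],
})

# Date filters for one analysis window (the "2019" window from the question)
in_window = df['Date'].between('2015-06-28', '2015-07-04')
first_purchase_ok = df['First_Purchase_Date'].between('2014-01-01', '2015-07-04')

# Group by region and first-purchase year in a single pass
summary = (
    df.loc[in_window & first_purchase_ok]
      .groupby(['Region', 'FPYear'])[['New Customer', 'Existing Customer', 'revenue']]
      .sum()
)
summary['TotalCusts'] = summary['New Customer'] + summary['Existing Customer']

# Selecting on the first index level recovers a per-region table
usa = summary.loc['USA']
print(summary)
```

Because the two periods in the question use different date windows, this would still need to be run once per window.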
Answer 1 (score: 1)
Just define a function and pass in the dates and region used as filters as arguments:
import pandas as pd
def process(df, start_dt, end_dt, purch_start, purch_end, region):
    # Boolean masks for transaction date, first purchase date, and region
    mask_date = df['Date'].between(start_dt, end_dt)
    mask_purch_date = df['First_Purchase_Date'].between(purch_start, purch_end)
    mask_region = df['Region'] == region
    # Filter, then sum the customer and revenue columns per first-purchase year
    temp_df = df[mask_date & mask_purch_date & mask_region].groupby(df['FPYear'])[['New Customer', 'Existing Customer', 'revenue']].sum()
    temp_df['TotalCusts'] = temp_df['New Customer'] + temp_df['Existing Customer']
    return temp_df
TABLE_2019_USA_XX = process(df,'2015-06-28','2015-07-04', '2014-01-01','2015-07-04', 'USA')
TABLE_2019_CANADA_XX = process(df,'2015-06-28','2015-07-04', '2014-01-01','2015-07-04', 'Canada')
TABLE_2018_USA_XX = process(df,'2014-07-23','2014-07-28', '2014-01-01','2014-07-30', 'USA')
TABLE_2018_CANADA_XX = process(df,'2014-07-23','2014-07-28','2014-01-01','2014-07-30', 'Canada')
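The four calls above still repeat the same date strings, so one possible next step is to keep the windows in a dict and build all tables in a loop. This is only a sketch under assumptions: the sample `df`, the `PERIODS` and `REGIONS` names, and the `tables` dict are hypothetical, and `process` is restated here (grouping by the column name) so the block runs on its own.

```python
import pandas as pd

# Hypothetical sample rows standing in for the asker's df
df = pd.DataFrame({
    'Date': pd.to_datetime(['2015-06-29', '2015-07-01', '2014-07-24', '2014-07-25']),
    'First_Purchase_Date': pd.to_datetime(['2014-03-01', '2015-07-01', '2014-02-01', '2014-07-24']),
    'Region': ['USA', 'Canada', 'USA', 'Canada'],
    'FPYear': [2014, 2015, 2014, 2014],
    'New Customer': [0, 1, 0, 1],
    'Existing Customer': [1, 0, 1, 0],
    'revenue': [100, 50, 70, 30],
})

def process(df, start_dt, end_dt, purch_start, purch_end, region):
    """Filter by dates and region, then sum per first-purchase year."""
    mask = (
        df['Date'].between(start_dt, end_dt)
        & df['First_Purchase_Date'].between(purch_start, purch_end)
        & (df['Region'] == region)
    )
    out = df[mask].groupby('FPYear')[['New Customer', 'Existing Customer', 'revenue']].sum()
    out['TotalCusts'] = out['New Customer'] + out['Existing Customer']
    return out

# Each label maps to (start_dt, end_dt, purch_start, purch_end), as in the question
PERIODS = {
    '2019': ('2015-06-28', '2015-07-04', '2014-01-01', '2015-07-04'),
    '2018': ('2014-07-23', '2014-07-28', '2014-01-01', '2014-07-30'),
}
REGIONS = ['USA', 'Canada']

# One table per (period, region) pair, stored in a dict instead of four variables
tables = {
    (label, region): process(df, *dates, region)
    for label, dates in PERIODS.items()
    for region in REGIONS
}

for key, table in tables.items():
    print(key)
    print(table)
```

A table is then looked up by key, for example `tables[('2019', 'USA')]`, and adding a new period or region only means adding one entry to `PERIODS` or `REGIONS`.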