Python Pandas - Creating a function to replace repeated DataFrames

Time: 2019-12-13 23:04:30

Tags: python pandas function dataframe group-by

I'm new to Python and have managed to build the following code, which produces the expected results in four separate DataFrames:

import pandas as pd
x2019 = df.Date.between('2015-06-28','2015-07-04') #Transaction Dates we want to analyze
y2019 = df.First_Purchase_Date.between('2014-01-01','2015-07-04') #customer first purchase dates we want to include in the dataset

TABLE_2019_USA_XX = df.loc[x2019 & y2019 & (df['Region'] == 'USA')].groupby(df['FPYear'])[['New Customer', 'Existing Customer', 'revenue']].sum() #with date filters for table
TABLE_2019_USA_XX['TotalCusts'] = TABLE_2019_USA_XX['New Customer'] + TABLE_2019_USA_XX['Existing Customer']

TABLE_2019_CANADA_XX = df.loc[x2019 & y2019 & (df['Region'] == 'Canada')].groupby(df['FPYear'])[['New Customer', 'Existing Customer', 'revenue']].sum() #with date filters for table
TABLE_2019_CANADA_XX['TotalCusts'] = TABLE_2019_CANADA_XX['New Customer'] + TABLE_2019_CANADA_XX['Existing Customer']

x2018 = df.Date.between('2014-07-23','2014-07-28') #Transaction Dates we want to analyze
y2018 = df.First_Purchase_Date.between('2014-01-01','2014-07-30') #customer first purchase dates we want to include in the dataset

TABLE_2018_USA_XX = df.loc[x2018 & y2018 & (df['Region'] == 'USA')].groupby(df['FPYear'])[['New Customer', 'Existing Customer', 'revenue']].sum() #with date filters for table
TABLE_2018_USA_XX['TotalCusts'] = TABLE_2018_USA_XX['New Customer'] + TABLE_2018_USA_XX['Existing Customer']
TABLE_2018_CANADA_XX = df.loc[x2018 & y2018 & (df['Region'] == 'Canada')].groupby(df['FPYear'])[['New Customer', 'Existing Customer', 'revenue']].sum() #with date filters for table
TABLE_2018_CANADA_XX['TotalCusts'] = TABLE_2018_CANADA_XX['New Customer'] + TABLE_2018_CANADA_XX['Existing Customer']

print(TABLE_2018_USA_XX)
print(TABLE_2019_USA_XX)
print(TABLE_2018_CANADA_XX)
print(TABLE_2019_CANADA_XX)

Output

FPYear  New Customer  Existing Customer  revenue  TotalCusts
2014    0             23                 134      23
2015    12            32                 432      44

FPYear  New Customer  Existing Customer  revenue  TotalCusts
2014    432           421                4315     853
2015    3415          452                2341     3867

FPYear  New Customer  Existing Customer  revenue  TotalCusts
2014    22            432                4312     454
2015    33            345                3415     378

FPYear  New Customer  Existing Customer  revenue  TotalCusts
2014    5             35                 4312     40
2015    432           32                 6131     464

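Based on the information and feedback I received while building this script, I know I should be able to build the above with a function, but I don't know how. Could someone suggest how to get started? I'm essentially trying to shorten the script and make it more efficient.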

2 Answers:

Answer 0 (score: 1)

IIUC, you have the same columns repeated across your DataFrames and are performing the same operation over and over?

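# Collect the DataFrame objects themselves; a list of name strings cannot be concatenated
dfs = [TABLE_2019_CANADA_XX, TABLE_2018_CANADA_XX, TABLE_2018_USA_XX, TABLE_2019_USA_XX]

combined = pd.concat(dfs)  # new name, so the original df is not overwritten

# Assumes a 'Region' column is present; the aggregated tables in the question do not carry one
combined.groupby(['FPYear','Region'])[['New Customer', 'Existing Customer', 'revenue']].sum()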
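One way to make the concatenation idea workable is to label each table with its region while concatenating, since the aggregated tables no longer carry a Region column. This is a minimal sketch, assuming two small stand-in tables whose values are copied from the sample output in the question; `usa`, `canada` and `combined` are illustrative names, not part of the original code:

```python
import pandas as pd

cols = ["New Customer", "Existing Customer", "revenue"]
years = pd.Index([2014, 2015], name="FPYear")

# Stand-in tables (assumption): values taken from the question's sample output.
usa = pd.DataFrame([[0, 23, 134], [12, 32, 432]], index=years, columns=cols)
canada = pd.DataFrame([[22, 432, 4312], [33, 345, 3415]], index=years, columns=cols)

# keys= adds an outer index level that records which table each row came from.
combined = pd.concat([usa, canada], keys=["USA", "Canada"], names=["Region", "FPYear"])

# The derived column only needs to be computed once on the combined table.
combined["TotalCusts"] = combined["New Customer"] + combined["Existing Customer"]

# Aggregating over the labelled level gives per-region totals.
per_region = combined.groupby(level="Region")[cols].sum()
print(combined)
print(per_region)
```

Keeping everything in one table with a labelled index means a single region can still be pulled out later, for example with `combined.loc["USA"]`.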

Answer 1 (score: 1)

Just define a function and pass the dates and region used as filters as arguments:

import pandas as pd
def process(df, start_dt, end_dt, purch_start, purch_end, region):
    mask_date = df['Date'].between(start_dt, end_dt)
    mask_purch_date = df['First_Purchase_Date'].between(purch_start, purch_end)
    mask_region = df['Region'] == region

    temp_df = df[mask_date & mask_purch_date & mask_region].groupby(df['FPYear'])[['New Customer', 'Existing Customer', 'revenue']].sum()

    temp_df['TotalCusts'] = temp_df['New Customer'] + temp_df['Existing Customer']

    return temp_df


TABLE_2019_USA_XX = process(df,'2015-06-28','2015-07-04', '2014-01-01','2015-07-04', 'USA')

TABLE_2019_CANADA_XX = process(df,'2015-06-28','2015-07-04', '2014-01-01','2015-07-04', 'Canada')

TABLE_2018_USA_XX = process(df,'2014-07-23','2014-07-28', '2014-01-01','2014-07-30', 'USA')

TABLE_2018_CANADA_XX = process(df,'2014-07-23','2014-07-28','2014-01-01','2014-07-30', 'Canada')
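The four calls above still repeat the date windows by hand. A possible next step is to keep the windows in a dictionary and build every table in a loop. The sketch below is self-contained, so it redefines `process` along the same lines and runs on a tiny made-up `df` (the real data is not shown in the question). The names `windows`, `regions` and `tables` are hypothetical, and the labels "2019" and "2018" only mirror the question's variable names:

```python
import pandas as pd

# Made-up sample data (assumption), shaped like the columns used in the question.
df = pd.DataFrame({
    "Date": pd.to_datetime(["2015-06-29", "2015-07-01", "2015-07-02", "2014-07-24", "2014-07-25"]),
    "First_Purchase_Date": pd.to_datetime(["2014-03-01", "2015-07-01", "2015-01-15", "2014-07-24", "2014-02-10"]),
    "Region": ["USA", "USA", "Canada", "USA", "Canada"],
    "FPYear": [2014, 2015, 2015, 2014, 2014],
    "New Customer": [0, 1, 0, 1, 0],
    "Existing Customer": [1, 0, 1, 0, 1],
    "revenue": [10, 20, 30, 40, 50],
})

def process(df, start_dt, end_dt, purch_start, purch_end, region):
    """Filter by transaction dates, first-purchase dates and region, then sum per FPYear."""
    mask = (
        df["Date"].between(start_dt, end_dt)
        & df["First_Purchase_Date"].between(purch_start, purch_end)
        & (df["Region"] == region)
    )
    out = df[mask].groupby("FPYear")[["New Customer", "Existing Customer", "revenue"]].sum()
    out["TotalCusts"] = out["New Customer"] + out["Existing Customer"]
    return out

# One entry per analysis window: (start_dt, end_dt, purch_start, purch_end).
windows = {
    "2019": ("2015-06-28", "2015-07-04", "2014-01-01", "2015-07-04"),
    "2018": ("2014-07-23", "2014-07-28", "2014-01-01", "2014-07-30"),
}
regions = ["USA", "Canada"]

# Build every (window, region) table in one pass instead of one variable per table.
tables = {
    (label, region): process(df, *dates, region)
    for label, dates in windows.items()
    for region in regions
}

for key, table in tables.items():
    print(key)
    print(table)
```

With this layout, adding a new region or window means adding one entry to `regions` or `windows` rather than writing another call, and an individual result is looked up as `tables[("2019", "USA")]`.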