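Reading inputs from a CSV and using them for different functions

Time: 2019-11-28 11:48:38

Tags: python python-3.x pandas pyspark pyspark-sql

I have a CSV file that contains inputs from the user. The first column is STATISTIC, which names a function in my Python code, and the remaining columns hold the input variables that each statistic needs.

For example, the WEIGHTED_MEAN statistic requires VARIABLE_COLUMN and WEIGHT_VARIABLE.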
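
As a rough illustration of this layout, here is a minimal sketch that parses such a table with the standard library. The column names are the ones used in the code in this post, but every value in the rows is invented, and the real file may have more columns.

```python
import csv
import io

# Hypothetical input table: the column names come from the code in this post,
# but the values in every row are made up for illustration.
SAMPLE_CSV = """STATISTIC,GROUP_BY_VARIABLES,VARIABLE_COLUMN,WEIGHT_VARIABLE,COLUMNS_TO_STAT
WEIGHTED_MEAN,region || product,interest_rate,balance,
MEAN,,,,balance || interest_rate
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

# The group-by variables are read from the first row only.
group_by_variables = rows[0]["GROUP_BY_VARIABLES"].split(" || ")

# Cells that hold several values use " || " as the separator.
columns_to_stat = rows[1]["COLUMNS_TO_STAT"].split(" || ")
```

Each row only fills in the columns that its own statistic needs; the other cells stay empty.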

I read this CSV file with the following Python code, where model_to_summarise is the DataFrame I need to summarise and report_inputs is the CSV described above:

def parse_report_input_table(model_to_summarise, report_inputs):

    statistics_dict = {
        "WEIGHTED_MEAN": Reporting.weighted_mean,
        "MEAN": Reporting.get_mean_of_columns,
        "SUM": Reporting.get_sum_of_columns,
        "MAX": Reporting.get_max_of_columns,
        "MIN": Reporting.get_min_of_columns,
        "COUNT": Reporting.get_count_of_columns,
        "PERIOD_END_BALANCES": Reporting.period_end_balances,
        "PERIOD_START_BALANCES": Reporting.period_start_balances,
        "AVERAGE_BALANCES": Reporting.average_balances,
        "RATIO_V1": Reporting.ratio_calculation_v1,
        "RATIO_V2": Reporting.ratio_calculation_v2
    }

    list_of_stat_reports = []
    group_by_variables = report_inputs["GROUP_BY_VARIABLES"][0].split(" || ")

    for index in report_inputs.index:
        function_to_call = statistics_dict[report_inputs.loc[index, "STATISTIC"]]
        if function_to_call == Reporting.weighted_mean:
            weighted_mean_report = function_to_call(model_to_summarise, group_by_variables,
                                                    report_inputs.loc[index, "VARIABLE_COLUMN"],
                                                    report_inputs.loc[index, "WEIGHT_VARIABLE"])
            list_of_stat_reports.append(weighted_mean_report)

        elif function_to_call in [
            Reporting.get_count_of_columns, Reporting.get_max_of_columns,
            Reporting.get_mean_of_columns, Reporting.get_min_of_columns,
            Reporting.get_sum_of_columns
        ]:
            columns_to_stat = report_inputs.loc[index, "COLUMNS_TO_STAT"].split(" || ")
            simple_stat_report = function_to_call(model_to_summarise,
                                                  group_by_variables,
                                                  columns_to_stat)
            list_of_stat_reports.append(simple_stat_report)

        elif function_to_call in [
            Reporting.period_end_balances,
            Reporting.period_start_balances,
            Reporting.average_balances
        ]:
            balances_df = function_to_call(model_to_summarise, group_by_variables,
                                           report_inputs.loc[index, "UNMODIFIED_DATE_COLUMN"],
                                           report_inputs.loc[index, "BALANCE_COLUMN"])
            list_of_stat_reports.append(balances_df)

        elif function_to_call == Reporting.ratio_calculation_v1:
            ratio_df_v1 = function_to_call(model_to_summarise, group_by_variables,
                                           report_inputs.loc[index, "NUMERATOR_VARIABLE"],
                                           report_inputs.loc[index, "DENOMINATOR_VARIABLE"],
                                           report_inputs.loc[index, "RATIO_NAME"])
            list_of_stat_reports.append(ratio_df_v1)

        elif function_to_call == Reporting.ratio_calculation_v2:
            ratio_df_v2 = function_to_call(model_to_summarise, group_by_variables,
                                           report_inputs.loc[index, "UNMODIFIED_DATE_COLUMN"],
                                           report_inputs.loc[index, "NUMERATOR_VARIABLE"],
                                           report_inputs.loc[index, "DENOMINATOR_VARIABLE"],
                                           report_inputs.loc[index, "RATIO_NAME"])
            list_of_stat_reports.append(ratio_df_v2)

        else:
            raise Exception("{missing_stat} is not available at the moment!"
                             .format(missing_stat=report_inputs.loc[index, "STATISTIC"]))

    return list_of_stat_reports, group_by_variables

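The first value returned by this function is a list of the DataFrames that were created (one for each statistic the user requested in the CSV file).

In this case, the list would be filled with weighted_mean_df, mean_df, period_end_balances_df and ratio_v2_df.

As you can see, each function takes different inputs (some take similar inputs, so I grouped those together in the if/else statements).

The dictionary statistics_dict is not very large at the moment, so writing an if/elif for each function is still manageable.

However, statistics_dict will grow to 30-40 entries, and writing an if/elif for every statistic is not good coding. Is there a way to make this more generic/dynamic?

At the moment I write an if/elif for the different statistics because they take different inputs.

This is a big question, so let me know if you need more clarification!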

1 Answer:

Answer 0: (score: 0)

I did it with a class like this:

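class ExampleClass:
    def __init__(self, var1, var2, var3):
        # List every variable that any of the functions needs.
        self.var1 = var1
        self.var2 = var2
        self.var3 = var3
        # ...and so on for the remaining variables.

    def func1(self):
        # func1 needs var1 and var3, so it reads self.var1 and self.var3.
        return self.var1, self.var3  # placeholder body

    def func2(self):
        # func2 needs var1 and var2, so it reads self.var1 and self.var2.
        return self.var1, self.var2  # placeholder body

    # ...and so on for all the other functions.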

Afterwards I modify the parse_report_input_table function like this:

def parse_report_input_table(model_to_summarise, report_inputs):
    """
    Parse the csv table with inputs from the user.
    """
    list_of_individual_stat_reports = []
    group_by_variables = report_inputs["GROUP_BY_VARIABLES"][0].split(" || ")
    for index in report_inputs.index:
        reporting = Reporting(model_to_summarise,
                              group_by_variables,
                              report_inputs.loc[index, "NUMERATOR_VARIABLE"],
                              report_inputs.loc[index, "DENOMINATOR_VARIABLE"],
                              report_inputs.loc[index, "RATIO_NAME"],
                              report_inputs.loc[index, "VARIABLE_COLUMN"],
                              report_inputs.loc[index, "WEIGHT_VARIABLE"],
                              report_inputs.loc[index, "COLUMNS_TO_STAT"],
                              report_inputs.loc[index, "UNMODIFIED_DATE_COLUMN"],
                              report_inputs.loc[index, "BALANCE_COLUMN"])
        statistics_dict = {
            "WEIGHTED_MEAN": reporting.weighted_mean,
            "MEAN": reporting.get_mean_of_columns,
            "SUM": reporting.get_sum_of_columns,
            "MAX": reporting.get_max_of_columns,
            "MIN": reporting.get_min_of_columns,
            "COUNT": reporting.get_count_of_columns,
            "PERIOD_END_BALANCES": reporting.period_end_balances,
            "PERIOD_START_BALANCES": reporting.period_start_balances,
            "AVERAGE_BALANCES": reporting.average_balances,
            "RATIO_V1": reporting.ratio_calculation_v1,
            "RATIO_V2": reporting.ratio_calculation_v2
        }
        statistic = report_inputs.loc[index, "STATISTIC"]
        list_of_individual_stat_reports.append(statistics_dict[statistic]())
    return list_of_individual_stat_reports, group_by_variables


This way, all the parameters are set when I instantiate the class, but the function I actually call uses only the ones it needs.
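
To make the idea concrete, here is a small self-contained sketch of the pattern. The class `ToyReporting`, its two methods, and the sample values are made up for illustration; they stand in for the real Reporting class, which is not shown in this post. It also assumes the usual pandas behaviour of reading an empty CSV cell as NaN. Such placeholder values are harmless as long as each method touches only the attributes it needs, so any parsing (such as the " || " split) probably belongs inside the method rather than in `__init__`.

```python
import math


class ToyReporting:
    """Hypothetical stand-in for the Reporting class (not the real one)."""

    def __init__(self, values, weights, columns_to_stat):
        # Store every possible input; nothing is parsed or validated here.
        self.values = values
        self.weights = weights
        self.columns_to_stat = columns_to_stat

    def weighted_mean(self):
        # Uses only values and weights.
        total_weight = sum(self.weights)
        return sum(v * w for v, w in zip(self.values, self.weights)) / total_weight

    def column_names(self):
        # Uses only columns_to_stat; the split happens lazily, here.
        return self.columns_to_stat.split(" || ")


# A WEIGHTED_MEAN row leaves COLUMNS_TO_STAT empty; NaN mimics that empty cell.
reporting = ToyReporting([1.0, 3.0], [1.0, 3.0], math.nan)
result = reporting.weighted_mean()  # works: the NaN attribute is never touched
```

Calling `reporting.column_names()` on this instance would fail, because NaN is a float and has no `split` method. That is the trade-off of this design: a wrong STATISTIC value surfaces as an error inside the method rather than when the object is created.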

Improvements are welcome, since I had not used Python classes much before this.
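
One possible refinement, offered as a sketch rather than a drop-in replacement: keep the original functions unchanged and store, next to each function, the names of the CSV columns it needs. A single loop then builds the argument list, so adding a statistic means adding one dictionary entry. The stand-in functions, the `STATISTICS` table, the `split_list` helper, and the sample row below are all hypothetical; only the column names are taken from the question.

```python
# Hypothetical stand-ins for the real Reporting functions; each one simply
# echoes its arguments so the dispatch can be checked.
def weighted_mean(model, group_by, variable_column, weight_variable):
    return ("weighted_mean", group_by, variable_column, weight_variable)


def get_mean_of_columns(model, group_by, columns_to_stat):
    return ("mean", group_by, columns_to_stat)


def split_list(cell):
    """Turn an 'a || b' cell into ['a', 'b']."""
    return cell.split(" || ")


# statistic name -> (function, [(csv column, optional converter), ...])
STATISTICS = {
    "WEIGHTED_MEAN": (weighted_mean,
                      [("VARIABLE_COLUMN", None), ("WEIGHT_VARIABLE", None)]),
    "MEAN": (get_mean_of_columns, [("COLUMNS_TO_STAT", split_list)]),
}


def run_statistic(model, group_by, row):
    """Look up the row's statistic and call it with only the columns it needs."""
    name = row["STATISTIC"]
    if name not in STATISTICS:
        raise ValueError(f"{name} is not available at the moment!")
    function, columns = STATISTICS[name]
    args = [convert(row[col]) if convert else row[col] for col, convert in columns]
    return function(model, group_by, *args)


# A plain dict stands in for one row of report_inputs here.
row = {"STATISTIC": "MEAN", "COLUMNS_TO_STAT": "balance || rate"}
report = run_statistic(None, ["region"], row)
```

With a pandas DataFrame, each row yielded by `report_inputs.iterrows()` supports the same `row[column]` lookup, so the loop over the input table would only need to call `run_statistic` once per row and append the result.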