Question

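Suppose I work for a company that offers different types of loans. We get our loan information from a big data mart, and from it I need to calculate some additional things, such as whether someone is in arrears. Right now, for clarity's sake, I do this with a rather dumb function that iterates over all rows (each row stores all the information for a loan) using pd.DataFrame.apply(myFunc, axis=1), which is very slow.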

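Now that we are growing and I need to process more and more data, I am starting to worry about performance. Below is an example of functions that I call a lot and would like to optimize (some of my own ideas are mentioned further down). These functions are applied to a DataFrame that contains, among others, the following fields: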

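Loan_Type: a field containing a string that determines the type of loan. We have many different names, but it boils down to 4 types (for this example): type 1 and type 2, each with or without a staff member holding the loan.
Activity_Date: the date on which the loan activity was recorded (this is a daily loan activity table, if that tells you anything).
Product_Account_Status: the status the table gives these loans on Activity_Date (are they active, or in some other state?). This needs to be recalculated because it is not always computed in the table (don't ask why, it's a complete headache).
Activation_Date: the date on which the loan was activated.
Sum_Paid_To_Date: the sum paid toward the loan as of Activity_Date.
Deposit_Amount: the deposit amount for the loan.
Last_Paid_Date: the date of the last payment made toward the loan.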

So here are two example functions:

    def productType(x):
        # Determines the type of the product, for later aggregation purposes,
        # and to determine the amount to be payable per day
        loanType = x['Loan_Type']
        isStaff = 'Staff' in loanType
        if 'Loan Type 1' in loanType and not isStaff:
            return 'Loan1'
        elif 'Loan Type 2' in loanType and not isStaff:
            return 'Loan2'
        elif 'Loan Type 1' in loanType and isStaff:
            return 'Loan1Staff'
        elif 'Loan Type 2' in loanType and isStaff:
            return 'Loan2Staff'
        elif 'Mobile' in loanType or 'MM' in loanType:
            return 'Other'
        else:
            raise ValueError(
                'A payment plan is not captured in the code, please check it!')

This function is then applied to the DataFrame AllLoans, which contains all the loans I want to analyze at that moment, using:

    AllLoans['productType'] = AllLoans.apply(productType, axis=1)
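For reference, a minimal sketch of how the same rules might be expressed column-wise instead of row by row. It assumes Loan_Type is a string column; the helper name product_type_vectorized and the sample loan names are hypothetical, and missing values are treated as unmatched (so they raise the same ValueError rather than a TypeError).

```python
import numpy as np
import pandas as pd


def product_type_vectorized(loan_type: pd.Series) -> pd.Series:
    """Column-wise counterpart of productType (hypothetical helper)."""
    # Plain substring tests, evaluated once per column; NaN counts as no match.
    is_type1 = loan_type.str.contains('Loan Type 1', regex=False, na=False)
    is_type2 = loan_type.str.contains('Loan Type 2', regex=False, na=False)
    is_staff = loan_type.str.contains('Staff', regex=False, na=False)
    is_other = (loan_type.str.contains('Mobile', regex=False, na=False)
                | loan_type.str.contains('MM', regex=False, na=False))

    # np.select picks the first matching condition, mirroring the if/elif order.
    conditions = [
        is_type1 & ~is_staff,
        is_type2 & ~is_staff,
        is_type1 & is_staff,
        is_type2 & is_staff,
        is_other,
    ]
    choices = ['Loan1', 'Loan2', 'Loan1Staff', 'Loan2Staff', 'Other']
    result = pd.Series(np.select(conditions, choices, default='UNKNOWN'),
                       index=loan_type.index)

    # Keep the original behaviour of failing loudly on uncaptured plans.
    if (result == 'UNKNOWN').any():
        raise ValueError(
            'A payment plan is not captured in the code, please check it!')
    return result


# Made-up sample names, only to show the call.
AllLoans = pd.DataFrame(
    {'Loan_Type': ['Loan Type 1 Basic', 'Staff Loan Type 2', 'Mobile Money']})
AllLoans['productType'] = product_type_vectorized(AllLoans['Loan_Type'])
```

Each substring test runs once over the whole column, so the per-row Python function call that makes apply(..., axis=1) slow is avoided.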

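I then want to apply some other functions; one example is given below. This function determines whether a loan is blocked, depending on how long someone has gone without paying and on some other statuses that are important but are stored as strings in the loan table. Examples are whether people are cancelled (blocked for too long) or have some other status; we treat customers differently based on these labels.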

    import pandas as pd


    def customerStatus(x):
        # Sets the customer status based on the column Product_Account_Status or
        # the days of inactivity

        if x['productType'] == 'Loan1':
            dailyAmount = 2
        elif x['productType'] == 'Loan2':
            dailyAmount = 2.5
        elif x['productType'] == 'Loan1Staff':
            dailyAmount = 1
        elif x['productType'] == 'Loan2Staff':
            dailyAmount = 1.5
        else:
            raise ValueError(
                'Daily amount to be paid could not be calculated, check if productType is defined.')

        if x['Product_Account_Status'] == 'Cancelled':
            return 'Cancelled'
        elif x['Product_Account_Status'] == 'Suspended':
            return 'Suspended'
        elif x['Product_Account_Status'] == 'Pending Deposit':
            return 'Pending Deposit'
        elif x['Product_Account_Status'] == 'Pending Allocation':
            return 'Pending Allocation'
        elif x['Outstanding_Balance'] == 0:
            return 'Finished Payment'
        # A non-null difference means Last_Paid_Date is set, i.e. a payment
        # has been recorded for this loan
        elif not pd.isna(x['Date_Of_Activity'] - x['Last_Paid_Date']):
            daysSincePaid = (x['Date_Of_Activity'] - x['Last_Paid_Date']).days + 1
            paidAboveDeposit = x['Sum_Paid_To_Date'] - x['Deposit_Amount']
            if daysSincePaid > 30 or (daysSincePaid > 14 and paidAboveDeposit <= dailyAmount):
                return 'Blocked'
            elif daysSincePaid <= 30:
                return 'Active'
        # Otherwise Last_Paid_Date is zero/null; as far as I can see this means
        # that the customer has only paid the deposit and is thus an FPD, so it
        # will fall on the age of the customer whether they are blocked or not
        else:
            # The date is changed here to 14 because of FPD definition
            daysSinceActivation = (x['Date_Of_Activity'] - x['Activation_Date']).days + 1
            if daysSinceActivation <= 14:
                return 'Active'
            elif daysSinceActivation > 14:
                return 'Blocked'
        # If we have reached the end and still haven't found the status, it will
        # get the following status
        return 'Other Status'

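This function is applied in the same way, using AllLoans['customerStatus'] = AllLoans.apply(lambda x: customerStatus(x), axis = 1). As you can see, there are many string comparisons and date comparisons, and I am a bit stuck on how to 'properly' vectorize these functions.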
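To make the vectorization question concrete, here is a rough sketch of the status rules written with whole-column operations. It assumes the three date columns are datetime64 and that productType already exists. The names customer_status_vectorized, DAILY_AMOUNT and PASS_THROUGH are hypothetical, and the sample rows are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical lookup tables mirroring the constants in customerStatus.
DAILY_AMOUNT = {'Loan1': 2, 'Loan2': 2.5, 'Loan1Staff': 1, 'Loan2Staff': 1.5}
PASS_THROUGH = ['Cancelled', 'Suspended', 'Pending Deposit', 'Pending Allocation']


def customer_status_vectorized(df: pd.DataFrame) -> pd.Series:
    daily_amount = df['productType'].map(DAILY_AMOUNT)
    if daily_amount.isna().any():
        raise ValueError('Daily amount to be paid could not be calculated, '
                         'check if productType is defined.')

    # Day counts per row; NaN wherever one of the dates is missing (NaT).
    since_paid = (df['Date_Of_Activity'] - df['Last_Paid_Date']).dt.days + 1
    since_activation = (df['Date_Of_Activity'] - df['Activation_Date']).dt.days + 1
    has_paid = since_paid.notna()
    paid_little = (df['Sum_Paid_To_Date'] - df['Deposit_Amount']) <= daily_amount

    # First matching condition wins, like the if/elif chain.
    conditions = [
        df['Outstanding_Balance'] == 0,
        has_paid & ((since_paid > 30) | ((since_paid > 14) & paid_little)),
        has_paid,
        ~has_paid & (since_activation <= 14),
        ~has_paid & (since_activation > 14),
    ]
    choices = ['Finished Payment', 'Blocked', 'Active', 'Active', 'Blocked']
    derived = pd.Series(np.select(conditions, choices, default='Other Status'),
                        index=df.index)

    # Pass-through statuses take priority over every derived status.
    status = df['Product_Account_Status']
    return derived.where(~status.isin(PASS_THROUGH), status)


# Invented sample rows, one per rule of interest.
loans = pd.DataFrame({
    'productType': ['Loan1', 'Loan2', 'Loan1', 'Loan1Staff', 'Loan1'],
    'Product_Account_Status': ['Active', 'Cancelled', 'Active', 'Active', 'Active'],
    'Outstanding_Balance': [100.0, 50.0, 80.0, 0.0, 90.0],
    'Date_Of_Activity': pd.to_datetime(['2017-03-31'] * 5),
    'Last_Paid_Date': pd.to_datetime(
        ['2017-03-25', '2017-01-01', None, '2017-03-30', '2017-03-11']),
    'Activation_Date': pd.to_datetime(
        ['2017-01-01', '2016-12-01', '2017-01-15', '2017-01-01', '2017-01-01']),
    'Sum_Paid_To_Date': [60.0, 40.0, 20.0, 120.0, 21.0],
    'Deposit_Amount': [20.0, 20.0, 20.0, 20.0, 20.0],
})
loans['customerStatus'] = customer_status_vectorized(loans)
```

The date arithmetic is done once per column with the .dt accessor, and np.select evaluates the conditions in order, so the priority of the original branches is preserved.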

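Apologies if this is optimization 101; I tried searching for answers and strategies on how to do this but could not find a really comprehensive answer. I hope to get some tips here. Thank you for your time.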

Some thoughts I had on speeding this up / making it more vectorized:

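Make the customerStatus function slightly more modular by creating a function that determines the daily amount, and store that amount in the DataFrame for faster access (I need it again later, and I currently determine this variable in multiple functions).
Use some kind of dict to convert the input column of the productType function to integers, so that fewer string functions need to be called (though I have the feeling this will not be my biggest speed-up).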
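The two ideas above could look roughly like this: classify each distinct Loan_Type string once, map the result back onto the column, and store the daily amount as its own column. This is only a sketch under the assumption that Loan_Type has no missing values; classify_loan_type, add_product_columns and DAILY_AMOUNT are hypothetical names, and it uses a category dtype rather than hand-made integer codes.

```python
import pandas as pd

# Hypothetical lookup mirroring the constants inside customerStatus.
DAILY_AMOUNT = {'Loan1': 2, 'Loan2': 2.5, 'Loan1Staff': 1, 'Loan2Staff': 1.5}


def classify_loan_type(name: str) -> str:
    """Scalar productType rules, applied to a single Loan_Type string."""
    is_staff = 'Staff' in name
    if 'Loan Type 1' in name:
        return 'Loan1Staff' if is_staff else 'Loan1'
    if 'Loan Type 2' in name:
        return 'Loan2Staff' if is_staff else 'Loan2'
    if 'Mobile' in name or 'MM' in name:
        return 'Other'
    raise ValueError(
        'A payment plan is not captured in the code, please check it!')


def add_product_columns(df: pd.DataFrame) -> pd.DataFrame:
    # Run the string checks once per distinct name, not once per row.
    lookup = {name: classify_loan_type(name)
              for name in df['Loan_Type'].unique()}
    product_type = df['Loan_Type'].map(lookup)

    out = df.copy()
    # 'Other' has no daily amount in the original code, so it becomes NaN.
    out['dailyAmount'] = product_type.map(DAILY_AMOUNT)
    # A category dtype stores small integer codes instead of repeated strings.
    out['productType'] = product_type.astype('category')
    return out


# Made-up sample names, only to show the call.
sample = pd.DataFrame({'Loan_Type': ['Loan Type 1 Basic', 'Staff Loan Type 2',
                                     'Loan Type 1 Basic', 'MM Wallet']})
enriched = add_product_columns(sample)
```

With only a handful of distinct loan names, the string checks run a handful of times regardless of how many rows the table has, and later functions can read dailyAmount directly instead of recomputing it.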

Some things I would like to do but do not know where to start with:

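How do I properly vectorize these functions, which contain many if statements based on string/date comparisons across different columns of the DataFrame (the business rules can get a bit complicated here)? The code may end up somewhat complex, but I need to apply these functions many times to slightly (but importantly) different DataFrames that keep getting bigger, so the functions need to live in some kind of library for easy access, and the code needs to be sped up because it simply takes too long.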

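I tried looking into solutions such as Numba or Cython, but I do not know enough about the inner workings of C to use them properly (or rather, I would like to learn). Any suggestions on how to improve performance would be greatly appreciated.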

Kind regards,

Tim
