Optimizing iteration over DataFrame rows

Date: 2017-10-14 21:35:37

Tags: python pandas numpy dataframe

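I am trying to process a DataFrame. This involves creating new columns based on the values in other columns and updating their values. In this case, I have a given payment_type that I want to classify. It can fall into one of three categories: cash, deb_cred, gift_card. I want to add three new columns to the DataFrame, each holding a 1 or a 0 depending on the given parameters.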

I am currently able to do this, but it is very slow (multiple hours on an AWS M4 instance for a dataset of ~70k rows and ~20 columns)...

Sample of the original column:

_id Payment tender types
1   debit
2   comptant
3   visa
4   mastercard
5   tim card
6   cash
7   gift

Desired output:

_id Payment tender types    pay_cash    pay_deb_cred    pay_gift
1   debit   0   1   0
2   comptant    1   0   0
3   visa    0   1   0
4   mastercard  0   1   0
5   tim card    0   0   1
6   cash    1   0   0
7   gift    0   0   1

My current code:
Note: data is a DataFrame of shape (70000, 20) loaded before this snippet.

import numpy as np

# For 'Payment tender types' we will use the following classes:
payment_cats = ['pay_cash', 'pay_deb_cred', 'pay_gift_card']
# [0, 0, 0] would imply 'other', hence no need for a fourth category

# note that certain types are just pieces of the name: e.g. master for "mastercard" and "master card"
types = ['debit', 'tim', 'cash', 'visa', 'amex', 'master',
     'digital', 'comptant', 'gift', 'débit']
cash_types = ['cash', 'comptant']
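# note the comma after 'débit': without it Python silently joins the two
# adjacent string literals into 'débitdiscover'
deb_cred_types = ['debit', 'visa', 'amex', 'master', 'digital', 'débit',
                  'discover', 'bit', 'mobile']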
gift_card_types = ['tim','gift']


# add new features to dataframe, initializing to nan
for cat in payment_cats:
    data[cat] = np.nan

for row in data.itertuples():
    # create series to hold the result per row e.g. [1, 0, 0] for `cash`
    cat = [0, 0, 0]
    index = row[0]
    # to string as some entries are numerical
    payment_type = str(row.paymenttendertypes).lower()
    if any(ct in payment_type for ct in cash_types):
        cat[0] = 1
    if any(dbt in payment_type for dbt in deb_cred_types):
        cat[1] = 1
    if any(gct in payment_type for gct in gift_card_types):
        cat[2] = 1
    # add series to payment_cat dataframe
    data.loc[index, payment_cats] = cat

I am using itertuples() because it is faster than iterrows().

Is there a faster way to achieve the same result as above? Can this be done without iterating over the entire df?

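Note: this is not just about creating a one-hot encoding. It boils down to updating column values depending on the value of another column. For example, another use case: given a particular location_id, I want to update its respective longitude and latitude columns based on that original id (without iterating the way I do above, since that is really slow for large datasets).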

1 answer:

Answer 0 (score: 2)

I'm pretty sure that what you need is:

targets = cash_types, deb_cred_types, gift_card_types 
payments = data.Payment.str.lower()
for col_name, words in zip(payment_cats, targets):
    data[col_name] = payments.isin(words)
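
One caveat with the snippet above: isin tests for exact equality, while the loop in the question tests whether any keyword is a substring of the payment type (e.g. 'master' inside 'mastercard'). A minimal sketch of a substring-based variant follows, assuming the column is literally named 'Payment tender types' as in the sample table; the small DataFrame is made up for illustration, and str.contains with an alternation pattern is swapped in for isin.

```python
import re

import pandas as pd

# Hypothetical sample mirroring the column shown in the question
data = pd.DataFrame({
    "Payment tender types": ["debit", "comptant", "visa", "mastercard",
                             "tim card", "cash", "gift"],
})

payment_cats = ["pay_cash", "pay_deb_cred", "pay_gift_card"]
cash_types = ["cash", "comptant"]
deb_cred_types = ["debit", "visa", "amex", "master", "digital", "débit",
                  "discover", "bit", "mobile"]
gift_card_types = ["tim", "gift"]
targets = [cash_types, deb_cred_types, gift_card_types]

# Cast to str first because some entries may be numeric, then lowercase once
payments = data["Payment tender types"].astype(str).str.lower()

for col_name, words in zip(payment_cats, targets):
    # Build one alternation pattern per category, e.g. "cash|comptant";
    # re.escape guards against keywords containing regex metacharacters
    pattern = "|".join(re.escape(w) for w in words)
    # Vectorized substring test over the whole column, stored as 0/1
    data[col_name] = payments.str.contains(pattern, regex=True).astype(int)

print(data)
```

Like the original loop, this leaves all three flags at 0 for payment types that match no keyword, and a row can be flagged in more than one category if it contains keywords from several lists.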

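Note that the original code using itertuples is a bit odd, because you keep indexing back into the DataFrame just to retrieve the row you are already iterating over, e.g.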

 str(data.loc[index, 'payment_tender_types']).lower()

This could simply be row.Payment.lower()
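
For the other use case mentioned in the question (filling in longitude and latitude from a location_id), the same idea of avoiding a row loop applies. A minimal sketch follows, assuming the coordinates are available in a separate lookup table; the names coords, latitude and longitude and all values are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Hypothetical data: ids and coordinates are made up for illustration
data = pd.DataFrame({"location_id": [10, 20, 10, 30]})

# Hypothetical lookup table, indexed by location_id
coords = pd.DataFrame({
    "location_id": [10, 20],
    "latitude": [45.5, 43.7],
    "longitude": [-73.6, -79.4],
}).set_index("location_id")

# Series.map looks up each location_id in the index of the given Series;
# ids missing from the lookup table (30 here) come out as NaN
data["latitude"] = data["location_id"].map(coords["latitude"])
data["longitude"] = data["location_id"].map(coords["longitude"])

print(data)
```

When many columns have to be brought over at once, a single merge on location_id may be more convenient than one map call per column.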