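I'm trying to manipulate a DataFrame. This involves creating columns based on the values in other columns and updating their values. In this case, I have a given payment_type that I want to categorize. It can fall into three categories: cash, deb_cred, gift_card. I want to add three new columns to the DataFrame, each made up of 1s or 0s depending on which category the payment type belongs to.
I'm currently able to do this, but it is very slow (multiple hours on an AWS M4 instance for a dataset of about 70k rows and ~20 columns).
Sample of the original column: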
_id  Payment tender types
1    debit
2    comptant
3    visa
4    mastercard
5    tim card
6    cash
7    gift
Desired output:
_id  Payment tender types  pay_cash  pay_deb_cred  pay_gift
1    debit                 0         1             0
2    comptant              1         0             0
3    visa                  0         1             0
4    mastercard            0         1             0
5    tim card              0         0             1
6    cash                  1         0             0
7    gift                  0         0             1
My current code:
Note: data is a DataFrame of shape (70000, 20) loaded before this snippet.
import numpy as np

# For 'Payment tender types' we will use the following classes:
payment_cats = ['pay_cash', 'pay_deb_cred', 'pay_gift_card']
# [0, 0, 0] would imply 'other', hence no need for a fourth category

# note that certain types are just pieces of the name: e.g. master for "mastercard" and "master card"
types = ['debit', 'tim', 'cash', 'visa', 'amex', 'master',
         'digital', 'comptant', 'gift', 'débit']
cash_types = ['cash', 'comptant']
deb_cred_types = ['debit', 'visa', 'amex', 'master', 'digital', 'débit',
                  'discover', 'bit', 'mobile']
gift_card_types = ['tim', 'gift']

# add new features to dataframe, initializing to nan
for cat in payment_cats:
    data[cat] = np.nan

for row in data.itertuples():
    # list holding the result per row, e.g. [1, 0, 0] for `cash`
    cat = [0, 0, 0]
    index = row[0]
    # to string as some entries are numerical
    payment_type = str(row.paymenttendertypes).lower()
    if any(ct in payment_type for ct in cash_types):
        cat[0] = 1
    if any(dbt in payment_type for dbt in deb_cred_types):
        cat[1] = 1
    if any(gct in payment_type for gct in gift_card_types):
        cat[2] = 1
    # write the result into the payment_cats columns of this row
    data.loc[index, payment_cats] = cat
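I'm using itertuples() because it is faster than iterrows().
Is there a faster way to achieve the same result? Can this be done without iterating over the entire DataFrame?
Note: this is not just about creating a one-hot encoding. It boils down to updating column values depending on the value of another column. For example, another use case: given a certain location_id, I want to update its respective longitude and latitude columns based on that original id (without iterating the way I do above, since that is really slow for large datasets).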
Answer 0 (score: 2)
I'm pretty sure what you need is:
targets = cash_types, deb_cred_types, gift_card_types
payments = data.Payment.str.lower()
for col_name, words in zip(payment_cats, targets):
    # isin checks each value for an exact match against words and yields booleans
    data[col_name] = payments.isin(words)
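As a possible variation: isin only matches whole values, whereas the loop in the question tests for substrings (e.g. 'master' inside 'mastercard'). Below is a minimal sketch of a substring-based version using str.contains. It assumes the column is named 'Payment tender types' as in the sample; the small frame is reconstructed from the question purely for illustration.

```python
import re

import pandas as pd

# Illustrative sample reconstructed from the question's "original column" table
data = pd.DataFrame({
    "_id": [1, 2, 3, 4, 5, 6, 7],
    "Payment tender types": ["debit", "comptant", "visa", "mastercard",
                             "tim card", "cash", "gift"],
})

cash_types = ["cash", "comptant"]
deb_cred_types = ["debit", "visa", "amex", "master", "digital", "débit",
                  "discover", "bit", "mobile"]
gift_card_types = ["tim", "gift"]

# Map each new column name to the substrings that should trigger it
categories = {
    "pay_cash": cash_types,
    "pay_deb_cred": deb_cred_types,
    "pay_gift_card": gift_card_types,
}

# Cast to str first, since the question mentions some entries are numerical
payments = data["Payment tender types"].astype(str).str.lower()

for col_name, words in categories.items():
    # Build one alternation pattern per category; escape in case a word
    # ever contains regex metacharacters
    pattern = "|".join(re.escape(w) for w in words)
    # Vectorized substring test over the whole column, stored as 0/1
    data[col_name] = payments.str.contains(pattern, regex=True).astype(int)
```

Each category is handled by a single pass over the column instead of a Python-level loop over rows, which is where most of the speedup would be expected to come from.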
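Note that the original code using itertuples is a bit odd, because you keep indexing back into the DataFrame just to recover the row you are already iterating over, e.g.
str(data.loc[index, 'payment_tender_types']).lower()
could simply be row.Payment.lower()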
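For the other use case raised in the question (filling latitude and longitude from a location_id), the same idea of avoiding row iteration should apply. Here is a rough sketch using Series.map with a lookup table; the coords frame, its values, and the column names are hypothetical and only serve to illustrate the approach.

```python
import pandas as pd

# Hypothetical data: the ids and coordinates below are made up for illustration
data = pd.DataFrame({"location_id": [10, 20, 10, 30]})

# Hypothetical lookup table, indexed by location_id
coords = pd.DataFrame({
    "location_id": [10, 20],
    "latitude": [45.5, 43.7],
    "longitude": [-73.6, -79.4],
}).set_index("location_id")

# Series.map looks each id up in the index of the given Series;
# ids missing from the lookup table (30 here) come back as NaN
data["latitude"] = data["location_id"].map(coords["latitude"])
data["longitude"] = data["location_id"].map(coords["longitude"])
```

A merge on location_id would be another option when several columns need to be pulled in at once.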