Aggregating data from one DataFrame into another

Time: 2019-06-22 15:42:20

Tags: python pandas dataframe mapreduce

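I would like to ask for your help.

In my work I have two DataFrames. The first, df_card_features, holds card features, and its card_id column contains a unique ID for each card. The second, df_cart_historic, holds history data for the cards in the first DataFrame; its card_id column is not unique, but its values match those in the card_id column of the first DataFrame.

As a solution, I thought about building a dictionary and then adding it to the DataFrame, but performance-wise this seems very expensive to me, because the history CSV file is about 5 GB.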

import pandas as pd

# card features:
card_id = ['card_a', 'card_b', 'card_c', 'card_d', 'card_e']
date_activation = ['2019-02-01', '2019-05-02', '2018-01-20', '2015-07-23', '2013-07-23']
feature_1_1 = [0, 1, 1, 1, 0]
feature_1_2 = [1, 0, 0, 0, 1]
df_card_features = pd.DataFrame()
df_card_features['card_id'] = card_id
df_card_features['date_activation'] = date_activation
df_card_features['feature_1_1'] = feature_1_1
df_card_features['feature_1_2'] = feature_1_2
df_card_features.head()


# card historic
card_id = ['card_a', 'card_b', 'card_c', 'card_d', 'card_e', 'card_a', 'card_b', 'card_c', 'card_d', 'card_e', 'card_a', 'card_b', 'card_c', 'card_d', 'card_e']
denied_purchase = ['N', 'Y', 'N', 'Y', 'N', 'N', 'N', 'N', 'N', 'Y', 'N', 'Y', 'N', 'N', 'Y']
purchase_date = ['2019-02-01', '2019-02-01', '2019-02-01', '2019-02-01', '2019-02-01', '2019-02-10', '2019-02-11', '2019-02-21', '2019-03-01', '2019-03-01', '2019-03-01', '2019-03-31', '2018-04-01', '2016-02-01', '2013-12-01']
installments = [0, 0, 0, 0, 5, 0, 0, 0, 0, 5, 0, 0, 8, 4, 0 ]
month_lag = [0, 0, 0, 0, 5, 0, 0, 0, 0, 5, 0, 0, 0, 0, 5]
df_cart_historic = pd.DataFrame()
df_cart_historic['card_id'] = card_id
df_cart_historic['denied_purchase'] = denied_purchase
df_cart_historic['purchase_date'] = purchase_date
df_cart_historic['installments'] = installments
df_cart_historic['month_lag'] = month_lag

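I need to create the following columns in the df_card_features DataFrame:

  1. "denied_purchase?": 1 if at least one Y appears in the denied_purchase column of df_cart_historic for that card_id, and 0 if no Y appears for that card_id.
  2. "oldest_Date": the oldest date in the purchase_date column of df_cart_historic.
  3. "max_installments": the maximum value of the installments column of df_cart_historic.
  4. "max_month_lag": the maximum value of the month_lag column of df_cart_historic.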

1 answer:

Answer 0 (score: 2)

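You need to use groupby on the 'card_id' column of df_cart_historic, so that each new column is built only from the rows that share the same 'card_id' value.
By calling groupby('card_id').apply(func), you can use a custom function func to do the job.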
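To make the mechanism concrete, here is a minimal sketch on a toy frame. The data and the body of func are made up for illustration and are not taken from the question; the point is only that func receives the rows of one card_id at a time and that the per-group results are collected under the card_id index.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    'card_id': ['card_a', 'card_b', 'card_a'],
    'installments': [1, 5, 3],
})

def func(group):
    # `group` contains only the rows belonging to a single card_id.
    return group['installments'].max()

# Selecting the value column explicitly keeps the grouping key out of `group`.
per_card = toy.groupby('card_id')[['installments']].apply(func)
print(per_card)
```

The result is a Series indexed by card_id, with one value per card.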

Here is a working example:

import pandas as pd

# card features:
card_id = ['card_a', 'card_b', 'card_c', 'card_d', 'card_e']
date_activation = ['2019-02-01', '2019-05-02', '2018-01-20', '2015-07-23', '2013-07-23']
feature_1_1 = [0, 1, 1, 1, 0]
feature_1_2 = [1, 0, 0, 0, 1]
df_card_features = pd.DataFrame()
df_card_features['card_id'] = card_id
df_card_features['date_activation'] = pd.to_datetime(date_activation)  # converting to datetime
df_card_features['feature_1_1'] = feature_1_1
df_card_features['feature_1_2'] = feature_1_2
df_card_features.head()

# card historic
card_id = ['card_a', 'card_b', 'card_c', 'card_d', 'card_e', 'card_a', 'card_b', 'card_c', 'card_d', 'card_e', 'card_a', 'card_b', 'card_c', 'card_d', 'card_e']
denied_purchase = ['N', 'Y', 'N', 'Y', 'N', 'N', 'N', 'N', 'N', 'Y', 'N', 'Y', 'N', 'N', 'Y']
purchase_date = ['2019-02-01', '2019-02-01', '2019-02-01', '2019-02-01', '2019-02-01', '2019-02-10', '2019-02-11', '2019-02-21', '2019-03-01', '2019-03-01', '2019-03-01', '2019-03-31', '2018-04-01', '2016-02-01', '2013-12-01']
installments = [0, 0, 0, 0, 5, 0, 0, 0, 0, 5, 0, 0, 8, 4, 0]
month_lag = [0, 0, 0, 0, 5, 0, 0, 0, 0, 5, 0, 0, 0, 0, 5]
df_cart_historic = pd.DataFrame()
df_cart_historic['card_id'] = card_id
df_cart_historic['denied_purchase'] = denied_purchase
df_cart_historic['purchase_date'] = pd.to_datetime(purchase_date)  # converting to datetime
df_cart_historic['installments'] = installments
df_cart_historic['month_lag'] = month_lag

df_card_features.set_index('card_id', inplace=True)  # using card_id column as index

def getnewcols(x):
    # x holds the history rows of a single card_id
    res = pd.DataFrame()
    res['denied_purchase?'] = pd.Series(['Y' if 'Y' in x['denied_purchase'].unique() else 'N'])
    res['oldest_Date'] = x['purchase_date'].min()
    res['max_installments'] = x['installments'].max()
    res['max_month_lag'] = x['month_lag'].max()
    return res

newcols = df_cart_historic.groupby('card_id').apply(getnewcols)
newcols = newcols.reset_index().drop('level_1', axis=1).set_index('card_id')
df_card_features_final = pd.concat([df_card_features, newcols], axis=1)

Note that the columns containing dates are parsed with pandas.to_datetime, so that they hold datetime objects instead of plain strings (very useful when working with dates).
newcols is the DataFrame holding the new columns, and df_card_features_final is the final DataFrame containing all columns:

df_card_features_final
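Since the question mentions a history file of about 5 GB and asks for "denied_purchase?" as 1/0, a variant built on pandas' built-in aggregations may be worth considering, because a Python-level function called once per group tends to be slow on large inputs. The following is only a sketch under stated assumptions, not the answer author's method: the helper names summarize_history and summarize_history_chunked are hypothetical, and the chunked version assumes the history arrives as an iterable of DataFrames, for example from pd.read_csv with the chunksize and parse_dates arguments. It relies on max and min being safe to combine across partial results.

```python
import pandas as pd

def summarize_history(history: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: one row per card_id with the four requested columns."""
    # Turn Y/N into 1/0 so that "at least one Y" becomes a plain max.
    flagged = history.assign(denied_flag=history['denied_purchase'].eq('Y').astype(int))
    return flagged.groupby('card_id').agg(**{
        'denied_purchase?': ('denied_flag', 'max'),
        'oldest_Date': ('purchase_date', 'min'),
        'max_installments': ('installments', 'max'),
        'max_month_lag': ('month_lag', 'max'),
    })

def summarize_history_chunked(chunks) -> pd.DataFrame:
    """Hypothetical helper: same summary, computed chunk by chunk."""
    partials = [summarize_history(chunk) for chunk in chunks]
    # Re-aggregate the per-chunk summaries; max of maxes and min of mins are exact.
    return pd.concat(partials).groupby(level=0).agg({
        'denied_purchase?': 'max',
        'oldest_Date': 'min',
        'max_installments': 'max',
        'max_month_lag': 'max',
    })

# Sample data from the question.
features = pd.DataFrame({
    'card_id': ['card_a', 'card_b', 'card_c', 'card_d', 'card_e'],
    'date_activation': pd.to_datetime(['2019-02-01', '2019-05-02', '2018-01-20', '2015-07-23', '2013-07-23']),
    'feature_1_1': [0, 1, 1, 1, 0],
    'feature_1_2': [1, 0, 0, 0, 1],
})
history = pd.DataFrame({
    'card_id': ['card_a', 'card_b', 'card_c', 'card_d', 'card_e'] * 3,
    'denied_purchase': ['N', 'Y', 'N', 'Y', 'N', 'N', 'N', 'N', 'N', 'Y', 'N', 'Y', 'N', 'N', 'Y'],
    'purchase_date': pd.to_datetime([
        '2019-02-01', '2019-02-01', '2019-02-01', '2019-02-01', '2019-02-01',
        '2019-02-10', '2019-02-11', '2019-02-21', '2019-03-01', '2019-03-01',
        '2019-03-01', '2019-03-31', '2018-04-01', '2016-02-01', '2013-12-01']),
    'installments': [0, 0, 0, 0, 5, 0, 0, 0, 0, 5, 0, 0, 8, 4, 0],
    'month_lag': [0, 0, 0, 0, 5, 0, 0, 0, 0, 5, 0, 0, 0, 0, 5],
})

summary = summarize_history(history)
# Two slices stand in for the chunks a chunked CSV reader would yield.
chunked = summarize_history_chunked([history.iloc[:7], history.iloc[7:]])

# Attach the summary columns to the feature table, keeping every card.
result = features.merge(summary, left_on='card_id', right_index=True, how='left')
print(result)
```

With how='left', cards that have no history rows would keep their feature values and get missing values in the four new columns.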