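How can I simplify merging DataFrames?

Asked: 2017-05-12 10:52:51

Tags: python python-2.7 pandas dataframe

I have several DataFrames with the columns coupon_id and rating. I want to merge them into a single DataFrame in which each coupon_id appears once and its rating is the sum of that coupon_id's ratings across all of the DataFrames.

For example, suppose I have two DataFrames: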

| coupon_id  | rating      |
|:-----------|------------:|
| 1          |          40 |
| 2          |          60 |
| 3          |          50 |

| coupon_id  | rating      |
|:-----------|------------:|
| 4          |          70 |
| 2          |          80 |
| 3          |          60 |

As a result I want to get this DataFrame:

| coupon_id  | rating      |
|:-----------|------------:|
| 1          |          40 |     
| 2          |         140 |    
| 3          |         110 |
| 4          |          70 |     

To solve this I use the following code. It works, but it is very inefficient:

similar_users_ratings = pd.DataFrame(columns=['coupon_id', 'rating'])

for similarUser in most_similar_users:
    # similarUser[0] is the patient_id, similarUser[1] is the similarity weight
    similar_user_ratings = self.ratingData.loc[self.ratingData['patient_id'] == similarUser[0], :].copy()

    similar_user_ratings.loc[:, 'rating'] = similar_user_ratings.loc[:, 'rating'].apply(lambda x: int(x) * similarUser[1])
    del similar_user_ratings['patient_id']
    similar_users_ratings = similar_users_ratings.merge(similar_user_ratings, on='coupon_id', how='outer')
    similar_users_ratings['rating_y'].fillna(.0, inplace=True)
    similar_users_ratings['rating_x'].fillna(.0, inplace=True)
    similar_users_ratings['rating'] = similar_users_ratings['rating_x'] + similar_users_ratings['rating_y']
    del similar_users_ratings['rating_y']
    del similar_users_ratings['rating_x']

How can I simplify this code? Thanks.

In practice I have several DataFrames, for example:

      coupon_id  rating
69           12       1

      coupon_id  rating
101          37       1

      coupon_id  rating
428          11       1

      coupon_id  rating
1133         11       1

Desired result:

 coupon_id   rating
     12        1
     37        1
     11        2

1 Answer:

Answer 0 (score: 1)

UPDATE:

In [46]: d1
Out[46]:
    coupon_id  rating
69         12       1

In [47]: d2
Out[47]:
     coupon_id  rating
101         37       1

In [48]: d3
Out[48]:
     coupon_id  rating
428         11       1

In [49]: d4
Out[49]:
      coupon_id  rating
1133         11       1

In [50]: pd.concat([d1,d2,d3,d4],ignore_index=True).groupby('coupon_id', as_index=False)['rating'].sum()
Out[50]:
   coupon_id  rating
0         11       2
1         12       1
2         37       1
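
Applied to the loop in the question, the same idea might look like the sketch below: weight every rating in one pass, then let a single groupby do the summing instead of repeated outer merges. It assumes `ratingData` has the columns `patient_id`, `coupon_id` and `rating`, and that `most_similar_users` is a list of `(patient_id, similarity)` pairs with each user listed once, as the question's code suggests. The helper name `weighted_rating_sums` and the sample values are made up for illustration.

```python
import pandas as pd


def weighted_rating_sums(rating_data, most_similar_users):
    """Sum similarity-weighted ratings per coupon_id without a merge loop."""
    # Hypothetical helper: map each similar user's patient_id to its weight.
    weights = pd.Series(dict(most_similar_users))

    # Keep only the rows that belong to the similar users.
    rows = rating_data[rating_data['patient_id'].isin(weights.index)]

    # Scale every rating by the weight of the user who gave it.
    weighted = rows['rating'].astype(float) * rows['patient_id'].map(weights)

    # One groupby replaces the repeated outer merges.
    return (weighted.groupby(rows['coupon_id']).sum()
                    .rename('rating')
                    .reset_index())


# Made-up sample data, for illustration only.
rating_data = pd.DataFrame({
    'patient_id': [1, 1, 2, 2, 3],
    'coupon_id': [11, 12, 11, 37, 12],
    'rating': [1, 1, 1, 1, 1],
})
most_similar_users = [(1, 0.5), (2, 1.0)]

result = weighted_rating_sums(rating_data, most_similar_users)
print(result)
```

Patient 3 is not in `most_similar_users`, so its rating is ignored. Coupons that none of the similar users rated simply do not appear in the result.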

OLD answer:

In [219]: d1.set_index('coupon_id').add(d2.set_index('coupon_id'), fill_value=0) \
            .reset_index()
Out[219]:
   coupon_id  rating
0          1    40.0
1          2   140.0
2          3   110.0
3          4    70.0
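
If you prefer this index-aligned approach for more than two DataFrames, one possible sketch is to fold the list with `functools.reduce`. The third frame `d3` below is made up for illustration, and the sketch assumes each coupon_id occurs at most once within a single DataFrame. The ratings come back as floats because of the alignment, so they are cast back to int at the end.

```python
from functools import reduce

import pandas as pd

d1 = pd.DataFrame({'coupon_id': [1, 2, 3], 'rating': [40, 60, 50]})
d2 = pd.DataFrame({'coupon_id': [4, 2, 3], 'rating': [70, 80, 60]})
d3 = pd.DataFrame({'coupon_id': [1, 5], 'rating': [10, 20]})  # made-up third frame

# Index every frame by coupon_id so that add() aligns rows on it.
frames = [df.set_index('coupon_id') for df in (d1, d2, d3)]

# Add the frames pairwise; fill_value=0 treats a missing coupon as a rating of 0.
total = reduce(lambda left, right: left.add(right, fill_value=0), frames)

# Restore integer ratings and turn coupon_id back into a regular column.
total = total.sort_index().astype(int).reset_index()
print(total)
```

When a coupon_id can repeat inside one DataFrame, the concat/groupby version in the update is the safer choice, since it sums duplicates as well.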