Question

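I'm working on

https://www.kaggle.com/c/competitive-data-science-final-project

and I want to write my own scaler. To do that, I want to create a new column holding the mean revenue per shop_id, year and quarter. I intend to scale the revenue data by this value. So this question is about getting the values from a grouped Series back into the DataFrame.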

The grouped Series:

all_sales.groupby(by=["year","quarter","shop_id"]).revenue.mean()

// ..........................................
// : year : quarter : shop_id :             :
// :......:.........:.........:.............:
// : 2013 :       1 :       0 :  673.366136 :
// :      :         :       1 :  570.307679 :
// :      :         :       2 : 1060.903808 :
// :      :         :       3 :  742.238854 :
// :      :         :       4 :  793.700453 :
// :      :         :       5 :  634.066920 :
// :......:.........:.........:.............:

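I want to map it back onto the rows. I searched for other solutions, but I only found ones that work for a fixed value; for example, if I wanted to map the mean for shop_id 54 alone, I would just use apply(...). I have been thinking about using lambda, map, or applymap.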
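One way to map a grouped Series onto every row, rather than onto a single fixed shop, is to join it back on the grouping keys. This is only a minimal sketch: the small `all_sales` frame below is a made-up stand-in for the real data, and `group_means` and `keys` are names introduced here for illustration.

```python
import pandas as pd

# Made-up stand-in for all_sales; the values are for illustration only.
all_sales = pd.DataFrame({
    "year":    [2013, 2013, 2013, 2013, 2014],
    "quarter": [1, 1, 1, 2, 1],
    "shop_id": [28, 28, 41, 28, 41],
    "revenue": [399.0, 299.0, 119.0, 100.0, 50.0],
})

keys = ["year", "quarter", "shop_id"]

# Compute the grouped means once; the result is a Series whose
# MultiIndex has the levels year, quarter, shop_id.
group_means = all_sales.groupby(keys)["revenue"].mean().rename("Q_avg")

# Match each row's (year, quarter, shop_id) against that MultiIndex.
all_sales = all_sales.join(group_means, on=keys)
print(all_sales)
```

If the frame already has a Q_avg column (as the dataset below does, filled with 0), it would need to be dropped first, otherwise the join complains about overlapping column names.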

This is the dataset at the starting stage:

+---------+------------+----------------+---------+---------+------------+--------------+-----------+-----------+------------------+-----------------+---------+---------+------+---------+-------+
|         |    date    | date_block_num | shop_id | item_id | item_price | item_cnt_day | latitude  | longitude | item_category_id | category_number | weekday | revenue | year | quarter | Q_avg |
+---------+------------+----------------+---------+---------+------------+--------------+-----------+-----------+------------------+-----------------+---------+---------+------+---------+-------+
| 2580899 | 2013-01-01 |              0 |      28 |   10832 |      399.0 |          1.0 | 55.604232 | 37.491973 |               40 |               4 |       1 |   399.0 | 2013 |       1 |     0 |
|  835894 | 2013-01-01 |              0 |      41 |   12750 |      119.0 |          1.0 | 47.289983 | 39.847001 |               40 |               4 |       1 |   119.0 | 2013 |       1 |     0 |
|  519245 | 2013-01-01 |              0 |      28 |    1249 |      299.0 |          1.0 | 55.604232 | 37.491973 |               55 |              12 |       1 |   299.0 | 2013 |       1 |     0 |
| 2578875 | 2013-01-01 |              0 |      42 |   10555 |      199.0 |          1.0 | 59.932022 | 30.359193 |               55 |              12 |       1 |   199.0 | 2013 |       1 |     0 |
|  303337 | 2013-01-01 |              0 |       7 |    3325 |     1199.0 |          1.0 | 51.696833 | 39.273076 |               30 |              14 |       1 |  1199.0 | 2013 |       1 |     0 |
| 2508902 | 2013-01-01 |              0 |      27 |   12623 |      649.0 |          1.0 | 55.658314 | 37.845215 |               37 |               4 |       1 |   649.0 | 2013 |       1 |     0 |
|  330640 | 2013-01-01 |              0 |      19 |   17707 |      899.0 |          1.0 | 51.737952 | 36.192223 |               19 |               8 |       1 |   899.0 | 2013 |       1 |     0 |
|  316521 | 2013-01-01 |              0 |       7 |    3693 |      299.5 |          1.0 | 51.696833 | 39.273076 |               21 |               8 |       1 |   299.5 | 2013 |       1 |     0 |
|  835868 | 2013-01-01 |              0 |      15 |   12750 |      119.0 |          1.0 | 54.516081 | 36.246664 |               40 |               4 |       1 |   119.0 | 2013 |       1 |     0 |
|   60981 | 2013-01-01 |              0 |      28 |    5272 |      598.5 |          1.0 | 55.604232 | 37.491973 |               30 |              14 |       1 |   598.5 | 2013 |       1 |     0 |
+---------+------------+----------------+---------+---------+------------+--------------+-----------+-----------+------------------+-----------------+---------+---------+------+---------+-------+

This is too slow, but it should work:

avg=[]
for i in range(len(all_sales)):
    shopid=all_sales.shop_id.iloc[i]
    year=all_sales.year.iloc[i]
    quarter=all_sales.quarter.iloc[i]
    avg.append(all_sales.groupby(by=["year","quarter","shop_id"]).revenue.mean().loc[year].loc[quarter].loc[shopid])
    print(i)
all_sales.Q_avg=avg

This is faster, but it does not save the values in Q_avg:

all_sales["Q_avg"]=0
for year in [2013,2014,2015]:
    for quar in [1,2,3,4]:
        for shopid in all_sales[(all_sales.year==year)&(all_sales.quarter==quar)].shop_id.unique():
            all_sales[(all_sales.year==year)&(all_sales.quarter==quar)&(all_sales.shop_id==shopid)].Q_avg=all_sales.groupby(by=["year","quarter","shop_id"]).revenue.mean().loc[year].loc[quar].loc[shopid]
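The most likely reason nothing is saved is that `all_sales[mask]` returns a new DataFrame, so setting `.Q_avg` on it changes that temporary object and not `all_sales` itself. A hedged sketch of the same loop that writes through `.loc` with a boolean mask, and computes the groupby only once, could look like this (the small frame is made-up stand-in data, and `group_means` and `mask` are illustrative names):

```python
import pandas as pd

# Made-up stand-in for all_sales; the values are for illustration only.
all_sales = pd.DataFrame({
    "year":    [2013, 2013, 2013, 2013, 2014],
    "quarter": [1, 1, 1, 2, 1],
    "shop_id": [28, 28, 41, 28, 41],
    "revenue": [399.0, 299.0, 119.0, 100.0, 50.0],
})
all_sales["Q_avg"] = 0.0  # float placeholder, so the means fit the column dtype

# Compute the grouped means once, outside the loop.
group_means = all_sales.groupby(["year", "quarter", "shop_id"])["revenue"].mean()

# Each index entry of the grouped Series is a (year, quarter, shop_id) tuple.
for (year, quarter, shop_id), mean_revenue in group_means.items():
    mask = (
        (all_sales["year"] == year)
        & (all_sales["quarter"] == quarter)
        & (all_sales["shop_id"] == shop_id)
    )
    # .loc[row_mask, column] assigns into all_sales itself, not into a copy.
    all_sales.loc[mask, "Q_avg"] = mean_revenue
```

This still loops over every group, so it is mainly useful for seeing why the assignment was lost.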

The goal (in Q_avg):


// ...............................................................................................................................................................................................................
// :         :    date    : date_block_num : shop_id : item_id : item_price : item_cnt_day : latitude  : longitude : item_category_id : category_number : weekday : revenue : year : quarter :       Q_avg       :
// :.........:............:................:.........:.........:............:..............:...........:...........:..................:.................:.........:.........:......:.........:...................:
// : 2580899 : 2013-01-01 :              0 :      28 :   10832 :      399.0 :          1.0 : 55.604232 : 37.491973 :               40 :               4 :       1 :   399.0 : 2013 :       1 : 846.1797826935596 :
// :  835894 : 2013-01-01 :              0 :      41 :   12750 :      119.0 :          1.0 : 47.289983 : 39.847001 :               40 :               4 :       1 :   119.0 : 2013 :       1 : 992.6000783392107 :
// :  519245 : 2013-01-01 :              0 :      28 :    1249 :      299.0 :          1.0 : 55.604232 : 37.491973 :               55 :              12 :       1 :   299.0 : 2013 :       1 : 846.1797826935596 :
// : 2578875 : 2013-01-01 :              0 :      42 :   10555 :      199.0 :          1.0 : 59.932022 : 30.359193 :               55 :              12 :       1 :   199.0 : 2013 :       1 : 980.5774001136369 :
// :  303337 : 2013-01-01 :              0 :       7 :    3325 :     1199.0 :          1.0 : 51.696833 : 39.273076 :               30 :              14 :       1 :  1199.0 : 2013 :       1 : 868.1195129284332 :
// :.........:............:................:.........:.........:............:..............:...........:...........:..................:.................:.........:.........:......:.........:...................:

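My code is too slow, and I'm looking for a faster solution. I want the last column, Q_avg, to be filled with the values from the groupby (based on year, quarter, shop_id).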
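A loop-free way to fill such a column is `groupby(...).transform("mean")`, which returns one value per original row, aligned with the original index. The sketch below uses made-up stand-in data. The `revenue_scaled` column and the choice to divide by the mean are assumptions about what the scaler should do, not something stated above.

```python
import pandas as pd

# Made-up stand-in for all_sales; the values are for illustration only.
all_sales = pd.DataFrame({
    "year":    [2013, 2013, 2013, 2013, 2014],
    "quarter": [1, 1, 1, 2, 1],
    "shop_id": [28, 28, 41, 28, 41],
    "revenue": [399.0, 299.0, 119.0, 100.0, 50.0],
})

# transform broadcasts each group's mean back to every row of that group.
all_sales["Q_avg"] = (
    all_sales.groupby(["year", "quarter", "shop_id"])["revenue"].transform("mean")
)

# Hypothetical scaling step: revenue relative to the group mean.
all_sales["revenue_scaled"] = all_sales["revenue"] / all_sales["Q_avg"]
print(all_sales)
```

Because the result is aligned by index, it can be assigned directly even when the index is not 0..n-1, as in the dataset shown above.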

Map values from a groupby to a DataFrame in Python

0 answers: