Incremental groupby in pandas

Date: 2020-07-10 20:01:29

Tags: python pandas dataframe pandas-groupby

I am trying to do an incremental groupby and ranking in pandas.

Sample DF:

import numpy as np
import pandas as pd

# a, b and c are random, so the table below shows the values from one particular run
df = pd.DataFrame(np.random.randint(0, 20, size=(7, 3)), columns=["a", "b", "c"])
df["d1"] = ["Apple", "Mango", "Apple", "Mango", "Mango", "Mango", "Apple"]
df["d2"] = ["Orange", "lemon", "lemon", "Orange", "lemon", "Orange", "lemon"]
df["date"] = ["2002-01-01", "2002-01-01", "2002-01-01", "2002-01-01", "2002-02-01", "2002-02-01", "2002-02-01"]
df["date"] = pd.to_datetime(df["date"])
df


    a   b   c     d1      d2     date
0   7   1   4   Apple   Orange  2002-01-01
1   3   7   6   Mango   lemon   2002-01-01
2   9   6   9   Apple   lemon   2002-01-01
3   0   5   8   Mango   Orange  2002-01-01
4   4   6   7   Mango   lemon   2002-02-01
5   4   3   8   Mango   Orange  2002-02-01
6   0   2   8   Apple   lemon   2002-02-01

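I am trying to group by d1 incrementally and rank each row by column c, based on the other column d1.

For position [0,"d1"], the value Apple will be ranked 0, since there is only one row and no comparison exists.

For position [1,"d1"], the value Mango will be 1. Considering the first two rows, the corresponding values of column c are 4 for Apple ([0,"c"]) and 6 for Mango ([1,"c"]), so Mango has the higher rank in this sliced DF.

For position [2,"d1"], the value Apple will be 1. Considering the first three rows, the corresponding values of column c are 4 for Apple ([0,"c"]), 6 for Mango ([1,"c"]) and 9 for Apple ([2,"c"]). The average of Apple's 2 values is (4+9)/2 = 6.5 while Mango's is 6, so Apple's rank is 1.

The same pattern is followed incrementally, updating the value of column d1 at the last index of each incrementally sliced DF.

Expected values for column d1: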

0
1
1
1 => since for Apple (4+9)/2 and for Mango (6+8)/2
1 => since for Apple (4+9)/2 and for Mango (6+8+7)/3
1 => since for Apple (4+9)/2 and for Mango (6+8+7+8)/4
0 => since for Apple (4+9+8)/3 and for Mango (6+8+7+8)/4

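I can do this in a for loop by iterating over the slices df[:i], but that would take forever for a large DF. Any suggestions for a pandas-based approach would be great.

Applying the first solution to the following random DF: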
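For reference, the loop-based baseline described above might look like the following minimal sketch. It rebuilds only columns c and d1 from the sample table. The function name incremental_rank_loop and the tie handling (method="min") are assumptions made for illustration, since the question does not say how ties should be ranked.

```python
import pandas as pd

# Columns c and d1 copied from the sample table in the question.
df = pd.DataFrame({
    "c": [4, 6, 9, 8, 7, 8, 8],
    "d1": ["Apple", "Mango", "Apple", "Mango", "Mango", "Mango", "Apple"],
})


def incremental_rank_loop(frame):
    """Hypothetical helper: rank each row's d1 group using only rows 0..i."""
    out = []
    for i in range(len(frame)):
        head = frame.iloc[: i + 1]                # slice up to and including row i
        means = head.groupby("d1")["c"].mean()    # mean of c per group so far
        ranks = means.rank(method="min") - 1      # 0 = lowest mean (tie rule assumed)
        out.append(int(ranks[frame["d1"].iloc[i]]))
    return out


expected = incremental_rank_loop(df)
print(expected)
```

Each iteration regroups the whole slice, so the cost grows roughly quadratically with the number of rows, which is why this only serves as a correctness check on small frames.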

    a   b   c     d1       d2    date
0   7   1   19  Apple   Orange  2002-01-01
1   3   7   17  Mango   lemon   2002-01-01
2   9   6   4   Apple   lemon   2002-01-01
3   0   5   15  Apple   Orange  2002-01-01
4   4   6   8   Mango   lemon   2002-02-01
5   4   3   1   Mango   Orange  2002-02-01
6   2   2   14  Apple   lemon   2002-02-01
7   5   15  10  Mango   Orange  2002-01-01
8   1   2   10  Apple   lemon   2002-02-01
9   2   1   12  Apple   Orange  2002-02-01

I get the following values for d1:

0
0
0
1
0
0
1
0
1
0
      

The last value is wrong, because at that point Apple's value is 12.33, i.e. (19 + 4 + 15 + 14 + 10 + 12) / 6, and Mango's value is 9, i.e. (17 + 8 + 1 + 10) / 4, so the last value of d1 should be 1.
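The arithmetic in the sentence above can be checked with a short sketch. It uses only columns c and d1 of the second table, and the 0-based rank with method="min" is an assumption about how ties would be handled.

```python
import pandas as pd

# Columns c and d1 copied from the second (10-row) table in the question.
df = pd.DataFrame({
    "c": [19, 17, 4, 15, 8, 1, 14, 10, 10, 12],
    "d1": ["Apple", "Mango", "Apple", "Apple", "Mango",
           "Mango", "Apple", "Mango", "Apple", "Apple"],
})

# Mean of c per group over all ten rows, i.e. the state at the last index.
final_means = df.groupby("d1")["c"].mean()

# 0-based rank of each group by its mean (0 = lowest).
final_ranks = final_means.rank(method="min") - 1

print(final_means)
print(final_ranks)
```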

1 answer:

Answer 0 (score: 1)

Update for the second dataframe:

   a   b   c     d1       d2    date
0   7   1   19  Apple   Orange  2002-01-01
1   3   7   17  Mango   lemon   2002-01-01
2   9   6   4   Apple   lemon   2002-01-01
3   0   5   15  Apple   Orange  2002-01-01
4   4   6   8   Mango   lemon   2002-02-01
5   4   3   1   Mango   Orange  2002-02-01
6   2   2   14  Apple   lemon   2002-02-01
7   5   15  10  Mango   Orange  2002-01-01
8   1   2   10  Apple   lemon   2002-02-01
9   2   1   12  Apple   Orange  2002-02-01

s = df.groupby('d1')['c'].expanding().mean().sort_index(level=1)

Output:

Apple  0    19.000000
Mango  1    17.000000
Apple  2    11.500000
       3    12.666667
Mango  4    12.500000
       5     8.666667
Apple  6    13.000000
Mango  7     9.000000
Apple  8    12.400000
       9    12.333333

What do we need to do at this point? Are these averages correct?

If I compare the averages using s.diff().ge(0), I get:

Apple  0    0
Mango  1    0
Apple  2    0
       3    1
Mango  4    0
       5    0
Apple  6    1
Mango  7    0
Apple  8    1
       9    0
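One possible reading of why the last flag comes out as 0 is that diff() subtracts whatever value sits on the previous row, which at index 9 is Apple's own earlier running mean rather than Mango's. The sketch below contrasts the two comparisons on the same data. It is an illustration under that reading, not the answerer's method.

```python
import pandas as pd

# Columns c and d1 copied from the second (10-row) table.
df = pd.DataFrame({
    "c": [19, 17, 4, 15, 8, 1, 14, 10, 10, 12],
    "d1": ["Apple", "Mango", "Apple", "Apple", "Mango",
           "Mango", "Apple", "Mango", "Apple", "Apple"],
})

# Running mean per group, back in original row order (same call as above).
s = df.groupby("d1")["c"].expanding().mean().sort_index(level=1)

# diff() compares row 9 (Apple) with row 8 (also Apple): 12.333 - 12.4 < 0.
last_diff = s.diff().iloc[-1]

# Cross-group view: one column per group, forward-filled to the latest mean.
wide = s.unstack(0).ffill()
apple_last = wide.loc[9, "Apple"]
mango_last = wide.loc[9, "Mango"]

print(last_diff, apple_last, mango_last)
```

At index 9 the row-to-row difference is negative even though Apple's running mean is still above Mango's, so the two comparisons disagree.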

IIUC,

Take a look at this:

df.groupby('d1')['c'].expanding().mean().sort_index(level=1)

Output:

Apple  0    4.00  # 4
Mango  1    6.00  # 6
Apple  2    6.50  # (4 + 9) / 2
Mango  3    7.00  # (6 + 8) / 2
       4    7.00  # (6 + 8 + 7) / 3
       5    7.25  # (6 + 8 + 7 + 8) / 4
Apple  6    7.00  # (4 + 9 + 8) / 3
Name: c, dtype: float64

Now, let's compare with the previous row:

df.groupby('d1')['c'].expanding().mean().sort_index(level=1).diff().ge(0).astype(int)

Output:

d1      
Apple  0    0
Mango  1    1
Apple  2    1
Mango  3    1
       4    1
       5    1
Apple  6    0
Name: c, dtype: int32

Or maybe you need to compare Mango with Apple's last value...

df.groupby('d1')['c'].expanding().mean().sort_index(level=1).unstack(0).ffill()

Output:

d1  Apple  Mango
0     4.0    NaN
1     4.0   6.00
2     6.5   6.00
3     6.5   7.00
4     6.5   7.00
5     6.5   7.25
6     7.0   7.25

However, I can't match your expected output:

df.groupby('d1')['c'].expanding().mean().sort_index(level=1).unstack(0).ffill().eval('rank= Mango >= Apple')

Output:

d1  Apple  Mango   rank
0     4.0    NaN  False
1     4.0   6.00   True
2     6.5   6.00  False
3     6.5   7.00   True
4     6.5   7.00   True
5     6.5   7.25   True
6     7.0   7.25   True
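The forward-filled table above seems to contain everything needed to reach the expected output: rank the groups across each row, then keep only the rank of the group that the row itself belongs to. The following is a minimal sketch of that idea, not the answerer's method. It assumes a default RangeIndex, and it assumes ties take the lower rank (method="min"), which the question does not specify.

```python
import numpy as np
import pandas as pd

# Columns c and d1 copied from the first sample table.
df = pd.DataFrame({
    "c": [4, 6, 9, 8, 7, 8, 8],
    "d1": ["Apple", "Mango", "Apple", "Mango", "Mango", "Mango", "Apple"],
})

# Running mean of c within each d1 group, back in original row order.
means = df.groupby("d1")["c"].expanding().mean().sort_index(level=1)

# One column per group; forward-fill so each row sees every group's latest mean.
wide = means.unstack(0).ffill()

# Rank groups within each row, 0 = lowest mean; groups not seen yet stay NaN.
ranks = wide.rank(axis=1, method="min") - 1

# For each row, pick the rank of that row's own group.
col_pos = ranks.columns.get_indexer(df["d1"])
result = ranks.to_numpy()[np.arange(len(df)), col_pos].astype(int)

print(result.tolist())
```

Because it ranks across columns, the same code should carry over to more than two groups. The positional lookup relies on the rows of ranks lining up with the rows of df, so a frame with a non-default index would need resetting first.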