Question

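I am learning pandas and how data is organized within the module.

I am following tutorials and the documentation to handle a basic task: computing the percentage of occurrences of a state ('color') within a bin ('site'). The code below will hopefully clarify my thinking and what I am trying to do: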

import pandas as pd
import random

# example of a few first entries generated below: 
# [('site2', 'red'), ('site3', 'red'), ('site1', 'yellow'), ...
sites = ['site1', 'site2', 'site3']
colors = ['red', 'blue', 'yellow']
d = []
for i in range(0,100):
    s = (
        sites[random.randint(0, 2)],
        colors[random.randint(0, 2)],
    )
    d.append(s)

df = pd.DataFrame(d)
df.columns = ['site', 'color']

grouped = df.groupby(['site', 'color'])
p = grouped.size()

# the whole group
print(p)
# the number of instances of 'blue' in 'site2'
print(p['site2']['blue'])
# the total number of instances for 'site2'
print(p['site2'].sum())

The output is as expected: "for a given site, show the number of events with a particular color"

site   color 
site1  blue      16
       red       11
       yellow     6
site2  blue       9
       red       12
       yellow    12
site3  blue      11
       red        7
       yellow    16
dtype: int64
9
33

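What I want to achieve is to generate a new column in the grouped data that holds the percentage of a given color at a given site. In effect, for the example above, this would be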

site1  blue      16 48.5
       red       11 33.3
       yellow     6 18.2
site2  blue       9 27.3
(...)

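I clearly have the numbers needed for the calculation (the last two outputs are an example), but I don't know how to actually loop through the groups to add the computed percentage.

p = grouped.size() is of type Series. Can I somehow enrich it with the computed percentages?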

Answer 1

This can be computed by dividing size by its sum over the first level of the index:

In [38]:

grouped.size() / grouped.size().sum(level=0) * 100
Out[38]:
site   color 
site1  blue      25.714286
       red       45.714286
       yellow    28.571429
site2  blue      32.432432
       red       43.243243
       yellow    24.324324
site3  blue      32.142857
       red       39.285714
       yellow    28.571429
dtype: float64

Of course, because the input values are random, my output above differs from yours.
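A note for newer pandas versions: as far as I'm aware, the `level` argument of `sum` was deprecated in pandas 1.3 and removed in 2.0, so the expression above may raise an error there. Below is a minimal sketch of the same idea using `groupby(level=...)` with `transform` instead. The name `counts` and the hard-coded numbers are my own assumptions for illustration: they are the counts printed in the question, used in place of the random data so the result is reproducible.

```python
import pandas as pd

# Counts copied from the question's printed output (illustrative stand-in
# for grouped.size(); the original data is random).
counts = pd.Series(
    [16, 11, 6, 9, 12, 12, 11, 7, 16],
    index=pd.MultiIndex.from_product(
        [["site1", "site2", "site3"], ["blue", "red", "yellow"]],
        names=["site", "color"],
    ),
)

# Per-site totals, broadcast back onto every (site, color) row.
site_totals = counts.groupby(level="site").transform("sum")

# Share of each color within its site, in percent.
percent = counts / site_totals * 100
print(percent.round(1))
```

Because `transform("sum")` returns a Series with the same index as `counts`, the division is a plain element-wise operation and the percentages within each site add up to 100.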

Edit

It is more readable to pass the name of the level you want to sum over:

In [46]:

grouped.size() / grouped.size().sum(level='site') * 100
Out[46]:
site   color 
site1  blue      25.714286
       red       45.714286
       yellow    28.571429
site2  blue      32.432432
       red       43.243243
       yellow    24.324324
site3  blue      32.142857
       red       39.285714
       yellow    28.571429
dtype: float64
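Since the question asks for the percentage as a new column next to the counts, here is a rough sketch of one way to get that layout. It assumes the counts printed in the question (hard-coded so the example is reproducible); the names `counts`, `result`, `"count"` and `"percent"` are mine, not from the original post. It uses `Series.div` with its `level` argument to divide by the per-site totals, and `groupby(level=...)` for the totals so it should also run on recent pandas versions.

```python
import pandas as pd

# Counts copied from the question's printed output (illustrative stand-in
# for grouped.size()).
counts = pd.Series(
    [16, 11, 6, 9, 12, 12, 11, 7, 16],
    index=pd.MultiIndex.from_product(
        [["site1", "site2", "site3"], ["blue", "red", "yellow"]],
        names=["site", "color"],
    ),
    name="count",
)

# One total per site (index: site).
site_totals = counts.groupby(level="site").sum()

# Turn the Series into a DataFrame and add the percentage column;
# div(..., level="site") matches each row to the total of its own site.
result = counts.to_frame()
result["percent"] = (counts.div(site_totals, level="site") * 100).round(1)
print(result)
```

With the real data, `counts` would simply be `grouped.size()`, optionally renamed with `.rename("count")`.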
