Question

这是我的数据框的一个例子：

df_lst = [
  {"wordcount": 100, "Stats": 198765, "id": 34},
     {"wordcount": 99, "Stats": 98765, "id": 35},
     {"wordcount": 200, "Stats": 18765, "id": 36},
     {"wordcount": 250, "Stats": 788765, "id": 37},
     {"wordcount": 345, "Stats": 12765, "id": 38},
     {"wordcount": 456, "Stats": 238765, "id": 39},
     {"wordcount": 478, "Stats": 1934, "id": 40},
     {"wordcount": 890, "Stats": 19845, "id": 41},
     {"wordcount": 812, "Stats": 1987, "id": 42}]
df = pd.DataFrame(df_lst)
df.set_index('id', inplace=True)
df.head()

DF：

    Stats   wordcount
id      
34  198765  100
35  98765   99
36  18765   200
37  788765  250
38  12765   345

我想计算Stats的每个范围的平均值wordcount，步长为100，因此新数据框看起来像这样：

    Average wordcount
    194567  100
    23456   200
    2378    300
    ...

其中100表示0-100等。我开始编写多个条件，但感觉有更有效的方法来实现这一点。非常感谢您的帮助。

Answer 1

使用pd.cut()方法：

In [92]: bins = np.arange(0, df['wordcount'].max().round(-2) + 100, 100)

In [94]: df.groupby(pd.cut(df['wordcount'], bins=bins, labels=bins[1:]))['Stats'].mean()
Out[94]:
wordcount
100    148765.0
200     18765.0
300    788765.0
400     12765.0
500    120349.5
600         NaN
700         NaN
800         NaN
900     10916.0
Name: Stats, dtype: float64

Answer 2

import math
def roundup(x):
    return int(math.ceil(x / 100.0)) * 100
df['roundup']=df.wordcount.apply(roundup)
df.groupby('roundup').Stats.mean()
Out[824]: 
roundup
100    148765.0
200     18765.0
300    788765.0
400     12765.0
500    120349.5
900     10916.0
Name: Stats, dtype: float64

熊猫：按等距离分组

2 个答案: