Question

I have a DataFrame:

import numpy as np
import pandas as pd
id = [0,0,0,0,1,1,1,1]
color = ['red','blue','red','black','blue','red','black','black']
test = pd.DataFrame(zip(id, color), columns = ['id', 'color'])

and I want to create a column holding a running count of unique colors, grouped by id, so that the final DataFrame looks like this:

   id  color  expanding_unique_count
0   0    red                       1
1   0   blue                       2
2   0    red                       2
3   0  black                       3
4   1   blue                       1
5   1    red                       2
6   1  black                       3
7   1  black                       3

I tried this simple approach:

def len_unique(x):
    return(len(np.unique(x)))

test['expanding_unique_count'] = test.groupby('id')['color'].apply(lambda x: pd.expanding_apply(x, len_unique))

but got ValueError: could not convert string to float: black

If I change the colors to integers:

color = [1,2,1,3,2,1,3,3]

test = pd.DataFrame(zip(id, color), columns = ['id', 'color'])

then running the same code as above produces the desired result. Is there a way to make this work while keeping the color column as strings?

Answer 1

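It looks like expanding_apply and rolling_apply are mainly meant for numeric values. One option is to create a numeric column that encodes the color strings as numbers (this can be done by making the color column categorical) and then run expanding_apply on that column.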

# processing
# ===================================
# create numeric label
test['numeric_label'] = pd.Categorical(test['color']).codes
# output: array([2, 1, 2, 0, 1, 2, 0, 0], dtype=int8)

# your expanding function
test['expanding_unique_count'] = test.groupby('id')['numeric_label'].apply(lambda x: pd.expanding_apply(x, len_unique))
# drop the auxiliary column (drop returns a new DataFrame; assign it back to keep the change)
test.drop('numeric_label', axis=1)

   id  color  expanding_unique_count
0   0    red                       1
1   0   blue                       2
2   0    red                       2
3   0  black                       3
4   1   blue                       1
5   1    red                       2
6   1  black                       3
7   1  black                       3
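
pd.expanding_apply was removed from later pandas releases, so the snippet above only runs on old versions. The following is a minimal sketch of the same categorical-codes idea using the .expanding().apply(...) method, assuming a pandas version that supports the raw argument. The DataFrame is rebuilt from a dict here only to keep the example self-contained; the intermediate names codes and counts are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd

test = pd.DataFrame({
    'id': [0, 0, 0, 0, 1, 1, 1, 1],
    'color': ['red', 'blue', 'red', 'black', 'blue', 'red', 'black', 'black'],
})

# Encode each color string as a number; expanding windows operate on floats.
codes = pd.Series(pd.Categorical(test['color']).codes, index=test.index).astype(float)

# Expanding count of distinct codes within each id group.
counts = (
    codes.groupby(test['id'])
         .expanding()
         .apply(lambda x: len(np.unique(x)), raw=True)
)

# The result is indexed by (id, original index); drop the id level to realign.
test['expanding_unique_count'] = counts.droplevel(0).astype(int)
print(test)
```

No auxiliary column is added to test in this version, so there is nothing to drop afterwards.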

Edit:

def func(group):
    return pd.Series(1, index=group.groupby('color').head(1).index).reindex(group.index).fillna(0).cumsum()

test['expanding_unique_count'] =  test.groupby('id', group_keys=False).apply(func)
print(test)

   id  color  expanding_unique_count
0   0    red                       1
1   0   blue                       2
2   0    red                       2
3   0  black                       3
4   1   blue                       1
5   1    red                       2
6   1  black                       3
7   1  black                       3
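
The func above marks the first occurrence of each color within a group and takes a cumulative sum of those marks. The same idea can be sketched without apply, assuming DataFrame.duplicated is used to flag repeated (id, color) pairs. This is an alternative formulation rather than the answer's own code, and the name is_first is illustrative.

```python
import pandas as pd

test = pd.DataFrame({
    'id': [0, 0, 0, 0, 1, 1, 1, 1],
    'color': ['red', 'blue', 'red', 'black', 'blue', 'red', 'black', 'black'],
})

# True on the first row where a given (id, color) pair appears, False on repeats.
is_first = ~test.duplicated(subset=['id', 'color'])

# A per-id cumulative sum of the first-occurrence flags is the running unique count.
test['expanding_unique_count'] = is_first.astype(int).groupby(test['id']).cumsum()
print(test)
```

Because the flags are cast to int before the cumulative sum, the resulting column holds integers and the color column stays as strings throughout.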
