Question

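I have the following DataFrame. How can I create a new column that keeps the cities which together represent 80% of all values? In this case those are 'a', 'b' and 'c'. The remaining cities should be labeled 'other'.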

import pandas as pd

values = ['a','a','a','a','a','a','a','a','b','b','b','b','b','b','c','c','c','c','c','d','d','d','e','e','f']
db = pd.DataFrame(values,columns = ['city'])

db['city'].value_counts(normalize=True)

a    0.32
b    0.24
c    0.20
d    0.12
e    0.08
f    0.04

Desired output

db['city_freq'] = ['a','a','a','a','a','a','a','a','b','b','b','b','b','b','c','c','c','c','c','other','other','other','other','other','other']

Answer 1

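Use Series.cumsum on the normalized counts and take the index labels whose cumulative sum exceeds the threshold. Then select the matching rows with Series.isin and DataFrame.loc and replace their values: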

s = db['city'].value_counts(normalize=True).cumsum()

print (s)
a    0.32
b    0.56
c    0.76
d    0.88
e    0.96
f    1.00

print (s.index[s > 0.8])
Index(['d', 'e', 'f'], dtype='object')

db.loc[db['city'].isin(s.index[s > 0.8]), 'city'] = 'other'
print (db)
     city
0       a
1       a
2       a
3       a
4       a
5       a
6       a
7       a
8       b
9       b
10      b
11      b
12      b
13      b
14      c
15      c
16      c
17      c
18      c
19  other
20  other
21  other
22  other
23  other
24  other
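
The code above overwrites the city column, while the question asks for a new column. A minimal sketch of one way to keep the original column, assuming the same data as in the question; using Series.where here is a swapped-in choice, not part of the original answer:

```python
import pandas as pd

values = list("a" * 8 + "b" * 6 + "c" * 5 + "d" * 3 + "e" * 2 + "f")
db = pd.DataFrame(values, columns=["city"])

# Cumulative share of rows, most frequent city first.
s = db["city"].value_counts(normalize=True).cumsum()

# Cities whose cumulative share exceeds the 80% threshold.
rare = s.index[s > 0.8]

# Keep the original value where the city is not rare, otherwise use 'other'.
# The result goes into a new column, so 'city' stays unchanged.
db["city_freq"] = db["city"].where(~db["city"].isin(rare), "other")

print(db["city_freq"].value_counts())
```

Series.where keeps values where the condition is True and substitutes the second argument elsewhere, so no prior copy of the column is needed.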

Another solution maps each row to its city's cumulative sum with Series.map and then compares the result with the threshold:

s = db['city'].value_counts(normalize=True).cumsum()

db.loc[db['city'].map(s) > 0.8, 'city'] = 'other'

Details:

print (db['city'].map(s))
0     0.32
1     0.32
2     0.32
3     0.32
4     0.32
5     0.32
6     0.32
7     0.32
8     0.56
9     0.56
10    0.56
11    0.56
12    0.56
13    0.56
14    0.76
15    0.76
16    0.76
17    0.76
18    0.76
19    0.88
20    0.88
21    0.88
22    0.96
23    0.96
24    1.00
Name: city, dtype: float64
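
The map-based approach can be wrapped in a small reusable function. This is only a sketch: the name label_rare and its parameters threshold and other are hypothetical, chosen for illustration, and it returns a new Series instead of modifying the input.

```python
import pandas as pd


def label_rare(series: pd.Series, threshold: float = 0.8, other: str = "other") -> pd.Series:
    """Replace values whose cumulative share exceeds `threshold` with `other`.

    Hypothetical helper; it follows the same cutoff rule as the answer:
    a value is kept only if the cumulative share up to and including it
    is at most `threshold`.
    """
    # Cumulative share per distinct value, most frequent first.
    cum_share = series.value_counts(normalize=True).cumsum()
    # Map every row to its value's cumulative share, then keep or replace.
    return series.where(series.map(cum_share) <= threshold, other)


values = list("a" * 8 + "b" * 6 + "c" * 5 + "d" * 3 + "e" * 2 + "f")
db = pd.DataFrame(values, columns=["city"])
db["city_freq"] = label_rare(db["city"])
print(db["city_freq"].unique())
```

Note the cutoff rule: 'a', 'b' and 'c' together cover 76% of the rows, and 'd' is labeled 'other' because including it would push the cumulative share to 88%. With threshold=0.5 only 'a' would be kept, since 'b' already brings the cumulative share to 56%.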
