我有一个名为ranks的熊猫数据框,其中包含我的群集及其关键指标。我使用rank()
对它们进行排名,但是我想将两个特定的集群与其他集群进行排名。
ranks = pd.DataFrame(data={'Cluster': ['0', '1', '2',
'3', '4', '5','6', '7', '8', '9'],
'No. Customers': [145118,
2,
1236,
219847,
9837,
64865,
3855,
219549,
34171,
3924120],
'Ave. Recency': [39.0197,
47.0,
15.9716,
41.9736,
23.9330,
24.8281,
26.5647,
17.7493,
23.5205,
24.7933],
'Ave. Frequency': [1.7264,
19.0,
24.9101,
3.0682,
3.2735,
1.8599,
3.9304,
3.3356,
9.1703,
1.1684],
'Ave. Monetary': [14971.85,
237270.00,
126992.79,
17701.64,
172642.35,
13159.21,
54333.56,
17570.67,
42136.68,
4754.76]})
ranks['Ave. Spend'] = ranks['Ave. Monetary']/ranks['Ave. Frequency']
Cluster No. Customers| Ave. Recency| Ave. Frequency| Ave. Monetary| Ave. Spend|
0 0 145118 39.0197 1.7264 14,971.85 8,672.07
1 1 2 47.0 19.0 237,270.00 12,487.89
2 2 1236 15.9716 24.9101 126,992.79 5,098.02
3 3 219847 41.9736 3.0682 17,701.64 5,769.23
4 4 9837 23.9330 3.2735 172,642.35 52,738.42
5 5 64865 24.8281 1.8599 13,159.21 7,075.19
6 6 3855 26.5647 3.9304 54,333.56 13,823.64
7 7 219549 17.7493 3.3356 17,570.67 5,267.52
8 8 34171 23.5205 9.1703 42,136.68 4,594.89
9 9 3924120 24.7933 1.1684 4,754.76 4,069.21
然后我像这样应用rank()
方法:
ranks['r_rank'] = ranks['Ave. Recency'].rank()
ranks['f_rank'] = ranks['Ave. Frequency'].rank(ascending=False)
ranks['m_rank'] = ranks['Ave. Monetary'].rank(ascending=False)
ranks['s_rank'] = ranks['Ave. Spend'].rank(ascending=False)
ranks['overall'] = ranks.apply(lambda row: row.r_rank + row.f_rank + row.m_rank + row.s_rank, axis=1)
ranks['overall_rank'] = ranks['overall'].rank(method='first')
哪个给我这个:
Cluster No. Customers|Ave. Recency|Ave. Frequency|Ave. Monetary|Ave. Spend|r_rank|f_rank|m_rank|s_rank|overall|overall_rank
0 0 145118 39.0197 1.7264 14,971.85 8,672.07 8 9 8 4 29 9
1 1 2 47.0 19.0 237,270.00 12,487.89 10 2 1 3 16 3
2 2 1236 15.9716 24.9101 126,992.79 5,098.02 1 1 3 8 13 1
3 3 219847 41.9736 3.0682 17,701.64 5,769.23 9 7 6 6 28 7
4 4 9837 23.9330 3.2735 172,642.35 52,738.42 4 6 2 1 13 2
5 5 64865 24.8281 1.8599 13,159.21 7,075.19 6 8 9 5 28 8
6 6 3855 26.5647 3.9304 54,333.56 13,823.64 7 4 4 2 17 4
7 7 219549 17.7493 3.3356 17,570.67 5,267.52 2 5 7 7 21 6
8 8 34171 23.5205 9.1703 42,136.68 4,594.89 3 3 5 9 20 5
9 9 3924120 24.7933 1.1684 4,754.76 4,069.21 5 10 10 10 35 10
这是应该做的事情,但是Ave. Spend
最高的集群必须始终排在第1位,Ave. Recency
最高的集群必须一直排在最后。
所以我修改了上面的代码,使其看起来像这样:
if(ranks['s_rank'].min() == 1):
ranks['overall_rank_2'] = 1
elif(ranks['r_rank'].max() == len(ranks)):
ranks['overall_rank_2'] = len(ranks)
else:
ranks_2 = ranks.drop(ranks.index[[ranks[ranks['s_rank'] == ranks['s_rank'].min()].index[0],ranks[ranks['r_rank'] == ranks['r_rank'].max()].index[0]]])
ranks_2['r_rank'] = ranks_2['Ave. Recency'].rank()
ranks_2['f_rank'] = ranks_2['Ave. Frequency'].rank(ascending=False)
ranks_2['m_rank'] = ranks_2['Ave. Monetary'].rank(ascending=False)
ranks_2['s_rank'] = ranks_2['Ave. Spend'].rank(ascending=False)
ranks_2['overall'] = ranks.apply(lambda row: row.r_rank + row.f_rank + row.m_rank + row.s_rank, axis=1)
ranks['overall_rank_2'] = ranks_2['overall'].rank(method='first')
那我明白了
Cluster No. Customers|Ave. Recency|Ave. Frequency|Ave. Monetary|Ave. Spend|r_rank|f_rank|m_rank|s_rank|overall|overall_rank|overall_rank_2
0 0 145118 39.0197 1.7264 14,971.85 8,672.07 8 9 8 4 29 9 1
1 1 2 47.0 19.0 237,270.00 12,487.89 10 2 1 3 16 3 1
2 2 1236 15.9716 24.9101 126,992.79 5,098.02 1 1 3 8 13 1 1
3 3 219847 41.9736 3.0682 17,701.64 5,769.23 9 7 6 6 28 7 1
4 4 9837 23.9330 3.2735 172,642.35 52,738.42 4 6 2 1 13 2 1
5 5 64865 24.8281 1.8599 13,159.21 7,075.19 6 8 9 5 28 8 1
6 6 3855 26.5647 3.9304 54,333.56 13,823.64 7 4 4 2 17 4 1
7 7 219549 17.7493 3.3356 17,570.67 5,267.52 2 5 7 7 21 6 1
8 8 34171 23.5205 9.1703 42,136.68 4,594.89 3 3 5 9 20 5 1
9 9 3924120 24.7933 1.1684 4,754.76 4,069.21 5 10 10 10 35 10 1
请帮助我修改以上if语句,或者完全建议使用其他方法。这当然需要尽可能地动态。
答案 0 :(得分:2)
因此,您需要在数据帧上进行自定义排名,其中Ave. Spend
最高的集群(/行)始终排名1,Ave. Recency
最高的集群(/行)始终排名最后。
解决方案是五行。注意:
DataFrame.drop()
的想法是正确的,只需使用idxmax()
即可获取需要特殊处理的两行的索引并将其存储,因此您无需费劲drop
中的逻辑过滤器表达式。ranks_2 = ranks.drop(...)
;只需将drop()
的结果传递到rank()
... .sum(axis=1)
,无需定义Lambda或将其输出保存在临时列'overall
'中。rank()
,这将为我们提供1..8的值,因此我们加1以抵消rank()
的结果为2..9。 (您可以对此进行概括)。Ave. Spend
,Ave. Recency
行手动设置'overall_rank'。Ave.
列,否则为四个*_rank
列。)代码:(请参见底部的样板,以在您的数据框中读取,下次请制作示例MCVE,以帮助我们为您提供帮助>
# Compute raw ranks like you do
ranks['r_rank'] = ranks['Ave. Recency'].rank()
ranks['f_rank'] = ranks['Ave. Frequency'].rank(ascending=False)
ranks['m_rank'] = ranks['Ave. Monetary'].rank(ascending=False)
ranks['s_rank'] = ranks['Ave. Spend'].rank(ascending=False)
# Find the indices of both the highest AveSpend and AveRecency
ismax = ranks['Ave. Spend'].idxmax()
irmax = ranks['Ave. Recency'].idxmax()
# Get the overall ranking for every row other than these... add 1 to offset for excluding the max-AveSpend row:
ranks['overall_rank'] = 1 + ranks.drop(index = [ismax,irmax]) [['r_rank','f_rank','m_rank','s_rank']].sum(axis=1).rank(method='first')
# (Note: in .loc[], can't mix indices (ismax) with column-names)
ranks.loc[ ranks['Ave. Spend'].idxmax(), 'overall_rank' ] = 1
ranks.loc[ ranks['Ave. Recency'].idxmax(), 'overall_rank' ] = len(ranks)
这是提取数据的样板:
import pandas as pd
from io import StringIO
# """Cluster No. Customers| Ave. Recency| Ave. Frequency| Ave. Monetary| Ave. Spend|
dat = """
0 145118 39.0197 1.7264 14,971.85 8,672.07
1 2 47.0 19.0 237,270.00 12,487.89
2 1236 15.9716 24.9101 126,992.79 5,098.02
3 219847 41.9736 3.0682 17,701.64 5,769.23
4 9837 23.9330 3.2735 172,642.35 52,738.42
5 64865 24.8281 1.8599 13,159.21 7,075.19
6 3855 26.5647 3.9304 54,333.56 13,823.64
7 219549 17.7493 3.3356 17,570.67 5,267.52
8 34171 23.5205 9.1703 42,136.68 4,594.89
9 3924120 24.7933 1.1684 4,754.76 4,069.21 """
# Remove the comma thousands-separator, to prevent your floats being read in as string
dat = dat.replace(',', '')
ranks = pd.read_csv(StringIO(dat), sep='\s+', names=
"Cluster|No. Customers|Ave. Recency|Ave. Frequency|Ave. Monetary|Ave. Spend".split('|'))