Python: create a new column flagging the highest values (%) based on another column's values

Time: 2016-12-20 08:25:53

Tags: python sorting pandas

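I couldn't find a function that returns the top n% automatically, so I sorted by the largest and smallest values and used a manually calculated row count to get the top 25% and bottom 25% ranges. What I want is to create a flag in a new column that says, for example, that this customer's revenue is in the top 25%.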

from heapq import nsmallest
top_max = avg_cust_data.nlargest(10806, ['user_spendings'])
top_min = avg_cust_data.nsmallest(10806, ['user_spendings'])

avg_cust_data['spendings_flag'] = np.where(avg_cust_data['user_spendings'] = top_max, 'Top Max',
                                  np.where(avg_cust_data['user_spendings'] = top_min, 'Top Min', 'AVG'))

2 answers:

Answer 0 (score: 5)

You can use:

import numpy as np
import pandas as pd

# Sample data: 40 random spending values
np.random.seed(100)
avg_cust_data = pd.DataFrame(np.random.random((40,1)), columns=['user_spendings'])
print(avg_cust_data)


top_max = avg_cust_data['user_spendings'].nlargest(10)
top_min = avg_cust_data['user_spendings'].nsmallest(10)


# Flag rows by whether their index appears among the largest or smallest values
avg_cust_data['spendings_flag'] = np.where(
    avg_cust_data.index.isin(top_max.index), 'Top Max',
    np.where(avg_cust_data.index.isin(top_min.index), 'Top Min', 'AVG'))
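
The row count passed to nlargest and nsmallest above is hardcoded. As a minimal sketch, assuming the goal is a fixed share of rows at each end, the count can be derived from the frame length. The name `fraction` is a hypothetical parameter introduced for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Same sample data as in the answer: 40 random spending values
np.random.seed(100)
avg_cust_data = pd.DataFrame(np.random.random((40, 1)), columns=['user_spendings'])

fraction = 0.25  # hypothetical: share of rows to flag at each end
n = int(len(avg_cust_data) * fraction)  # rounds down when the product is not whole

top_max = avg_cust_data['user_spendings'].nlargest(n)
top_min = avg_cust_data['user_spendings'].nsmallest(n)

# Flag rows by index membership, as in the answer
avg_cust_data['spendings_flag'] = np.where(
    avg_cust_data.index.isin(top_max.index), 'Top Max',
    np.where(avg_cust_data.index.isin(top_min.index), 'Top Min', 'AVG'))

print(avg_cust_data['spendings_flag'].value_counts())
```

Matching on the index rather than on the values keeps exactly n rows in each group even if several customers share the same spending value.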

Another solution:

# describe() reports the 25% and 75% quantiles of the column
df1 = avg_cust_data.describe()
bottom_threshold = df1.loc['25%', 'user_spendings']
top_threshold = df1.loc['75%', 'user_spendings']
print(bottom_threshold)

avg_cust_data = avg_cust_data.sort_values('user_spendings')
# Values at or below the 25% quantile are 'Top Min'; at or above the 75% quantile, 'Top Max'
avg_cust_data['spendings_flag1'] = np.where(
    avg_cust_data['user_spendings'] <= bottom_threshold, 'Top Min',
    np.where(avg_cust_data['user_spendings'] >= top_threshold, 'Top Max', 'AVG'))


print (avg_cust_data)
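
describe() only reports the 25%, 50% and 75% quantiles by default. As a hedged sketch of the same threshold idea, Series.quantile accepts any fraction directly, which may be closer to the "top n%" the question asks about. The names `low_cut` and `high_cut` are illustrative and not from the original answer.

```python
import numpy as np
import pandas as pd

# Same sample data as in the answer: 40 random spending values
np.random.seed(100)
avg_cust_data = pd.DataFrame(np.random.random((40, 1)), columns=['user_spendings'])

spend = avg_cust_data['user_spendings']
low_cut = spend.quantile(0.25)   # change 0.25 / 0.75 to flag a different share
high_cut = spend.quantile(0.75)

# Same comparison logic as the describe()-based version
avg_cust_data['spendings_flag1'] = np.where(
    spend <= low_cut, 'Top Min',
    np.where(spend >= high_cut, 'Top Max', 'AVG'))

print(low_cut, high_cut)
print(avg_cust_data['spendings_flag1'].value_counts())
```

Unlike the nlargest approach, a threshold comparison puts all tied values in the same group, so group sizes can differ from an exact 25% when the data contains duplicates.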

Answer 1 (score: 2)

Use pd.qcut

np.random.seed([3,1415])
avg_cust_data = pd.DataFrame(np.random.random((16,1)), columns=['user_spendings'])
# Split into 4 equal-sized bins and label them Quartile 1 through Quartile 4
avg_cust_data['quartiles'] = pd.qcut(
    avg_cust_data.user_spendings, 4,
    labels=['Quartile %s' % i for i in range(1, 5)]
)
avg_cust_data


You can even customize the bin edges by percentile, along with the corresponding labels:

np.random.seed([3,1415])
avg_cust_data = pd.DataFrame(np.random.random((16,1)), columns=['user_spendings'])
# Bin edges given as quantiles: bottom 25%, middle 50%, top 25%
avg_cust_data['quartiles'] = pd.qcut(
    avg_cust_data.user_spendings, [0., .25, .75, 1.],
    labels=['Bottom 25%', 'Middle', 'Top 25%']
)
avg_cust_data
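
To see which value edges pd.qcut actually chose and how many rows fell into each bin, a minimal sketch using its retbins=True option follows. The seed here is arbitrary and differs from the answer's, so the numbers will not match the answer's output.

```python
import numpy as np
import pandas as pd

# Illustrative data: 16 random spending values (arbitrary seed)
np.random.seed(0)
avg_cust_data = pd.DataFrame(np.random.random((16, 1)), columns=['user_spendings'])

# retbins=True returns the computed bin edges alongside the labels
quartiles, edges = pd.qcut(
    avg_cust_data['user_spendings'], [0., .25, .75, 1.],
    labels=['Bottom 25%', 'Middle', 'Top 25%'],
    retbins=True
)
avg_cust_data['quartiles'] = quartiles

print(edges)  # minimum, 25% quantile, 75% quantile, maximum
print(avg_cust_data['quartiles'].value_counts(sort=False))
```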
