I have a DataFrame like this:
import pandas as pd
import numpy as np

user = pd.DataFrame({
    'User': ['101','101','101','102','102','101','101','102','102','102','102','102'],
    'Country': ['India','Japan','India','Brazil','Japan','UK','Austria','Japan','Singapore','UK','UK','UK'],
    'Count': [85,78,70,5,6,8,60,30,5,6,5,4],
})
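I want to sort by the Count column, then assign the top 30% of rows to group 3, the next 30% of rows to group 2, and the remaining 40% of rows to group 1. How can I do this? My expected output is the first 4 columns; see also my calculation of how I divide the rows into 30%, 30%, 40%.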
Answer 0 (score: 1)
You first need to sort the rows with sort_values, then use groupby with a custom function that calls numpy.split and returns the length of each piece as a row of a new DataFrame:
The idea comes from MaxU's excellent answer, thank you.
For the top 30-30-30 split:
user = user.sort_values(['User','Count'], ascending=[True, False])

def f(x):
    # split into 4 pieces, because 0.3 + 0.3 + 0.3 != 1; the last piece (d) is the leftover
    a, b, c, d = np.split(x, [int(.3*len(x)), int(.6*len(x)), int(.9*len(x))])
    return pd.Series([len(a), len(b), len(c)], index=['30','30','30'])

df = user.groupby('User').apply(f)
df['sum'] = df.sum(axis=1)
print (df)
      30  30  30  sum
User
101    1   2   1    4
102    2   2   2    6
And for the 30-30-40 split:
user = user.sort_values(['User','Count'], ascending=[True, False])

def f(x):
    # split into 3 pieces, because 0.3 + 0.3 + 0.4 == 1
    a, b, c = np.split(x, [int(.3*len(x)), int(.6*len(x))])
    return pd.Series([len(a), len(b), len(c)], index=['30','30','40'])

df = user.groupby('User').apply(f)
df['sum'] = df.sum(axis=1)
print (df)
      30  30  40  sum
User
101    1   2   2    5
102    2   2   3    7
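The two variants differ only in their cut points. As a minimal sketch of the arithmetic on a plain NumPy array (the helper name `piece_sizes` is hypothetical and not part of the original answer), the piece sizes for users with 5 and 7 rows can be reproduced like this:

```python
import numpy as np

def piece_sizes(n_rows, fractions):
    """Sizes of the pieces np.split yields when cutting at int(f * n_rows)."""
    cuts = [int(f * n_rows) for f in fractions]   # int() truncates toward zero
    pieces = np.split(np.arange(n_rows), cuts)    # len(cuts) + 1 pieces
    return [len(p) for p in pieces]

# 30-30-30: three cut points, the fourth piece is the leftover rows
print(piece_sizes(5, [0.3, 0.6, 0.9]))  # [1, 2, 1, 1]
print(piece_sizes(7, [0.3, 0.6, 0.9]))  # [2, 2, 2, 1]

# 30-30-40: two cut points, the last piece takes everything that remains
print(piece_sizes(5, [0.3, 0.6]))       # [1, 2, 2]
print(piece_sizes(7, [0.3, 0.6]))       # [2, 2, 3]
```

Because `int()` truncates, small groups will not hit the percentages exactly; the last piece absorbs whatever rows are left over.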
Edit:
The group labels can be created with a list comprehension:
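def f(x):
    # cut the group's index labels into the 30% / 30% / 40% pieces
    a, b, c = np.split(x.index, [int(.3*len(x)), int(.6*len(x))])
    L = [a,b,c]
    # label the pieces 3, 2, 1 and repeat each label once per row in its piece
    return [i for i, y in zip(range(len(L),0,-1) ,L) for j in y]

user['Groups'] = user.groupby('User')['User'].transform(f)
print (user)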
   User    Country  Count  Groups
0   101      India     85       3
1   101      Japan     78       2
2   101      India     70       2
6   101    Austria     60       1
5   101         UK      8       1
7   102      Japan     30       3
4   102      Japan      6       3
9   102         UK      6       2
3   102     Brazil      5       2
8   102  Singapore      5       1
10  102         UK      5       1
11  102         UK      4       1
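As an alternative sketch (a swapped-in technique, not the method from the answer above), the same labels can be derived from each row's position within its user, without calling np.split. The function name `assign_groups` and its parameters are hypothetical; the cut points use the same `int()` truncation as above.

```python
import numpy as np
import pandas as pd

user = pd.DataFrame({
    'User': ['101','101','101','102','102','101','101','102','102','102','102','102'],
    'Country': ['India','Japan','India','Brazil','Japan','UK','Austria','Japan','Singapore','UK','UK','UK'],
    'Count': [85,78,70,5,6,8,60,30,5,6,5,4],
})

def assign_groups(df, key='User', value='Count'):
    """Label the top 30% of rows per key as 3, the next 30% as 2, the rest as 1."""
    out = df.sort_values([key, value], ascending=[True, False]).copy()
    pos = out.groupby(key).cumcount()                  # 0-based position within each key
    size = out.groupby(key)[value].transform('size')   # rows per key, repeated on every row
    first_cut = (0.3 * size).astype(int)               # truncates like int(.3*len(x))
    second_cut = (0.6 * size).astype(int)
    # the first matching condition wins, so rows below first_cut get 3, not 2
    out['Groups'] = np.select([pos < first_cut, pos < second_cut], [3, 2], default=1)
    return out

result = assign_groups(user)
print(result)
```

This keeps everything vectorized and returns a new DataFrame instead of modifying `user` in place.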