Grouping values into groups with a minimum size using pandas

Time: 2014-01-14 13:58:48

Tags: python pandas

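I am trying to sort a sample of observations into n discrete groups and then merge those groups until each subgroup has at least 6 members. So far I have generated the bins and grouped my DataFrame into them: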

# df is a DataFrame containing 135 measurements
bins = np.linspace(df.heights.min(), df.heights.max(), 21)
grp = df.groupby(np.digitize(df.heights, bins))
grp.size()

1      4
2      1
3      2
4      3
5      2
6      8
7      7
8      6
9     19
10    12
11    13
12    12
13     7
14    12
15    12
16     2
17     3
18     6
19     3
21     1

So I can see that I need to combine groups 1-3, 3-5 and 16-21 while leaving the other groups intact, but I don't know how to do this programmatically.

1 answer:

Answer 0: (score: 2)

You can do it like this:

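import numpy as np
import pandas as pd

# random_integers is deprecated in NumPy; randint's upper bound is exclusive
df = pd.DataFrame(np.random.randint(1, 201, 135), columns=['heights'])
bins = np.linspace(df.heights.min(), df.heights.max(), 21)
grp = df.groupby(np.digitize(df.heights, bins))
sizes = grp.size()

def f(vals, limit):
    # Yield a group number for each bin size, starting a new group
    # whenever the running total would exceed limit.
    total = 0
    group = 1
    for v in vals:
        total += v
        if total <= limit:
            yield group
        else:
            group += 1
            total = v
            yield group

# I've changed 6 to 30 for the example because I don't have your original dataset
sizes.groupby(list(f(sizes, 30)))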

If you run print(grp.size().groupby([g for g in f(sizes, 30)]).cumsum()), you will see that the cumulative sums are grouped as expected.
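
Note that the generator above caps each group's running total at the threshold, so a group can end up smaller than it. If a floor is wanted instead (at least 6 members per group, as in the question), one possible variant is sketched below. The helper name merge_small_bins and the choice to fold a too-small trailing remainder into the previous group are assumptions made for illustration, not part of the original answer. The sizes are the ones printed in the question.

```python
import pandas as pd

def merge_small_bins(sizes, min_size):
    """Map each bin label to a merged-group id so that every merged
    group holds at least min_size members (hypothetical helper)."""
    labels = {}
    group, running = 1, 0
    for bin_label, count in sizes.items():
        labels[bin_label] = group
        running += count
        if running >= min_size:
            # The current group is large enough: start a new one.
            group += 1
            running = 0
    # Bins left over at the end form a group that is too small,
    # so fold them into the previous group.
    if running > 0 and group > 1:
        for bin_label, g in list(labels.items()):
            if g == group:
                labels[bin_label] = group - 1
    return pd.Series(labels)

# Bin sizes copied from the question's grp.size() output (bin 20 is empty).
sizes = pd.Series(
    [4, 1, 2, 3, 2, 8, 7, 6, 19, 12, 13, 12, 7, 12, 12, 2, 3, 6, 3, 1],
    index=list(range(1, 20)) + [21],
)
merged = merge_small_bins(sizes, 6)
merged_sizes = sizes.groupby(merged).sum()
print(merged_sizes)
```

With this greedy left-to-right pass, bins 1-3 are merged, bins 4-6 are merged, and bins 16-21 end up in a single trailing group. Other merge orders are possible and may give more balanced groups.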

Also, if you want to group the original values, you can do the following:

dat = np.random.randint(0, 201, 135)  # overwritten by the fixed sample below
dat = np.array([78,116,146,111,147,78,14,91,196,92,163,144,107,182,58,89,77,134,
83,126,94,70,121,175,174,88,90,42,93,131,91,175,135,8,142,166,
1,112,25,34,119,13,95,182,178,200,97,8,60,189,49,94,191,81,
56,131,30,107,16,48,58,65,78,8,0,11,45,179,151,130,35,64,
143,33,49,25,139,20,53,55,20,3,63,119,153,14,81,93,62,162,
46,29,84,4,186,66,90,174,55,48,172,83,173,167,66,4,197,175,
184,20,23,161,70,153,173,127,51,186,114,27,177,96,93,105,169,158,
83,155,161,29,197,143,122,72,60])
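df = pd.DataFrame({'heights': dat})
bins = np.digitize(dat, np.linspace(0, 200, 21))
# group_keys=False keeps the original row index on the result of apply
grp = df.heights.groupby(bins, group_keys=False)

m = 15  # you should put 6 here, the minimum
s = 0   # rows collected so far in the current merged group
c = 1   # number of the current merged group
def f(x):
    # Relies on being called exactly once per bin, in bin order.
    global c, s
    res = pd.Series([c] * x.size, index=x.index)
    s += x.size
    if s > m:  # closes a group once it holds more than m rows; use >= to allow exactly m
        s = 0
        c += 1
    return res
g = grp.apply(f)
print(df.groupby(g).size())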

# another way of doing the same, just a matter of taste

m = 15  # you should put 6 here, the minimum
s = 0
c = 1
def f2(x):
    global c, s
    res = [c] * x.size  # here is the main difference with f
    s += x.size
    if s > m:
        s = 0
        c += 1
    return res

g = grp.transform(f2)  # call it this way
print(df.groupby(g).size())
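
Both versions above depend on module-level counters and on the function being called exactly once per bin, which can be fragile. As a rough alternative, the same idea can be written without global state by first counting rows per bin and then mapping each bin to a merged group. The function name min_size_labels, the seeded random sample, and the rule of folding a too-small trailing remainder into the previous group are all assumptions made for this sketch.

```python
import numpy as np
import pandas as pd

def min_size_labels(values, edges, min_size):
    """Return one merged-group label per row. Adjacent bins are merged
    until each group has at least min_size rows (hypothetical helper)."""
    bin_ids = pd.Series(np.digitize(values.to_numpy(), edges), index=values.index)
    counts = bin_ids.value_counts().sort_index()  # rows per non-empty bin, in bin order
    group_of_bin = {}
    group, running = 1, 0
    for bin_id, n in counts.items():
        group_of_bin[bin_id] = group
        running += n
        if running >= min_size:
            group, running = group + 1, 0
    labels = bin_ids.map(group_of_bin)
    # A too-small trailing remainder is folded into the previous group.
    if running > 0 and group > 1:
        labels = labels.mask(labels == group, group - 1)
    return labels

# Stand-in data: 135 random heights between 0 and 200 (seeded for repeatability).
rng = np.random.default_rng(0)
df = pd.DataFrame({"heights": rng.integers(0, 201, 135)})
edges = np.linspace(0, 200, 21)

df["group"] = min_size_labels(df["heights"], edges, 6)
group_sizes = df.groupby("group").size()
print(group_sizes)
```

Because the labels are derived from the bin counts rather than from call order, the result does not depend on how often pandas invokes a callback.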