pandas: top n within groups, and top n among those groups

Time: 2017-01-14 10:32:36

Tags: pandas

My dataframe looks something like this:

import pandas as pd
entry_1 = pd.Series({'State': 'State1', 'County': 'name1', 'Population': 10})
entry_2 = pd.Series({'State': 'State1', 'County': 'name12', 'Population': 8})
entry_3 = pd.Series({'State': 'State1', 'County': 'name13', 'Population': 7})
entry_4 = pd.Series({'State': 'State1', 'County': 'name14', 'Population': 6})
entry_5 = pd.Series({'State': 'State2', 'County': 'name15', 'Population': 10})
entry_6 = pd.Series({'State': 'State2', 'County': 'name16', 'Population': 8})
entry_7 = pd.Series({'State': 'State2', 'County': 'name17', 'Population': 7})
entry_8 = pd.Series({'State': 'State2', 'County': 'name18', 'Population': 6})
entry_9 = pd.Series({'State': 'State3', 'County': 'name19', 'Population': 10})
entry_10 = pd.Series({'State': 'State3', 'County': 'name10', 'Population':8})
entry_11 = pd.Series({'State': 'State3', 'County': 'name11', 'Population':7})
entry_12 = pd.Series({'State': 'State3', 'County': 'name12', 'Population':6})
entry_13 = pd.Series({'State': 'State4', 'County': 'name13', 'Population':1})
entry_14 = pd.Series({'State': 'State4', 'County': 'name14', 'Population':2})
entry_15 = pd.Series({'State': 'State4', 'County': 'name15', 'Population':3})
df = pd.DataFrame([
    entry_1, entry_2,entry_3,entry_4,entry_5,entry_6,entry_7,
    entry_8,entry_9,entry_10,entry_11,entry_12, entry_13, entry_14, entry_15])
df.head()
  

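Using only the three most populous counties in each state, I need to find the three most populous states, ordered from highest to lowest population.

I didn't get the expected result, although I made an attempt that I thought made sense. I can't reproduce the problem with the sample code, but I suspect that missing values, or something similar, are breaking the calculation.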

df['SUM_OF_TOP'] = df.groupby('State')['Population'].nlargest(3).sum(level=1)
largest_States = df['SUM_OF_TOP'].nlargest(3).index
[df.loc[idx]['State'] for idx in largest_States]
>>> ['State1', 'State2', 'State3']

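Any suggestions on what might be going wrong? I'm just getting started with pandas, so I'm pretty clueless...

2 answers:

Answer 0: (score: 3)

IIUC you can do it this way: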

In [69]: df.loc[8, 'Population'] = 11

In [70]: df.loc[[5,6], 'Population'] = 9

In [71]: df
Out[71]:
    County  Population   State
0    name1          10  State1
1   name12           8  State1
2   name13           7  State1
3   name14           6  State1
4   name15          10  State2
5   name16           9  State2  # changed Population value: 8 --> 9
6   name17           9  State2  # changed Population value: 7 --> 9
7   name18           6  State2
8   name19          11  State3  # changed Population value: 10 --> 11
9   name10           8  State3
10  name11           7  State3
11  name12           6  State3
12  name13           1  State4
13  name14           2  State4
14  name15           3  State4

In [72]: df.groupby('State')['Population'].nlargest(3).sum(level=0).nlargest(3).index.tolist()
Out[72]: ['State2', 'State3', 'State1']

In [73]: df.groupby('State')['Population'].nlargest(3).sum(level=0)
Out[73]:
State
State1    25
State2    28
State3    26
State4     6
Name: Population, dtype: int64

Or you can do it this way:

In [80]: df.groupby('State')['Population'].nlargest(3).sum(level=0).sort_values(ascending=0).head(3).index
Out[80]: Index(['State2', 'State3', 'State1'], dtype='object', name='State')
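
A note for readers on newer pandas: as far as I know, the `level` argument of `Series.sum` was deprecated in pandas 1.3 and removed in 2.0, so the one-liners above may raise an error there. A minimal sketch of the same idea, assuming the modified frame from `Out[71]` and using `groupby(level=0).sum()` in place of `sum(level=0)`. The variable names (`top3_per_state`, `state_totals`, `per_row`) are only illustrative:

```python
import pandas as pd

# Rebuild the modified frame shown in Out[71] (values copied from the output above)
df = pd.DataFrame({
    'State': ['State1'] * 4 + ['State2'] * 4 + ['State3'] * 4 + ['State4'] * 3,
    'County': ['name1', 'name12', 'name13', 'name14',
               'name15', 'name16', 'name17', 'name18',
               'name19', 'name10', 'name11', 'name12',
               'name13', 'name14', 'name15'],
    'Population': [10, 8, 7, 6, 10, 9, 9, 6, 11, 8, 7, 6, 1, 2, 3],
})

# Three largest county populations per state; the result is indexed by
# (State, original row label)
top3_per_state = df.groupby('State')['Population'].nlargest(3)

# Sum over index level 0 (State); stands in for the older .sum(level=0)
state_totals = top3_per_state.groupby(level=0).sum()

# Names of the three states with the largest totals, highest first
top_states = state_totals.nlargest(3).index.tolist()
print(top_states)

# For contrast: grouping on level 1 (the original row labels) produces one
# "sum" per selected row, not one per state
per_row = top3_per_state.groupby(level=1).sum()
print(len(per_row))
```

The contrast at the end is why the level matters: level 0 of the intermediate result holds the state names, while level 1 holds the original row labels, so summing over level 1 just returns each selected row's own population.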

Answer 1: (score: 2)

This was my answer to this question on Code Review:

SUMLEV is explained here

Definitely use nlargest.
The advantage of nlargest is that it performs a partial sort in linear time.
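
As a small illustration of what nlargest returns, here is a sketch on made-up values (the series `populations` is hypothetical and not taken from the census data). It only checks that nlargest gives the same rows as a full sort followed by head; it does not measure the performance claim:

```python
import pandas as pd

# Hypothetical county populations, deliberately unsorted
populations = pd.Series([6, 10, 7, 8],
                        index=['name14', 'name1', 'name13', 'name12'])

# Pick the three largest values without sorting the whole series first
top3 = populations.nlargest(3)

# Same rows obtained the longer way: full sort, then take the head
via_sort = populations.sort_values(ascending=False).head(3)

print(top3.tolist())
print(top3.equals(via_sort))
```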

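However, you don't want to groupby twice. So we'll define a helper function that lets us groupby only once.

I use .values heavily to access the underlying numpy objects. This gives a slight efficiency gain.

Finally, I don't want to reach for names in the outer scope needlessly, so I pass the dataframe in as an argument.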

import numpy as np
import pandas as pd

def answer_six(df):
    # subset df to things I care about
    sumlev = df.SUMLEV.values == 50
    data = df[['CENSUS2010POP', 'STNAME', 'CTYNAME']].values[sumlev]

    # build a pandas series with State and County in the index
    # values are from CENSUS2010POP
    s = pd.Series(data[:, 0], [data[:, 1], data[:, 2]], dtype=np.int64)

    # define a function that does the nlargest and sum in one
    # otherwise you'd have to do a second groupby
    def sum_largest(x, n=3):
        return x.nlargest(n).sum()

    return s.groupby(level=0).apply(sum_largest).nlargest(3).index.tolist()

Demo

answer_six(census_df)

['California', 'Texas', 'Illinois']
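
For readers without the census file, the same single-groupby idea can be tried on the small State/County/Population frame from the question. This is only a sketch under that assumption: the function name `top_states_by_top_counties` and its parameters `n_counties` and `n_states` are hypothetical, and the data uses the modified population values from the first answer so that the totals are not tied:

```python
import pandas as pd

def top_states_by_top_counties(df, n_counties=3, n_states=3):
    """Rank states by the summed population of their largest counties."""
    # nlargest and sum in one helper, so a single groupby is enough
    def sum_largest(x):
        return x.nlargest(n_counties).sum()

    return (df.groupby('State')['Population']
              .apply(sum_largest)
              .nlargest(n_states)
              .index.tolist())

# Sample frame with the modified values used in the first answer
sample = pd.DataFrame({
    'State': ['State1'] * 4 + ['State2'] * 4 + ['State3'] * 4 + ['State4'] * 3,
    'Population': [10, 8, 7, 6, 10, 9, 9, 6, 11, 8, 7, 6, 1, 2, 3],
})

result = top_states_by_top_counties(sample)
print(result)
```

Unlike answer_six, this version stays in pandas and does not drop down to .values, so it trades the small speed gain for brevity.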