My DataFrame looks something like this:
import pandas as pd
entry_1 = pd.Series({'State': 'State1', 'County': 'name1', 'Population': 10})
entry_2 = pd.Series({'State': 'State1', 'County': 'name12', 'Population': 8})
entry_3 = pd.Series({'State': 'State1', 'County': 'name13', 'Population': 7})
entry_4 = pd.Series({'State': 'State1', 'County': 'name14', 'Population': 6})
entry_5 = pd.Series({'State': 'State2', 'County': 'name15', 'Population': 10})
entry_6 = pd.Series({'State': 'State2', 'County': 'name16', 'Population': 8})
entry_7 = pd.Series({'State': 'State2', 'County': 'name17', 'Population': 7})
entry_8 = pd.Series({'State': 'State2', 'County': 'name18', 'Population': 6})
entry_9 = pd.Series({'State': 'State3', 'County': 'name19', 'Population': 10})
entry_10 = pd.Series({'State': 'State3', 'County': 'name10', 'Population':8})
entry_11 = pd.Series({'State': 'State3', 'County': 'name11', 'Population':7})
entry_12 = pd.Series({'State': 'State3', 'County': 'name12', 'Population':6})
entry_13 = pd.Series({'State': 'State4', 'County': 'name13', 'Population':1})
entry_14 = pd.Series({'State': 'State4', 'County': 'name14', 'Population':2})
entry_15 = pd.Series({'State': 'State4', 'County': 'name15', 'Population':3})
df = pd.DataFrame([
entry_1, entry_2,entry_3,entry_4,entry_5,entry_6,entry_7,
entry_8,entry_9,entry_10,entry_11,entry_12, entry_13, entry_14, entry_15])
df.head()
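Using only the three most populous counties in each state, I need to find the three most populous states, ordered from highest to lowest population.
I made an attempt that I thought made sense, but it does not give the expected result. I can't reproduce the problem with the sample code, so I suspect there may be missing values, or something similar, that throws off the calculation.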
df['SUM_OF_TOP'] = df.groupby('State')['Population'].nlargest(3).sum(level=1)
largest_States = df['SUM_OF_TOP'].nlargest(3).index
[df.loc[idx]['State'] for idx in largest_States]
>>> ['State1', 'State2', 'State3']
Any suggestions about what might be going wrong? I'm just getting started with pandas, so I'm pretty lost...
Answer 0 (score: 3)
IIUC, you can do it this way:
In [69]: df.loc[8, 'Population'] = 11
In [70]: df.loc[[5,6], 'Population'] = 9
In [71]: df
Out[71]:
    County  Population   State
0    name1          10  State1
1   name12           8  State1
2   name13           7  State1
3   name14           6  State1
4   name15          10  State2
5   name16           9  State2   # changed Population value: 8 --> 9
6   name17           9  State2   # changed Population value: 7 --> 9
7   name18           6  State2
8   name19          11  State3   # changed Population value: 10 --> 11
9   name10           8  State3
10  name11           7  State3
11  name12           6  State3
12  name13           1  State4
13  name14           2  State4
14  name15           3  State4
In [72]: df.groupby('State')['Population'].nlargest(3).sum(level=0).nlargest(3).index.tolist()
Out[72]: ['State2', 'State3', 'State1']
In [73]: df.groupby('State')['Population'].nlargest(3).sum(level=0)
Out[73]:
State
State1    25
State2    28
State3    26
State4     6
Name: Population, dtype: int64
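To see why the level matters here, a minimal sketch (the variable names `top3`, `by_state` and `by_row` are mine, not from the question): the grouped `nlargest(3)` returns a Series with a two-level index, where level 0 is the State and level 1 is the original row label. Summing over level 1, as in the question, groups by row label, so each "sum" is just a single county's population. The sketch uses `.groupby(level=...).sum()` so that it also runs on pandas versions where `Series.sum(level=...)` is no longer available.

```python
import pandas as pd

# Same populations as the modified frame above (counties omitted for brevity).
df = pd.DataFrame({
    "State": ["State1"] * 4 + ["State2"] * 4 + ["State3"] * 4 + ["State4"] * 3,
    "Population": [10, 8, 7, 6, 10, 9, 9, 6, 11, 8, 7, 6, 1, 2, 3],
})

# Two-level index: level 0 = State, level 1 = original row label.
top3 = df.groupby("State")["Population"].nlargest(3)

# Grouping on level 0 gives one total per state (what we want).
by_state = top3.groupby(level=0).sum()

# Grouping on level 1 gives one "total" per original row,
# i.e. the individual county populations, not state totals.
by_row = top3.groupby(level=1).sum()

print(by_state)
print(by_row)
```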
Or you can do it this way:
In [80]: df.groupby('State')['Population'].nlargest(3).sum(level=0).sort_values(ascending=0).head(3).index
Out[80]: Index(['State2', 'State3', 'State1'], dtype='object', name='State')
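A note for newer pandas: `Series.sum(level=...)` was deprecated in pandas 1.3 and removed in 2.0. As far as I know, the equivalent spelling is `.groupby(level=0).sum()`. A sketch of the same computation on the modified frame, with intermediate names (`top3_per_state`, `state_totals`, `top_states`) that I made up for readability:

```python
import pandas as pd

# Modified sample data from this answer.
df = pd.DataFrame({
    "State": ["State1"] * 4 + ["State2"] * 4 + ["State3"] * 4 + ["State4"] * 3,
    "County": ["name1", "name12", "name13", "name14",
               "name15", "name16", "name17", "name18",
               "name19", "name10", "name11", "name12",
               "name13", "name14", "name15"],
    "Population": [10, 8, 7, 6, 10, 9, 9, 6, 11, 8, 7, 6, 1, 2, 3],
})

# Three largest county populations within each state.
top3_per_state = df.groupby("State")["Population"].nlargest(3)

# Replacement for .sum(level=0): group on the State level and sum.
state_totals = top3_per_state.groupby(level=0).sum()

# Three largest totals, highest first.
top_states = state_totals.nlargest(3).index.tolist()
print(top_states)
```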
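Answer 1 (score: 2)
This was my answer to this question on Code Review:
Definitely use nlargest. The advantage of nlargest is that it performs a partial sort in linear time.
However, you don't want to groupby twice. So we'll define a helper function, which lets us groupby only once.
I use .values a lot to access the underlying numpy objects. This gives a slight efficiency gain.
Finally, I don't like reaching into the outer scope for names without a reason, so I pass the DataFrame in as an argument.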
import numpy as np
import pandas as pd

def answer_six(df):
    # subset df to things I care about (SUMLEV == 50 keeps county-level rows)
    sumlev = df.SUMLEV.values == 50
    data = df[['CENSUS2010POP', 'STNAME', 'CTYNAME']].values[sumlev]

    # build a pandas series with State and County in the index
    # values are from CENSUS2010POP
    s = pd.Series(data[:, 0], [data[:, 1], data[:, 2]], dtype=np.int64)

    # define a function that does the nlargest and sum in one
    # otherwise you'd have to do a second groupby
    def sum_largest(x, n=3):
        return x.nlargest(n).sum()

    return s.groupby(level=0).apply(sum_largest).nlargest(3).index.tolist()
Demo
answer_six(census_df)
['California', 'Texas', 'Illinois']
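The same single-groupby idea can be sketched against the small State/County/Population frame from the question. This is an adaptation, not the code above: the function name `top_states_by_top_counties` and its parameters are hypothetical, it skips the `.values` optimization, and it uses the populations as modified in the other answer so that the three totals are distinct.

```python
import pandas as pd

def top_states_by_top_counties(df, n_counties=3, n_states=3):
    """Rank states by the summed population of their n_counties largest counties."""
    # nlargest and sum in one helper, so only one groupby is needed
    def sum_largest(x, n=n_counties):
        return x.nlargest(n).sum()

    per_state = df.groupby("State")["Population"].apply(sum_largest)
    return per_state.nlargest(n_states).index.tolist()

# Sample data with the modified populations (State2 = 10, 9, 9, 6; State3 = 11, 8, 7, 6).
sample = pd.DataFrame({
    "State": ["State1"] * 4 + ["State2"] * 4 + ["State3"] * 4 + ["State4"] * 3,
    "Population": [10, 8, 7, 6, 10, 9, 9, 6, 11, 8, 7, 6, 1, 2, 3],
})

result = top_states_by_top_counties(sample)
print(result)
```

Passing the frame in as an argument keeps the helper reusable on any frame that has `State` and `Population` columns.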