Question

Here is my code:

from pandas import DataFrame, Series
import pandas as pd
import numpy as np
income = DataFrame({'name': ['Adam', 'Bill', 'Chris', 'Dave', 'Edison', 'Frank'],
                    'age': [22, 24, 31, 45, 51, 55],
                    'income': [1000, 2500, 1200, 1500, 1300, 1600],
                    })
ageBin = pd.cut(income.age, [20, 30, 40, 50, 60])
grouped = income.groupby([ageBin])
highestIncome = income.ix[grouped.income.idxmax()]

I have a DataFrame with name, age, and income, as shown below:

index   age income  name
0   22  1000    Adam
1   24  2500    Bill
2   31  1200    Chris
3   45  1500    Dave
4   51  1300    Edison
5   55  1600    Frank

I want to group the data by age bin and collect the record with the highest income in each bin. The code above works, and highestIncome is:

index   age income  name
1   24  2500    Bill
2   31  1200    Chris
3   45  1500    Dave
5   55  1600    Frank

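However, if I remove Chris's record, so that there are no records in the (30, 40] age range, I get a ValueError at grouped.income.idxmax(). I think this is because of the NaN in the grouped result, but I cannot find a way to solve the problem. Any input is appreciated.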

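Update: Thanks a lot for the answers. I do believe this is a bug in idxmax() on groupby objects. I would like to go with the agg(lambda x: x.idxmax()) approach, since I tested the speed of sort() vs agg(lambda x: x.idxmax()) on a synthetic dataset of 10 million rows. Here are the code and the output: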

from pandas import DataFrame, Series
import pandas as pd
import numpy as np
import time

testData = DataFrame({'key': np.random.randn(10000000),
                      'value': np.random.randn(10000000)})
keyBin = pd.cut(testData.key, 1000)

start = time.time()
grouped1 = testData.sort('value', ascending=False).groupby([keyBin])
highestValues1 = testData.ix[grouped1.head(1).index]
end = time.time()
print end - start

start = time.time()
grouped2 = testData.groupby([keyBin])
highestValues2 = testData.ix[grouped2.value.agg(lambda x: x.idxmax())].dropna(how='all')
end = time.time()
print end - start
#validation
(highestValues1.sort() == highestValues2.sort()).all()

Output:

5.30953717232
1.0279238224

Out[47]:

key      True
value    True
dtype: bool
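
For readers on a current pandas release, where DataFrame.sort and .ix are no longer available, the comparison above could be sketched roughly as follows. This is an illustration rather than the original benchmark: the smaller row count, the fixed seed, the key_bin column name, and the use of observed=True to skip empty bins are all assumptions made here, and the timings will vary by machine and pandas version.

```python
import time

import numpy as np
import pandas as pd

# Assumption: a smaller synthetic dataset than the 10 million rows used above.
rng = np.random.default_rng(0)
n_rows = 100_000
test_data = pd.DataFrame({"key": rng.standard_normal(n_rows),
                          "value": rng.standard_normal(n_rows)})
# Hypothetical helper column holding the bin of each key.
test_data["key_bin"] = pd.cut(test_data["key"], 100)

# Approach 1: sort by value, then take the first row of every bin.
start = time.perf_counter()
highest_values1 = (test_data.sort_values("value", ascending=False)
                   .groupby("key_bin", observed=True)
                   .head(1))
print("sort + head:", time.perf_counter() - start)

# Approach 2: look up the row label of the maximum in every bin.
start = time.perf_counter()
idx = test_data.groupby("key_bin", observed=True)["value"].idxmax()
highest_values2 = test_data.loc[idx]
print("idxmax:", time.perf_counter() - start)

# Validation: both approaches should select the same rows.
same_rows = highest_values1.sort_index().equals(highest_values2.sort_index())
print(same_rows)
```

Both variants return whole rows, so the validation compares the two frames after putting them in the same index order.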

Answer 1

grouped['income'].agg(lambda x : x.idxmax())


Out[]:
age
(20, 30]     1
(30, 40]   NaN
(40, 50]     2
(50, 60]     4
Name: income, dtype: float64

Then you can do the following to get the data (here result holds the Series shown above):

income.ix[result.values].dropna()
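
On a current pandas release, where .ix has been removed, the same idea might look like the sketch below. The data is the question's DataFrame without Chris's record. The len(x) guard inside the lambda and the explicit observed=False are assumptions added here so the sketch does not depend on how a given pandas version treats empty bins.

```python
import numpy as np
import pandas as pd

# The question's data with Chris removed, so the (30, 40] bin is empty.
income = pd.DataFrame({"name": ["Adam", "Bill", "Dave", "Edison", "Frank"],
                       "age": [22, 24, 45, 51, 55],
                       "income": [1000, 2500, 1500, 1300, 1600]})
age_bin = pd.cut(income["age"], [20, 30, 40, 50, 60])

# observed=False keeps the empty bin as a group, as in the output above.
grouped = income.groupby(age_bin, observed=False)

# Assumption: guard against empty groups, because idxmax on an empty
# Series may raise instead of returning NaN.
result = grouped["income"].agg(lambda x: x.idxmax() if len(x) else np.nan)

# Drop the empty bins, then select the matching rows by label.
highest_income = income.loc[result.dropna().astype(int)]
print(highest_income)
```

Passing observed=True instead leaves the empty bins out from the start, in which case grouped["income"].idxmax() can be used directly and no dropna step is needed.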

Answer 2

Since groupby preserves the order of rows within each group, you can sort income before the groupby. Then use head to get the first row of each group:

grouped=income.sort('income', ascending=False).groupby([ageBin])
highestIncome = income.ix[grouped.head(1).index]
#highestIncome is no longer ordered by age.
#If you want to recover this, sort it again.
highestIncome.sort('age', inplace=True)

As an aside, note that the reference manual does not mention that groupby will preserve order. I think the cleanest solution would be for pandas to fix idxmax on groupby objects. To me, it is a bit odd that idxmax works in some cases but not in this one.
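
The sort-then-head approach in this answer can be sketched for a current pandas release as follows, using sort_values in place of the removed DataFrame.sort. The data is the question's DataFrame without Chris's record; the age_bin helper column and the observed=True argument are assumptions made for this illustration.

```python
import pandas as pd

# The question's data with Chris removed, so the (30, 40] bin is empty.
income = pd.DataFrame({"name": ["Adam", "Bill", "Dave", "Edison", "Frank"],
                       "age": [22, 24, 45, 51, 55],
                       "income": [1000, 2500, 1500, 1300, 1600]})
# Hypothetical helper column holding each row's age bin.
income["age_bin"] = pd.cut(income["age"], [20, 30, 40, 50, 60])

highest_income = (income.sort_values("income", ascending=False)  # highest first
                  .groupby("age_bin", observed=True)             # skip empty bins
                  .head(1)                                       # first row per bin
                  .sort_values("age"))                           # restore age order
print(highest_income)
```

Because head(1) simply takes the first row it sees in each group, an empty bin contributes nothing and no NaN handling is required.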

Answer 3

Just apply a lambda function on the groups, like this:

grouped.apply(lambda x: x.max())
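
One thing to keep in mind with this approach is that max() reduces every column independently, so the values in a result row may come from different records. The small sketch below uses hypothetical data (not the question's) and calls max() on the groupby directly, which computes the same column-wise maximum, and compares it with selecting whole rows via idxmax.

```python
import pandas as pd

# Hypothetical data: both rows fall in the same group.
df = pd.DataFrame({"group": ["a", "a"],
                   "name": ["Zed", "Amy"],
                   "income": [100, 200]})

# Column-wise maximum: each column is reduced on its own.
col_max = df.groupby("group")[["name", "income"]].max()
print(col_max)

# Whole row with the highest income in each group.
top_rows = df.loc[df.groupby("group")["income"].idxmax()]
print(top_rows)
```

Here the column-wise result pairs the name Zed with the income 200, although that income belongs to Amy; the idxmax lookup returns Amy's record intact.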

idxmax() doesn't work on a SeriesGroupBy that contains NaN
