groupby: apply an operation over all the other keys

Date: 2018-01-10 13:39:08

Tags: python pandas group-by

Given a pandas DataFrame df, I can run df.groupby('Age').apply(lambda x: x['ReadingAbility'].mean()) to get the mean reading ability for each age group.

Now suppose I want the mean reading ability over all age groups except Age=k.

I can do:

mu_other_ages = {}
for age in df['Age'].unique():
    mu_other_ages[age] = df[df['Age'] != age]['ReadingAbility'].mean()

This is, in a sense, the opposite of groupby + apply. Is there a shortcut that achieves the same result more efficiently?

See the following example:

In [52]: d = pd.DataFrame([[1,10], [2,4],[1, 9], [2,3]], columns=['Age', 'ReadingAbility'])

In [53]: d
Out[53]:
   Age  ReadingAbility
0    1              10
1    2               4
2    1               9
3    2               3

In [54]: d.groupby('Age').apply(lambda x: x['ReadingAbility'].mean())
Out[54]:
Age
1    9.5
2    3.5
dtype: float64

In this case there are only 2 distinct age values, so the result should simply be reversed. With more classes, the value for Age=k should be the mean reading ability over all rows belonging to the other ages.

To clarify the expected result for this example: 1=3.5 and 2=9.5.
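As a quick check of the expected result, the loop from the question can be run on the four-row example frame. This is only a sketch of the baseline being asked about, written as a dict comprehension; the `int(...)` and `float(...)` conversions are added here just to get plain Python values when printing.

```python
import pandas as pd

# Same four-row frame as in the example.
d = pd.DataFrame([[1, 10], [2, 4], [1, 9], [2, 3]],
                 columns=['Age', 'ReadingAbility'])

# For each age k, average ReadingAbility over every row whose Age is not k.
mu_other_ages = {
    int(age): float(d.loc[d['Age'] != age, 'ReadingAbility'].mean())
    for age in d['Age'].unique()
}
print(mu_other_ages)  # {1: 3.5, 2: 9.5}
```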

3 Answers:

Answer 0 (score: 2)

You need:

a = (d.groupby('Age')
      .apply(lambda x: d.loc[d['Age']!=x['Age'].iat[0], 'ReadingAbility'].mean()))

print (a)
Age
1    3.5
2    9.5
dtype: float64

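Another, much faster solution is to aggregate sum and size for each group, then subtract each group's values from the totals of both columns (rsub in the code below). Finally, divide: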

import numpy as np
import pandas as pd

np.random.seed(45)
d = pd.DataFrame(np.random.randint(10, size=(10, 2)), columns=['Age', 'ReadingAbility'])
print (d)
   Age  ReadingAbility
0    3               0
1    5               3
2    4               9
3    8               1
4    5               9
5    6               8
6    7               8
7    5               2
8    8               1
9    6               4

a = (d.groupby('Age')
      .apply(lambda x: d.loc[d['Age']!=x['Age'].iat[0], 'ReadingAbility'].mean()))

print (a)
Age
3    5.000000
4    4.000000
5    4.428571
6    4.125000
7    4.111111
8    5.375000
dtype: float64

c = d.groupby('Age')['ReadingAbility'].agg(['size','sum'])
print (c)
     size  sum
Age           
3       1    0
4       1    9
5       3   14
6       2   12
7       1    8
8       2    2

e = c.rsub(c.sum())
e = e['sum'] / e['size']
print (e)
Age
3    5.000000
4    4.000000
5    4.428571
6    4.125000
7    4.111111
8    5.375000
dtype: float64
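The size/sum trick can be wrapped in a small reusable function. The sketch below is one possible packaging, not part of the original answer: the name `mean_excluding_each_group` and its parameters are hypothetical, and the frame is typed in from the printed table rather than regenerated from the random seed.

```python
import pandas as pd

def mean_excluding_each_group(df, key, value):
    """Hypothetical helper: for each group k, the mean of `value` over rows where key != k."""
    stats = df.groupby(key)[value].agg(['size', 'sum'])
    # Column totals minus each group's own size and sum = size and sum of "everything else".
    rest = stats.rsub(stats.sum())
    return rest['sum'] / rest['size']

# The ten rows printed above, entered by hand.
d = pd.DataFrame({
    'Age':            [3, 5, 4, 8, 5, 6, 7, 5, 8, 6],
    'ReadingAbility': [0, 3, 9, 1, 9, 8, 8, 2, 1, 4],
})

result = mean_excluding_each_group(d, 'Age', 'ReadingAbility')
print(result)
```

The arithmetic behind it: with total sum S and total row count N, the mean over all rows outside group k is (S - s_k) / (N - n_k), where s_k and n_k are that group's own sum and size. One grouped pass therefore yields every "all others" mean at once.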

Timings

np.random.seed(45)
N = 100000
d = pd.DataFrame(np.random.randint(1000, size=(N, 2)), columns=['Age', 'ReadingAbility']) 
#print (d)


In [30]: %timeit (d.groupby('Age').apply(lambda x: d.loc[d['Age']!=x['Age'].iat[0], 'ReadingAbility'].mean()))
1 loop, best of 3: 1.27 s per loop


In [31]: %%timeit
    ...: c = d.groupby('Age')['ReadingAbility'].agg(['size','sum'])
    ...: #print (c)
    ...: e = c.sub(c.sum())
    ...: e = e['sum'] / e['size']
    ...: 
100 loops, best of 3: 6.28 ms per loop
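Outside IPython, a comparison along the same lines can be approximated with the standard `timeit` module. The sketch below makes a few assumptions of its own: it uses the plain loop from the question as the slow baseline instead of groupby + apply, a smaller frame than the one timed above so it finishes quickly, and the hypothetical function names `with_loop` and `with_agg`. Absolute timings will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

np.random.seed(45)
N = 10000  # assumption: smaller than the 100000 rows timed above
d = pd.DataFrame(np.random.randint(100, size=(N, 2)),
                 columns=['Age', 'ReadingAbility'])

def with_loop():
    # Baseline: one boolean mask and one mean per distinct age.
    out = {age: d.loc[d['Age'] != age, 'ReadingAbility'].mean()
           for age in d['Age'].unique()}
    return pd.Series(out).sort_index()

def with_agg():
    # One grouped pass, then totals minus each group's own size and sum.
    c = d.groupby('Age')['ReadingAbility'].agg(['size', 'sum'])
    e = c.rsub(c.sum())
    return e['sum'] / e['size']

# Both variants should agree before their speed is compared.
same = np.allclose(with_loop().to_numpy(), with_agg().to_numpy())
t_loop = timeit.timeit(with_loop, number=3)
t_agg = timeit.timeit(with_agg, number=3)
print(f"results match: {same}")
print(f"loop: {t_loop:.4f} s, agg: {t_agg:.4f} s (3 runs each)")
```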

Answer 1 (score: 1)

d.groupby("Age")['ReadingAbility'].mean()

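returns the mean of each group. You can filter out Age = 1 by appending a query such as

d.groupby("Age")['ReadingAbility'].mean().reset_index().query("Age != 1")

# Series.select was removed in pandas 1.0; .loc with a callable selects the same index labels
d.groupby("Age")['ReadingAbility'].mean().loc[lambda s: s.index != 1]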

Or, as Merkle Daamgard pointed out, you can first filter out the values you don't need and then apply groupby and mean:

d.query("Age != 1").groupby("Age")['ReadingAbility'].mean()
d.loc[d.Age != 1].groupby("Age")['ReadingAbility'].mean()
d.where(d.Age != 1).groupby("Age")['ReadingAbility'].mean()

For details, see GroupBy.mean.
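A small check on the four-row example frame from the question suggests that filtering after taking the group means and filtering before grouping give the same Series. This is only an illustration; the variable names `after` and `before` are made up here.

```python
import pandas as pd

d = pd.DataFrame([[1, 10], [2, 4], [1, 9], [2, 3]],
                 columns=['Age', 'ReadingAbility'])

# Compute every group mean first, then drop the Age == 1 entry from the result.
after = d.groupby('Age')['ReadingAbility'].mean().loc[lambda s: s.index != 1]

# Drop the Age == 1 rows first, then compute the remaining group means.
before = d.loc[d['Age'] != 1].groupby('Age')['ReadingAbility'].mean()

print(after)
print(before)
```

Filtering first avoids computing a mean that is thrown away afterwards, which may matter on large frames.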

Answer 2 (score: 0)

I think you could go with this:

import pandas as pd

df = pd.DataFrame([[1,10], [2,4],[1, 9], [2,3]], columns=['Age', 'ReadingAbility'])
res = df.loc[df['Age'] != 1].groupby('Age').apply(lambda x: x['ReadingAbility'].mean())
print(res)

Returns:

Age
2    3.5
dtype: float64