Given a pandas DataFrame df, I can do
df.groupby('Age').apply(lambda x: x['ReadingAbility'].mean())
to get the mean reading ability for each age group.
Now suppose I want the mean reading ability of everyone except those with Age=k.
I can do:
mu_other_ages = {}
for age in df['Age'].unique():
    mu_other_ages[age] = df[df['Age'] != age]['ReadingAbility'].mean()
This is more or less the inverse of groupby + apply. Is there a shortcut that achieves the same result more efficiently?
See the example below:
In [52]: d = pd.DataFrame([[1,10], [2,4],[1, 9], [2,3]], columns=['Age', 'ReadingAbility'])
In [53]: d
Out[53]:
   Age  ReadingAbility
0    1              10
1    2               4
2    1               9
3    2               3
In [54]: d.groupby('Age').apply(lambda x: x['ReadingAbility'].mean())
Out[54]:
Age
1    9.5
2    3.5
dtype: float64
In this case, with only 2 distinct age values, the result should simply be reversed: 1=3.5 and 2=9.5. With more classes, the value for Age=k should be the mean ReadingAbility of all rows whose Age is not k.
Answer 0 (score: 2)
You need:
a = (d.groupby('Age')
.apply(lambda x: d.loc[d['Age']!=x['Age'].iat[0], 'ReadingAbility'].mean()))
print (a)
Age
1 3.5
2 9.5
dtype: float64
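Another, much faster solution is to aggregate sum and size per group, subtract each group's values from the column totals with rsub, and finally divide: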
np.random.seed(45)
d = pd.DataFrame(np.random.randint(10, size=(10, 2)), columns=['Age', 'ReadingAbility'])
print (d)
   Age  ReadingAbility
0    3               0
1    5               3
2    4               9
3    8               1
4    5               9
5    6               8
6    7               8
7    5               2
8    8               1
9    6               4
a = (d.groupby('Age')
.apply(lambda x: d.loc[d['Age']!=x['Age'].iat[0], 'ReadingAbility'].mean()))
print (a)
Age
3 5.000000
4 4.000000
5 4.428571
6 4.125000
7 4.111111
8 5.375000
c = d.groupby('Age')['ReadingAbility'].agg(['size','sum'])
print (c)
     size  sum
Age
3       1    0
4       1    9
5       3   14
6       2   12
7       1    8
8       2    2
e = c.rsub(c.sum())
e = e['sum'] / e['size']
print (e)
Age
3 5.000000
4 4.000000
5 4.428571
6 4.125000
7 4.111111
8 5.375000
dtype: float64
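The sum/size trick works because the mean over everything outside group k equals (total sum - sum of group k) / (total count - count of group k). Below is a minimal sketch that wraps this in a function and compares it with the brute-force loop from the question. The name mean_excluding_each_group is a hypothetical helper chosen for illustration, not a pandas API.

```python
import pandas as pd


def mean_excluding_each_group(frame, key, value):
    """For each group k of `key`, return the mean of `value` over rows NOT in k.

    Hypothetical helper for illustration; not part of pandas.
    """
    stats = frame.groupby(key)[value].agg(['size', 'sum'])
    # Column totals minus each group's own row = contribution of all other groups.
    others = stats.rsub(stats.sum())
    return others['sum'] / others['size']


d = pd.DataFrame([[1, 10], [2, 4], [1, 9], [2, 3]], columns=['Age', 'ReadingAbility'])
result = mean_excluding_each_group(d, 'Age', 'ReadingAbility')

# Brute-force reference, following the loop in the question.
expected = {age: d.loc[d['Age'] != age, 'ReadingAbility'].mean()
            for age in d['Age'].unique()}
print(result)
```

Note that a frame with only one distinct key has no "other" rows, so the division would be 0/0 for that group; a caller may want to handle that case explicitly.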
Timings:
np.random.seed(45)
N = 100000
d = pd.DataFrame(np.random.randint(1000, size=(N, 2)), columns=['Age', 'ReadingAbility'])
#print (d)
In [30]: %timeit (d.groupby('Age').apply(lambda x: d.loc[d['Age']!=x['Age'].iat[0], 'ReadingAbility'].mean()))
1 loop, best of 3: 1.27 s per loop
In [31]: %%timeit
...: c = d.groupby('Age')['ReadingAbility'].agg(['size','sum'])
...: #print (c)
...: e = c.sub(c.sum())
...: e = e['sum'] / e['size']
...:
100 loops, best of 3: 6.28 ms per loop
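The timings above rely on IPython's %timeit. As a rough sketch for a plain Python script, the same comparison can be made with the standard timeit module. The function names slow and fast are hypothetical, N is reduced (an assumption, to keep the slow variant short), and the slow variant uses the question's loop rather than groupby + apply, so absolute numbers will differ from those quoted above.

```python
import timeit

import numpy as np
import pandas as pd

np.random.seed(45)
N = 10_000  # assumption: smaller than the N = 100000 used above
d = pd.DataFrame(np.random.randint(1000, size=(N, 2)), columns=['Age', 'ReadingAbility'])


def slow():
    # One boolean-mask pass over the whole frame per distinct Age.
    ages = d['Age'].unique()
    means = {k: d.loc[d['Age'] != k, 'ReadingAbility'].mean() for k in ages}
    return pd.Series(means).sort_index()


def fast():
    # A single groupby aggregation, then arithmetic on the small summary table.
    c = d.groupby('Age')['ReadingAbility'].agg(['size', 'sum'])
    e = c.rsub(c.sum())
    return e['sum'] / e['size']


# Both variants should agree before their speed is compared.
assert np.allclose(slow().to_numpy(), fast().to_numpy())
print('slow (s per call):', timeit.timeit(slow, number=2) / 2)
print('fast (s per call):', timeit.timeit(fast, number=20) / 20)
```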
Answer 1 (score: 1)
d.groupby("Age")['ReadingAbility'].mean()
gets the mean of each group. You can filter out Age = 1 by adding a query such as
d.groupby("Age")['ReadingAbility'].mean().reset_index().query("Age != 1")
or, on older pandas versions (Series.select was deprecated in 0.21 and removed in 1.0):
d.groupby("Age")['ReadingAbility'].mean().select(lambda x: x != 1, axis=0)
Alternatively, as Merkle Daamgard pointed out, you can first filter out the values you don't need and then apply groupby and mean:
d.query("Age != 1").groupby("Age")['ReadingAbility'].mean()
d.loc[d.Age != 1].groupby("Age")['ReadingAbility'].mean()
d.where(d.Age != 1).groupby("Age")['ReadingAbility'].mean()
See GroupBy.mean for details.
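As a small sketch of the two orderings described in this answer (filter after aggregating versus filter before aggregating), the snippet below uses the 4-row example frame from the question and checks that both give the same result. It uses a boolean mask on the index in place of select; that substitution is an assumption made so the code runs on recent pandas versions.

```python
import pandas as pd

d = pd.DataFrame([[1, 10], [2, 4], [1, 9], [2, 3]], columns=['Age', 'ReadingAbility'])

# Filter after aggregating: compute every group mean, then drop the Age == 1 entry.
means = d.groupby('Age')['ReadingAbility'].mean()
after = means.loc[means.index != 1]

# Filter before aggregating: drop the Age == 1 rows, then group what is left.
before = d.loc[d['Age'] != 1].groupby('Age')['ReadingAbility'].mean()

print(after)
print(before)
```

Either way, each remaining age keeps its own separate mean; the groups are not pooled into one overall mean.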
Answer 2 (score: 0)
I think you can simply select for that:
df = pd.DataFrame([[1,10], [2,4],[1, 9], [2,3]], columns=['Age', 'ReadingAbility'])
res = df.loc[df['Age'] != 1].groupby('Age').apply(lambda x: x['ReadingAbility'].mean())
print(res)
Returns:
Age
2    3.5
dtype: float64