Data:
    Country                           ITEM_TYPE        Ship Date   TOTAL_COST
 0  Chad                              Office Supplies  2/12/2011   2353920.64
 1  Latvia                            Beverages        1/23/2016     34174.25
 2  Pakistan                          Vegetables       2/1/2011     592408.95
 3  Democratic Republic of the Congo  Household        10/6/2012   3861014.82
 4  Czech Republic                    Beverages        12/5/2015    110978.89
 5  South Africa                      Beverages        8/21/2012    314085.20
 6  Laos                              Vegetables       3/20/2011    438737.25
 7  China                             Baby Food        5/12/2017    530868.60
 8  Eritrea                           Meat             1/10/2015    886561.39
 9  Haiti                             Office Supplies  7/20/2015   3253177.12
10  Zambia                            Cereal           8/24/2016     84787.64
I want to return all records where the TOTAL_COST is greater than 2 standard deviations for that record's specific ITEM_TYPE and COUNTRY combination.
I get the standard deviation for each combination with:
stdevs = data.groupby(['Country', 'ITEM_TYPE'])['TOTAL_COST'].std()
My first attempt was:
results = data[data['Total Cost'] > 2*stdevs[data[['Country']][data['Item Type']]]]
URL for the dummy data used (10,000 records): http://eforexcel.com/wp/downloads-18-sample-csv-files-data-sets-for-testing-sales/
Answer 0 (score: 0)
You can use transform after grouping to z-score normalize the data within each group.
import numpy as np
import pandas as pd
np.random.seed(123)
df = pd.DataFrame({'Country': list('AB')*20,
'ITEM_TYPE': list('1122')*10,
'TOTAL_COST': np.random.randint(1, 500, 40)})
gp = df.groupby(['Country', 'ITEM_TYPE'])
df['zscore'] = (df['TOTAL_COST'] - gp.TOTAL_COST.transform('mean'))/gp.TOTAL_COST.transform('std')
# Then filter
df[df.zscore.abs().gt(2)]
# Country ITEM_TYPE TOTAL_COST zscore
#1 B 1 383 2.127847
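If you would rather not add a zscore column, the same comparison can be written directly against the group mean and standard deviation. The following is only a minimal sketch under a few assumptions: the toy rows in `data` are made up for illustration (they are not from the linked CSV), and "greater than 2 standard deviations" is read as "more than 2 standard deviations away from the group mean", as in the z-score above.

```python
import pandas as pd

# Hypothetical toy data; only the column names follow the question.
data = pd.DataFrame({
    'Country': ['Chad'] * 8 + ['Laos'] * 2 + ['Haiti'],
    'ITEM_TYPE': ['Meat'] * 8 + ['Cereal'] * 2 + ['Meat'],
    'TOTAL_COST': [10.0, 11.0, 9.0, 10.0, 10.0, 10.0, 10.0, 100.0,
                   5.0, 7.0, 50.0],
})

costs = data.groupby(['Country', 'ITEM_TYPE'])['TOTAL_COST']

# transform returns a Series aligned with data's index, so each row
# is compared against the statistics of its own group.
group_mean = costs.transform('mean')
group_std = costs.transform('std')

# Rows more than 2 group standard deviations away from the group mean.
outliers = data[(data['TOTAL_COST'] - group_mean).abs() > 2 * group_std]
print(outliers)
```

Note that a group with a single record (Haiti/Meat here) has a NaN standard deviation, because std uses the sample formula by default, and a comparison against NaN is False, so such rows are never returned.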