Check individual records against the standard deviation of their group

Time: 2019-06-30 15:31:40

Tags: python pandas pandas-groupby

Data:

                               Country        ITEM_TYPE   Ship Date  TOTAL_COST
0                                 Chad  Office Supplies   2/12/2011  2353920.64
1                               Latvia        Beverages   1/23/2016    34174.25
2                             Pakistan       Vegetables    2/1/2011   592408.95
3     Democratic Republic of the Congo        Household   10/6/2012  3861014.82
4                       Czech Republic        Beverages   12/5/2015   110978.89
5                         South Africa        Beverages   8/21/2012   314085.20
6                                 Laos       Vegetables   3/20/2011   438737.25
7                                China        Baby Food   5/12/2017   530868.60
8                              Eritrea             Meat   1/10/2015   886561.39
9                                Haiti  Office Supplies   7/20/2015  3253177.12
10                              Zambia           Cereal   8/24/2016    84787.64

I want to return all records whose TOTAL_COST is more than 2 standard deviations for that record's particular Country and ITEM_TYPE combination.

I get the standard deviation for each combination using:

stdevs = data.groupby(['Country', 'ITEM_TYPE'])['TOTAL_COST'].std()

My first attempt was:

results = data[data['Total Cost'] > 2*stdevs[data[['Country']][data['Item Type']]]]
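One way a per-group lookup of this kind could be made to work is to join the aggregated statistics back onto the rows by their group keys. The following is only a minimal sketch: the small `data` frame is made up for illustration, the column names follow the table above (`TOTAL_COST`, `ITEM_TYPE`), and it assumes "more than 2 standard deviations" means distance from the group mean.

```python
import pandas as pd

# Made-up illustration data; column names follow the table in the question.
data = pd.DataFrame({
    'Country': ['Chad'] * 10 + ['Laos'] * 3,
    'ITEM_TYPE': ['Meat'] * 10 + ['Cereal'] * 3,
    'TOTAL_COST': [10.0] * 9 + [100.0] + [1.0, 2.0, 3.0],
})

keys = ['Country', 'ITEM_TYPE']

# One row of statistics per (Country, ITEM_TYPE) combination.
stats = data.groupby(keys)['TOTAL_COST'].agg(['mean', 'std'])

# Attach each group's mean and std to every record of that group.
joined = data.join(stats, on=keys)

# Assumption: "more than 2 standard deviations" is measured from the group mean.
results = data[(data['TOTAL_COST'] - joined['mean']).abs() > 2 * joined['std']]
print(results)
```

Because the join keeps the original row index, the boolean mask lines up with `data` directly.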

URL of the dummy data used (10,000 records): http://eforexcel.com/wp/downloads-18-sample-csv-files-data-sets-for-testing-sales/

1 Answer:

Answer 0: (score: 0)

After grouping, you can use transform to z-score normalize the data within each group.
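The reason transform fits here is that it returns one value per original row rather than one per group, so the result can be combined with the original column directly. A small sketch with a made-up `toy` frame (not the question's data) shows the difference in shape:

```python
import pandas as pd

# Made-up toy data, only to contrast std() with transform('std').
toy = pd.DataFrame({
    'Country': ['A', 'A', 'A', 'B', 'B', 'B'],
    'ITEM_TYPE': ['1', '1', '1', '1', '1', '1'],
    'TOTAL_COST': [10.0, 20.0, 30.0, 100.0, 100.0, 400.0],
})

gp = toy.groupby(['Country', 'ITEM_TYPE'])['TOTAL_COST']

per_group = gp.std()           # one value per (Country, ITEM_TYPE) group
per_row = gp.transform('std')  # one value per record, aligned to toy.index

print(len(per_group), len(per_row))
```

Here `per_group` is indexed by the group keys, while `per_row` shares the index of `toy`, which is what makes row-wise arithmetic such as a z-score possible.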

Sample data

import numpy as np
import pandas as pd

np.random.seed(123)
df = pd.DataFrame({'Country': list('AB')*20, 
                   'ITEM_TYPE': list('1122')*10,
                   'TOTAL_COST': np.random.randint(1, 500, 40)})

Code

gp = df.groupby(['Country', 'ITEM_TYPE'])
df['zscore'] = (df['TOTAL_COST'] - gp.TOTAL_COST.transform('mean'))/gp.TOTAL_COST.transform('std')

# Then filter
df[df.zscore.abs().gt(2)]
#  Country ITEM_TYPE  TOTAL_COST    zscore
#1       B         1         383  2.127847
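The same idea can be wrapped in a small helper. This is a sketch, not part of the original answer: the function name `flag_group_outliers` and its parameters are hypothetical, and the demo frame is made up. Note that pandas computes the sample standard deviation (ddof=1) by default, so a group with a single record has a NaN standard deviation and is never flagged.

```python
import pandas as pd


def flag_group_outliers(frame, keys, value, n_std=2.0):
    """Return rows whose `value` is more than `n_std` group standard
    deviations away from the mean of their `keys` group (hypothetical helper)."""
    grouped = frame.groupby(keys)[value]
    mean = grouped.transform('mean')
    std = grouped.transform('std')  # sample std (ddof=1); NaN for one-row groups
    zscore = (frame[value] - mean) / std
    # Comparisons against NaN are False, so one-row groups are not flagged.
    return frame[zscore.abs() > n_std]


# Made-up demo data: one large group with a clear outlier, one single-row group.
demo = pd.DataFrame({
    'Country': ['A'] * 10 + ['B'],
    'ITEM_TYPE': ['x'] * 10 + ['y'],
    'TOTAL_COST': [10.0] * 9 + [100.0, 5.0],
})

outliers = flag_group_outliers(demo, ['Country', 'ITEM_TYPE'], 'TOTAL_COST')
print(outliers)
```

With the sample standard deviation, the largest possible absolute z-score in a group of n records is (n - 1) / sqrt(n), so groups of five or fewer records can never exceed a threshold of 2.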