Say I have a DataFrame:
import numpy as np
import pandas as pd

d = pd.DataFrame({'Salary' : pd.Series([1, 20000, 5, 1000, 3000],
                                       index=['Joe', 'Steph', 'Jared', 'Oliver', 'Gaby']),
                  'Sex' : pd.Series([0, 1, 0, 0, 1],
                                    index=['Joe', 'Steph', 'Jared', 'Oliver', 'Gaby'])})
        Salary  Sex
Joe          1    0
Steph    20000    1
Jared        5    0
Oliver    1000    0
Gaby      3000    1
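I wrote a function that takes a column as a parameter, computes the interquartile range of its values, and returns the number of outliers based on that range. If I also want the function to return the number of women with outlier salaries, how do I access the 'Sex' column to check the sex that corresponds to each outlier salary?
Here is my outlier function: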
def iqr_outliers(num_df, column):
    nan_count = 0
    for value in column:
        if value == 'NaN':
            nan_count += 1
    cleaned_column = [x for x in column if str(x) != "NaN"]
    iqr = np.subtract(*np.percentile(cleaned_column, [75, 25]))
    upper = np.percentile(cleaned_column, 75) + 1.5 * iqr
    lower = np.percentile(cleaned_column, 25) - 1.5 * iqr
    outliers = []
    lows = 0
    highs = 0
    fem_outliers = 0
    for value in cleaned_column:
        if value < lower:
            lows += 1
            outliers.append(value)
        elif value > upper:
            highs += 1
            outliers.append(value)
    return ({"Number of low outliers": lows, "Number of high outliers": highs, "Number of NaNs": nan_count})
Somewhere in those if statements I would like to look at the 'Sex' value for the same row, but I really don't know how to access it.
Answer 0 (score: 1)
Note that you can use quantile to compute the quartiles that define the interquartile range:
In [21]: d
Out[21]:
        Salary  Sex
Joe          1    0
Steph    20000    1
Jared        5    0
Oliver    1000    0
Gaby      3000    1
In [22]: iqr = d.Salary.quantile([.25,.75]).values
In [23]: iqr
Out[23]: array([ 5., 3000.])
Then you can use elementwise boolean operations:
In [24]: (d.Salary < iqr[0]) | (d.Salary > iqr[1])
Out[24]:
Joe        True
Steph      True
Jared     False
Oliver    False
Gaby      False
Name: Salary, dtype: bool
Finally, you can use the result to select from the whole data frame:
In [26]: d[(d.Salary < iqr[0]) | (d.Salary > iqr[1])]
Out[26]:
       Salary  Sex
Joe         1    0
Steph   20000    1
Or something to that effect. I don't remember the details of Tukey outliers offhand, but they should be easy to handle with the approach above.
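For reference, Tukey's rule is commonly stated as follows, where Q_1 and Q_3 are the 25th and 75th percentiles (iqr[0] and iqr[1] in this session). A value counts as an outlier when it falls below the lower fence or above the upper fence. The labels "lower" and "upper" are only names chosen for this sketch:

```latex
\[
\mathrm{IQR} = Q_3 - Q_1, \qquad
\text{lower} = Q_1 - 1.5\,\mathrm{IQR}, \qquad
\text{upper} = Q_3 + 1.5\,\mathrm{IQR}
\]
```

With the salaries here, Q_1 = 5 and Q_3 = 3000, so IQR = 2995, lower = -4487.5 and upper = 7492.5.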
In [40]: IQR = iqr[1] - iqr[0]
In [41]: upper = 1.5*IQR+iqr[1]
In [42]: lower = iqr[0] - 1.5*IQR
In [43]: (d.Salary < lower) | (d.Salary > upper)
Out[43]:
Joe       False
Steph      True
Jared     False
Oliver    False
Gaby      False
Name: Salary, dtype: bool
In [44]: d[(d.Salary < lower) | (d.Salary > upper)]
Out[44]:
       Salary  Sex
Steph   20000    1
To get the number of women, you can use sum.
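Putting the pieces together, here is a minimal sketch of how the original function might be restructured around boolean masks. It is not the only way to do this. It assumes that Sex == 1 means female (the question does not say which code is which), and the sex_column and female_code parameters, as well as the "Number of female outliers" key, are names invented for this example.

```python
import pandas as pd

d = pd.DataFrame(
    {"Salary": [1, 20000, 5, 1000, 3000], "Sex": [0, 1, 0, 0, 1]},
    index=["Joe", "Steph", "Jared", "Oliver", "Gaby"],
)


def iqr_outliers(df, column, sex_column="Sex", female_code=1):
    """Count Tukey outliers in df[column] and how many of them are women.

    sex_column and female_code are hypothetical parameters for this sketch;
    female_code=1 is an assumption about how the data is encoded.
    """
    values = df[column]
    # Series.quantile skips NaN values by default.
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr

    # Boolean masks stay aligned with the rows of df, so they can be
    # combined with a condition on any other column of the same frame.
    low_mask = values < lower
    high_mask = values > upper
    is_outlier = low_mask | high_mask
    is_female = df[sex_column] == female_code

    # Summing a boolean Series counts its True entries.
    return {
        "Number of low outliers": int(low_mask.sum()),
        "Number of high outliers": int(high_mask.sum()),
        "Number of female outliers": int((is_outlier & is_female).sum()),
        "Number of NaNs": int(values.isna().sum()),
    }


result = iqr_outliers(d, "Salary")
print(result)
```

Because the masks share the DataFrame's index, there is no need to loop over values and look up the matching row by hand; the row alignment does that work.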