I found this function on GitHub.
def std_div(data, threshold=3):
    std = data.std()
    mean = data.mean()
    isOutlier = []
    for val in data:
        if val/std > threshold:
            isOutlier.append(True)
        else:
            isOutlier.append(False)
    return isOutlier
I want to apply it to each group (dept) of a DataFrame:
employee_id  dept   Salary
1            sales  10000
2            sales  110000
3            sales  120000
4            hr     5000
5            hr     6000
This works, but it computes the standard deviation over the entire DataFrame rather than per group:
df["std_div"]= df.from_dict(std_div(df.Salary))
Answer 0 (score: 1)
You could group by the column of interest, then use a for loop to run the function on that column for each group and write the result back at that group's index:
for name, group in df.groupby('dept'):
    df.loc[group.index, 'outlier'] = std_div(group.Salary)
df
employee_id  dept   Salary  outlier
1            sales  10000   False
2            sales  110000  False
3            sales  120000  False
4            hr     5000    True
5            hr     6000    True
Depending on the output you want, you can assign the return values back to the original DataFrame, as the loop above does.
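If you would rather avoid the explicit loop, the same per-group flag can probably be computed with groupby(...).transform. The following is only a sketch under a few assumptions: the sample DataFrame is rebuilt from the table in the question, and the column names ratio_outlier and z_outlier are made up for illustration. The first column keeps the original test (value divided by the group's standard deviation); the second subtracts the group mean first, which is the more conventional z-score and is not what std_div does.

```python
import pandas as pd

# Sample data rebuilt from the question's table.
df = pd.DataFrame({
    "employee_id": [1, 2, 3, 4, 5],
    "dept": ["sales", "sales", "sales", "hr", "hr"],
    "Salary": [10000, 110000, 120000, 5000, 6000],
})

threshold = 3
salary_by_dept = df.groupby("dept")["Salary"]

# transform() returns a Series aligned with df, holding each row's group statistic.
group_std = salary_by_dept.transform("std")
group_mean = salary_by_dept.transform("mean")

# Same test as std_div in the question: value / std > threshold (mean is not used).
df["ratio_outlier"] = (df["Salary"] / group_std) > threshold

# Assumption: a conventional z-score, |value - mean| / std > threshold.
df["z_outlier"] = ((df["Salary"] - group_mean) / group_std).abs() > threshold

print(df)
```

On the sample data, ratio_outlier matches the loop's output (only the hr rows are flagged), while z_outlier flags nothing, because no salary lies more than three standard deviations from its department's mean.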