Question

I found this function on GitHub.

def std_div(data, threshold=3):
    std = data.std()
    mean = data.mean()
    isOutlier = []
    for val in data:
        if val/std > threshold:
            isOutlier.append(True)
        else:
            isOutlier.append(False)
    return isOutlier

I want to apply it to each group (dept) of this DataFrame:

     employee_id  dept            Salary
      1             sales           10000
      2             sales           110000 
      3             sales           120000
      4             hr              5000
      5             hr              6000

This works, but it computes the standard deviation over the entire DataFrame instead of per dept:

df["std_div"]= df.from_dict(std_div(df.Salary))

Answer 1

You could group by the column of interest and then use a for loop to run the function on that column for each group:

for name, group in df.groupby('dept'):
    df.loc[group.index, 'outlier'] = std_div(group.Salary)

df
employee_id dept    Salary  outlier
1           sales   10000   False
2           sales   110000  False
3           sales   120000  False
4           hr      5000    True
5           hr      6000    True

Depending on what you would like the output to be, you can assign the return values back to the original DataFrame, as the loop does here with df.loc.
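
As a possible alternative to the loop, here is a minimal sketch that computes the per-group statistics with groupby(...).transform and applies the threshold in one vectorized step. It assumes the sample data from the question is built as shown. Note that std_div in the question computes mean but never uses it, so the check is val/std rather than a distance from the mean. The outlier_z column is a hypothetical variant that does subtract the group mean; it is not part of the original function.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "employee_id": [1, 2, 3, 4, 5],
    "dept": ["sales", "sales", "sales", "hr", "hr"],
    "Salary": [10000, 110000, 120000, 5000, 6000],
})

threshold = 3

# transform() returns one value per original row, aligned to df's index,
# so each row sees the mean and std of its own dept.
by_dept = df.groupby("dept")["Salary"]
group_std = by_dept.transform("std")
group_mean = by_dept.transform("mean")

# Same test as std_div in the question: value / group std > threshold.
df["outlier"] = df["Salary"] / group_std > threshold

# Hypothetical variant (not in the original function): distance from the
# group mean, measured in group standard deviations.
df["outlier_z"] = (df["Salary"] - group_mean).abs() / group_std > threshold

print(df)
```

On the sample data, the outlier column matches what the loop produces (False for the three sales rows, True for the two hr rows), while outlier_z is False for every row because no salary lies more than 3 standard deviations from its dept mean.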
