Given a dictionary:
data = {'18': [3.89, 1.28], '20': [1.39, 3.15], '15': [1.42, 3.10]}
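I want to pick out the items that are noticeably different, as 18 is. Ideally I would specify an ALLOWED_DISCREPANCY, set to 0.5 for this demonstration, as the threshold that decides which items count as different from the rest of the values.
18 and its 3.89 clearly do not belong here: most of the values are about 1.4 (comparing either value in each list is enough to reach the conclusion), and the difference abs(3.89 - 1.4) is greater than 0.5, the maximum allowed.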
Answer 0 (score: 2)
If you want a more statistical approach to finding outliers, you could do this:
import numpy as np
data = {'18': [3.89, 1.28], '20': [1.39, 3.15], '15': [1.42, 3.10]}
values = [x for sublist in data.values() for x in sublist]  # flatten all the lists
avg = np.mean(values)
stddev = np.std(values)  # population standard deviation (NumPy's default, ddof=0)
Keeping only the values within one standard deviation of the mean:
n_stddevs = 1
{k: [x for x in v if x >= avg-stddev*n_stddevs and x <= avg+stddev*n_stddevs] for k, v in data.items()}
# {'15': [1.42, 3.1], '18': [], '20': [1.39, 3.15]}
Within two standard deviations:
n_stddevs = 2
{k: [x for x in v if x >= avg-stddev*n_stddevs and x <= avg+stddev*n_stddevs] for k, v in data.items()}
# {'15': [1.42, 3.1], '18': [3.89, 1.28], '20': [1.39, 3.15]}
Within 0.5 standard deviations:
n_stddevs = 0.5
{k: [x for x in v if x >= avg-stddev*n_stddevs and x <= avg+stddev*n_stddevs] for k, v in data.items()}
# {'15': [], '18': [], '20': []}
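The comprehensions above return the values that survive the filter. If what you want is the keys of the offending items, the same idea can be wrapped in a small helper. This is only a sketch: outlier_keys is a hypothetical name that does not come from the answer, and it uses the standard library's statistics.pstdev (population standard deviation, the same convention as NumPy's default) so that it runs without NumPy.

```python
from statistics import mean, pstdev

def outlier_keys(data, n_stddevs=1.0):
    """Return the keys that hold at least one value outside
    mean +/- n_stddevs * stddev, computed over all values.
    Hypothetical helper, not part of the original answer."""
    values = [x for sublist in data.values() for x in sublist]
    avg = mean(values)
    stddev = pstdev(values)  # population standard deviation
    low = avg - n_stddevs * stddev
    high = avg + n_stddevs * stddev
    return sorted(k for k, v in data.items()
                  if any(not (low <= x <= high) for x in v))

data = {'18': [3.89, 1.28], '20': [1.39, 3.15], '15': [1.42, 3.10]}
print(outlier_keys(data, 1))  # ['18']
print(outlier_keys(data, 2))  # []
```

With one standard deviation only '18' is reported, which agrees with the empty list it gets in the filtered dictionary above.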
Answer 1 (score: 1)
Compute the mean of the values.
>>> from numpy import mean
>>> data = {'18': [3.89, 1.28], '20': [1.39, 3.15], '15': [1.42, 3.10]}
>>> avg = mean([x for sublist in data.values() for x in sublist])
>>> avg
2.3716666666666666
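Set a threshold and build a new dictionary that maps each original key to the list of values matching your constraint. Here are two examples: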
>>> thresh = 0.5
>>> {k:[x for x in v if abs(x-avg) > thresh] for k, v in data.items()}
{'18': [3.89, 1.28], '15': [1.42, 3.1], '20': [1.39, 3.15]}
>>>
>>> thresh = 1
>>> {k:[x for x in v if abs(x-avg) > thresh] for k, v in data.items()}
{'18': [3.89, 1.28], '15': [], '20': []}
Edit: considering only one position (this reuses thresh = 1 from above)
>>> pos = 0
>>> {k:v[pos] for k, v in data.items() if abs(v[pos]-avg) > thresh}
{'18': 3.89}
>>>
>>> pos = 1
>>> {k:v[pos] for k, v in data.items() if abs(v[pos]-avg) > thresh}
{'18': 1.28}
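Both answers compare against the overall mean, which is pulled upward by the values near 3.1, so a threshold of 0.5 flags everything. A possible variation, closer to the ALLOWED_DISCREPANCY of 0.5 in the question, is to compare each value with the median of the values at the same position. This is a sketch under assumptions, not either answerer's method: discrepant_keys is a hypothetical name, every list is assumed to have the same length, and the median is swapped in because it is less affected by the outlier itself.

```python
from statistics import median

ALLOWED_DISCREPANCY = 0.5  # threshold taken from the question

data = {'18': [3.89, 1.28], '20': [1.39, 3.15], '15': [1.42, 3.10]}

def discrepant_keys(data, allowed):
    """Return the keys with at least one value further than `allowed`
    from the median of the values at the same list position.
    Hypothetical helper; assumes all lists have equal length."""
    columns = zip(*data.values())              # group the values by position
    medians = [median(col) for col in columns]  # one median per position
    return [k for k, v in data.items()
            if any(abs(x - m) > allowed for x, m in zip(v, medians))]

flagged = discrepant_keys(data, ALLOWED_DISCREPANCY)
print(flagged)  # ['18']
```

Here the medians are 1.42 for position 0 and 3.10 for position 1, so 3.89 and 1.28 are both more than 0.5 away, while the values of '20' and '15' stay within 0.05.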