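What is a better way to remove statistical outliers?

Time: 2012-07-07 00:11:18

Tags: python statistics

This code works, but I can't help feeling it's a hack, especially the "offset" part. I had to put it there because otherwise every `del` shifts the positions that the remaining index values in `deletes` refer to by one.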

    # remove outliers > devs # of std deviations
    devs = 1
    deletes = []
    for num, duration in enumerate(durations):
        if (duration > (mean_duration + (devs * std_dev_one_test))) or \
            (duration < (mean_duration - (devs * std_dev_one_test))):
            deletes.append(num)
    offset = 0
    for delete in deletes:
        del durations[delete - offset]
        del dates[delete - offset]
        offset += 1

Any ideas on how to make this better?

4 answers:

Answer 0 (score: 4)

Build a list of keepers while iterating over the list:

def isKeeper( duration ):
    if (duration > (mean_duration + (devs * std_dev_one_test))) or \
            (duration < (mean_duration - (devs * std_dev_one_test))):
        return False
    return True

durations = [duration for duration in durations if isKeeper(duration)]
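Note that the comprehension above filters only `durations`, while the code in the question also deletes the matching entries from `dates`. A minimal sketch of keeping both lists in step, assuming they have equal length; the helper name `filter_pairs`, the sample data, and the choice of population standard deviation are all assumptions for illustration, not part of the original answer:

```python
from statistics import mean, pstdev

def filter_pairs(durations, dates, devs=1):
    """Keep only the (duration, date) pairs within devs std devs of the mean."""
    mean_duration = mean(durations)
    # Population standard deviation is an assumption; the question does not
    # say how std_dev_one_test was computed.
    std_dev = pstdev(durations)
    kept = [(d, t) for d, t in zip(durations, dates)
            if abs(d - mean_duration) <= devs * std_dev]
    kept_durations = [d for d, _ in kept]
    kept_dates = [t for _, t in kept]
    return kept_durations, kept_dates

# Hypothetical sample data
sample_durations = [10, 12, 11, 13, 95]
sample_dates = ["d1", "d2", "d3", "d4", "d5"]
kept_durations, kept_dates = filter_pairs(sample_durations, sample_dates)
```

Because both result lists are built from the same filtered pairs, they cannot drift out of alignment, and no index bookkeeping is needed.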

Answer 1 (score: 3)

Maybe something like this:

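import numpy as np

myList = [1, 2, 3, 4, 5, 6, 7, 3, 4, 5, 3, 5, 99]

mean_duration = np.mean(myList)
std_dev_one_test = np.std(myList)

def drop_outliers(x):
    # Return an explicit boolean. Returning x itself would also drop a
    # value of 0, because filter() discards falsy results.
    return abs(x - mean_duration) <= std_dev_one_test

# list() is needed on Python 3, where filter() returns a lazy iterator
myList = list(filter(drop_outliers, myList))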

Result:

>>> myList
[1, 2, 3, 4, 5, 6, 7, 3, 4, 5, 3, 5]
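Since this answer already relies on NumPy, the same filter could also be written as a boolean mask. This is only a sketch using the sample list from the answer; the names `values`, `mask`, and `filtered` are assumptions:

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 6, 7, 3, 4, 5, 3, 5, 99])
devs = 1

# True where a value lies within devs standard deviations of the mean.
# np.std defaults to the population standard deviation (ddof=0).
mask = np.abs(values - values.mean()) <= devs * values.std()
filtered = values[mask]
```

The same `mask` can index a parallel array of equal length (for example, an array of dates), which keeps the two in step.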

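Answer 2 (score: 1)

Are you deleting items from the list, which shifts the indices, and compensating with an offset?

If so, just delete from the back, so that removing an item does not affect the positions of the items you have yet to delete.

In other words, iterate from the last item toward the front of the list.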
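A minimal sketch of deleting from the back, assuming the indices to delete have already been collected as in the question; the helper name `delete_indices` and the sample data are hypothetical:

```python
def delete_indices(items, indices):
    """Delete the given positions from items in place, highest index first."""
    # Deleting the highest index first leaves the lower indices valid,
    # so no offset is needed.
    for index in sorted(indices, reverse=True):
        del items[index]

# Hypothetical sample data
durations = [10, 12, 95, 11, 2]
dates = ["d1", "d2", "d3", "d4", "d5"]
deletes = [2, 4]  # positions of the outliers

delete_indices(durations, deletes)
delete_indices(dates, deletes)
```

After this runs, `durations` and `dates` have lost the same positions and remain aligned.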

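These SO questions may interest you: "Delete many elements of list (python)" and "Python: Removing list element while iterating over list".

Another good SO discussion can be found here: "Remove items from a list while iterating" (thanks to @PaulMcGuire for suggesting it in the comments).

Answer 3 (score: 0)

If your dataset is small, you can simply invert the logic and keep values instead of deleting them: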

# keep values within devs standard deviations of the mean
devs = 1
keeps = []
for duration in durations:
    if (duration <= (mean_duration + (devs * std_dev_one_test))) and \
        (duration >= (mean_duration - (devs * std_dev_one_test))):
        keeps.append(duration)