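I have to deal with datasets that contain many outliers. To address this, I tried the zscore function from scipy.stats. I noticed that zscore works well on data whose slope is close to zero, but it fails on data like this (this is just zscore applied to one column of the dataset; all other columns behave similarly). I would like to know whether you have any suggestions for removing these points effectively (using a threshold as the criterion) or, even better, replacing them with the average of the previous and next data points. Any comments would be much appreciated. Thanks in advance!
P.S. Here is the column plotted above: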
arr = np.array([8025.50488724, 8377.54637165, 8003.04448708, 8014.59457554,
8076.68641539, 8061.77025624, 8034.39841382, 8064.43972533,
8106.22354301, 8004.22617243, 8098.79430648, 8039.45244548,
8130.59272478, 8023.66620593, 8029.40399658, 8048.02549378,
8128.23559689, 7698.61077712, 8296.84184421, 8120.05000933,
8076.21439291, 8089.74958136, 8049.60475816, 8099.94072516,
8098.21972018, 8041.37988273, 8075.29784199, 8083.77079629,
8053.10370429, 8060.97668291, 8073.54578926, 8112.63856539,
8061.07610198, 8117.06288525, 8123.46424527, 7732.44228884,
7871.12824018, 8384.18892692, 8268.14661269, 8160.29729536,
8101.1124525 , 8102.00789897, 8106.71447608, 8200.3972452 ,
8157.40847494, 8155.20875575, 8105.91888192, 8139.91621857,
8208.55394513, 8153.79003229, 8208.64688519, 8176.20207854,
8116.57474558, 7851.9089821 , 8166.16732609, 8180.5166732 ,
8132.98596211, 8214.70668611, 8179.96525835, 8177.22001891,
8232.6465354 , 8219.33633614, 8132.86334991, 8123.82362545,
8205.56532738, 8169.12244837, 8166.82326228, 8173.26679646,
8160.23044661, 8180.13612851, 8174.81752165, 8210.49493436,
8214.85167436, 8255.91104396, 8215.65510485, 8173.86399449,
8175.68440431, 8222.20252751, 8248.22775749, 8316.28079657,
8208.68546766, 8368.15505505, 8298.21876447, 8255.23460166,
8234.95006346, 8206.85161334, 8271.18830895, 8264.64203939,
8275.19502371, 8260.9065879 , 8279.82303054, 8289.21328844,
8295.48813738, 8563.80075054, 8240.83179332, 8254.28919325,
8287.30553475, 8227.05404824, 8232.75123101, 8251.94776222,
8353.5107826 , 8304.55042927, 8264.06358987, 8265.42794629,
8340.13966806, 8334.66528637, 8531.29337395, 8398.74657029,
8312.50125701, 8276.1570648 , 8308.18320714, 8319.27906188,
8322.35162962, 8280.17460496, 8303.5931151 , 8478.95653878,
8591.45900298, 8394.93401816, 8413.80146216, 8344.67340526,
8379.0377189 , 8385.07964767, 8335.36651436, 8543.13704241,
8575.70560223, 8422.63839007, 8337.19361951, 8323.36171043,
8339.07277296, 8365.99533151, 8367.12965552, 8371.4433277 ,
8391.96049944, 8430.36716456, 8396.33063144, 8390.97665384,
8426.37199761, 8466.03265082, 8344.592655 , 8345.7621689 ,
8670.30946115, 8589.57966898, 8562.24372092, 8384.73158696,
8466.40966225, 8430.39344979, 8376.40974176, 8402.07626595,
8416.13159741, 8410.84375887, 8426.88826807, 8409.26272352,
8402.09544067, 8395.04502637, 8481.20458213, 8423.98201359,
8401.20516208, 8420.42737741, 8644.28546585, 8802.2026103 ,
8623.76851219, 8499.20251524, 8467.37125462, 8499.8916737 ,
8455.41339613, 8498.66957617, 8538.80582528, 8526.61485012,
8455.01056554, 8475.76698661, 8527.44941769, 8490.99847618,
8596.10795533, 8499.38078658, 8505.70999169, 9054.22196265,
8904.00118577, 9137.39213267, 8730.98719259, 8449.36357596,
8450.72010796, 8516.71144121, 8520.67283196, 8518.56975672,
8462.25313419, 8476.36308039, 8520.50808048, 8464.08646344,
8475.37011255, 8541.24342616, 8467.39153078, 8513.82941226,
8990.16196681, 8865.94673585, 8681.26204299, 8724.46278448,
8710.26882726, 8468.98507413, 8459.16286692, 8521.39279004])
Answer 0 (score: 1)
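You may want to try scipy.signal.medfilt, a median filter that does almost exactly what you describe. I tested it on your data, and a kernel size of 11 works well. You could also try scipy.ndimage.filters.uniform_filter1d (exposed as scipy.ndimage.uniform_filter1d in current SciPy releases), which is a mean filter, but I think the median filter will work best for you.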
Below is a plot showing the mean and median filters with a kernel size of 11:
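As a rough sketch of how the two filtered series in such a comparison could be computed: the `demo` array, the spike positions, and the variable names below are illustrative assumptions (a synthetic stand-in, not the column from the question), and `mode="nearest"` is just one possible edge-handling choice for the mean filter.

```python
import numpy as np
from scipy.signal import medfilt
from scipy.ndimage import uniform_filter1d

# Synthetic stand-in (assumption): a slow upward trend with noise and a few spikes.
rng = np.random.default_rng(0)
trend = 8000 + 2.5 * np.arange(200)
demo = trend + rng.normal(0, 20, size=200)
spike_idx = np.array([17, 93, 141])
demo[spike_idx] += np.array([-400.0, 300.0, 500.0])

kernel_size = 11  # medfilt requires an odd kernel size

# Median filter: a single spike inside the window barely moves the median.
median_filtered = medfilt(demo, kernel_size=kernel_size)

# Mean filter: each spike is averaged into its neighbours instead of being rejected.
mean_filtered = uniform_filter1d(demo, size=kernel_size, mode="nearest")
```

Note that medfilt pads the ends of the signal with zeros, so the first and last few filtered values deserve a closer look before they are trusted.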
Edit: This should get what you want. I'm not sure how to do it without a loop, and it may get slow for large arrays, but I think it should work.
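import numpy as np
from scipy.signal import medfilt

threshold = 100   # maximum allowed deviation from the local median
kernel_size = 11  # window length of the median filter (must be odd)

median = medfilt(arr, kernel_size=kernel_size)
diff = np.absolute(arr - median)

# Replace points that deviate too much from the local median; keep the rest.
new_data = np.zeros(np.shape(arr))
for i in range(len(diff)):
    if diff[i] > threshold:
        new_data[i] = median[i]
    else:
        new_data[i] = arr[i]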
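One possible way to drop the loop is a boolean mask with np.where. The sketch below wraps that in a hypothetical helper, `replace_outliers` (the name, its parameters, and the `interpolate` option are assumptions made for illustration). With `interpolate=True` it fills flagged points by linear interpolation between the nearest unflagged neighbours instead, which for an isolated outlier equals the average of the previous and next points, as asked in the question.

```python
import numpy as np
from scipy.signal import medfilt


def replace_outliers(values, threshold=100.0, kernel_size=11, interpolate=False):
    """Hypothetical helper: replace points that lie far from the rolling median.

    Returns the cleaned array and the boolean outlier mask.
    """
    values = np.asarray(values, dtype=float)
    median = medfilt(values, kernel_size=kernel_size)
    is_outlier = np.abs(values - median) > threshold

    if interpolate:
        # Fill flagged points from the nearest unflagged neighbours on each side.
        idx = np.arange(values.size)
        cleaned = values.copy()
        cleaned[is_outlier] = np.interp(
            idx[is_outlier], idx[~is_outlier], values[~is_outlier]
        )
        return cleaned, is_outlier

    # Vectorized equivalent of the loop: median where flagged, original elsewhere.
    return np.where(is_outlier, median, values), is_outlier
```

For example, `new_data, mask = replace_outliers(arr)` should give the same result as the loop above for the same threshold and kernel size, and `mask` can be used to drop the flagged points instead of replacing them (`arr[~mask]`).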