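How to remove records that contain outlier values from a DataFrame
By outliers I mean values in one or more columns that are more than 3 standard deviations away from the mean.
Example: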
row0 2 3 4 3
row1 2 3 4 3
row2 2 3 432 3
row3 2 3 4 3
I want to delete row2 because of the value 432.
Thanks.
Answer 0 (score: 0)
import numpy as np
import pandas as pd
data = np.array([['','Col1','Col2'],
['Row1',1,2],
['Row2',3,4]])
df= pd.DataFrame(data=data[1:,1:],
index=data[1:,0],
columns=data[0,1:])
#Convert to numeric
df1=df.apply(pd.to_numeric)
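#Calculate the mean and STD over all values in the frame (not per column)
mean = df1.stack().mean()
std = df1.stack().std()
#Store the upper and lower 3*STD bounds; keep them as floats,
#because casting to int would truncate the bounds and shift the cutoff
df1["Col3"] = mean + (std * 3)
df1["Col4"] = mean - (std * 3)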
#See whether the values fall between the mean+(3*STD) and mean-(3*STD)
df1['Between1'] = (df1['Col1'] > df1['Col4']) & (df1['Col1'] < df1['Col3'])
df1['Between2'] = (df1['Col2'] > df1['Col4']) & (df1['Col2'] < df1['Col3'])
df1.head()
#Keep only the rows where both checks are True
df1 = df1[df1['Between1'] & df1['Between2']]
df1.head()
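
If the mean and standard deviation should be taken per column rather than over the whole frame, as the question seems to ask, the check can be written without helper columns. The following is only a sketch under that assumption: the column names, the 20-row sample data and the function name drop_outlier_rows are made up for illustration, and the standard deviation is pandas' default sample standard deviation (ddof=1).

```python
import pandas as pd

# Hypothetical data: 20 rows shaped like the question's example,
# with a single extreme value (432) in column "c" of row2.
df = pd.DataFrame(
    {"a": [2] * 20, "b": [3] * 20, "c": [4] * 20, "d": [3] * 20},
    index=[f"row{i}" for i in range(20)],
)
df.loc["row2", "c"] = 432


def drop_outlier_rows(frame, n_std=3.0):
    """Keep rows whose values all lie within n_std sample standard
    deviations of their own column's mean."""
    # Absolute distance of every value from its column mean
    deviation = (frame - frame.mean()).abs()
    # Per-column cutoff; a constant column has std 0 and deviation 0,
    # so "<=" keeps its rows instead of dropping them
    limit = n_std * frame.std()
    keep = (deviation <= limit).all(axis=1)
    return frame[keep]


cleaned = drop_outlier_rows(df)
print(cleaned.shape)
```

One thing to watch: with only the four rows shown in the question, 432 is exactly 1.5 sample standard deviations from its column mean (mean 111, std 214), so a 3 standard deviation rule would not drop row2. The rule only starts to catch a single extreme value once there are enough rows, which is why the sketch uses 20.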