我有一个包含13个功能和1000万行的数据库。我想应用k均值来消除任何异常。尽管我要应用k均值,然后创建一个新列,并在数据点和聚类质心之间建立一个距离,然后创建一个新列,并在其中包含平均距离,如果该距离大于平均距离,则删除整行。但是看来我编写的代码无法正常工作。
数据集示例: https://drive.google.com/open?id=1iB1qjnWQyvoKuN_Pa8Xk4BySzXVTwtUk
df = pd.read_csv('Final After Simple Filtering.csv',index_col=None,low_memory=True)
# Dropping columns with low feature importance
del df['AmbTemp_DegC']
del df['NacelleOrientation_Deg']
del df['MeasuredYawError']
#applying kmeans
#applying kmeans
kmeans = KMeans( n_clusters=8)
clusters= kmeans.fit_predict(df)
centroids = kmeans.cluster_centers_
distance1 = kmeans.fit_transform(df)
distance2 = distance1.mean()
df['distances']=distance1-distance2
df = df[df['distances'] >=0]
del df['distances']
df.to_csv('/content//drive/My Drive/K TEST.csv', index=False)
错误:
KeyError Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
2896 try:
-> 2897 return self._engine.get_loc(key)
2898 except KeyError:
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 'distances'
During handling of the above exception, another exception occurred:
KeyError Traceback (most recent call last)
9 frames
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 'distances'
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/pandas/core/internals/blocks.py in __init__(self, values, placement, ndim)
126 raise ValueError(
127 "Wrong number of items passed {val}, placement implies "
--> 128 "{mgr}".format(val=len(self.values), mgr=len(self.mgr_locs))
129 )
130
ValueError: Wrong number of items passed 8, placement implies 1
谢谢
答案 0 :(得分:1)
您:我想应用k均值来消除任何异常。
实际上,KMeas将检测异常并将其包含在最近的群集中。损失函数是从每个点到其分配的聚类质心的最小平方距离之和。如果要消除异常值,请考虑使用z得分方法。
import numpy as np
import pandas as pd
# import your data
df = pd.read_csv('C:\\Users\\your_file.csv)
# get only numerics
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
newdf = df.select_dtypes(include=numerics)
df = newdf
# count rows in DF before kicking out records with z-score over 3
df.shape
# handle NANs
df = df.fillna(0)
from scipy import stats
df = df[(np.abs(stats.zscore(df)) < 3).all(axis=1)]
df.shape
df = pd.DataFrame(np.random.randn(100, 3))
from scipy import stats
df[(np.abs(stats.zscore(df)) < 3).all(axis=1)]
# count rows in DF before kicking out records with z-score over 3
df.shape
此外,在您有空闲时间时,请查看这些链接。
答案 1 :(得分:1)
这是您最后一个问题的后续解答。
import seaborn as sns
import pandas as pd
titanic = sns.load_dataset('titanic')
titanic = titanic.copy()
titanic = titanic.dropna()
titanic['age'].plot.hist(
bins = 50,
title = "Histogram of the age variable"
)
from scipy.stats import zscore
titanic["age_zscore"] = zscore(titanic["age"])
titanic["is_outlier"] = titanic["age_zscore"].apply(
lambda x: x <= -2.5 or x >= 2.5
)
titanic[titanic["is_outlier"]]
ageAndFare = titanic[["age", "fare"]]
ageAndFare.plot.scatter(x = "age", y = "fare")
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
ageAndFare = scaler.fit_transform(ageAndFare)
ageAndFare = pd.DataFrame(ageAndFare, columns = ["age", "fare"])
ageAndFare.plot.scatter(x = "age", y = "fare")
from sklearn.cluster import DBSCAN
outlier_detection = DBSCAN(
eps = 0.5,
metric="euclidean",
min_samples = 3,
n_jobs = -1)
clusters = outlier_detection.fit_predict(ageAndFare)
clusters
from matplotlib import cm
cmap = cm.get_cmap('Accent')
ageAndFare.plot.scatter(
x = "age",
y = "fare",
c = clusters,
cmap = cmap,
colorbar = False
)
有关所有详细信息,请参见此链接。
https://www.mikulskibartosz.name/outlier-detection-with-scikit-learn/
今天之前,我从未听说过“本地异常因素”。当我用Google搜索它时,我得到了一些信息,似乎表明它是DBSCAN的派生产品。最后,我认为我的第一个答案实际上是检测异常值的最佳方法。 DBSCAN正在聚类算法,以发现异常值,这些异常值实际上被认为是“噪声”。我认为DBSCAN的主要目的不是异常检测,而是集群。总之,正确选择超参数需要一些技巧。同样,DBSCAN在非常大的数据集上可能会变慢,因为它隐式需要计算每个采样点的经验密度,从而导致二次最坏情况下的时间复杂度,这在大型数据集上非常慢。