Question

我有一个包含13个功能和1000万行的数据库。我想应用k均值来消除任何异常。尽管我要应用k均值，然后创建一个新列，并在数据点和聚类质心之间建立一个距离，然后创建一个新列，并在其中包含平均距离，如果该距离大于平均距离，则删除整行。但是看来我编写的代码无法正常工作。

数据集示例： https://drive.google.com/open?id=1iB1qjnWQyvoKuN_Pa8Xk4BySzXVTwtUk

df = pd.read_csv('Final After Simple Filtering.csv',index_col=None,low_memory=True)


# Dropping columns with low feature importance
del df['AmbTemp_DegC']
del df['NacelleOrientation_Deg']
del df['MeasuredYawError']



#applying kmeans
#applying kmeans
kmeans = KMeans( n_clusters=8)


clusters= kmeans.fit_predict(df)

centroids = kmeans.cluster_centers_

distance1 = kmeans.fit_transform(df)

distance2 = distance1.mean()

df['distances']=distance1-distance2

df = df[df['distances'] >=0]


del df['distances']

df.to_csv('/content//drive/My Drive/K TEST.csv', index=False)

错误：

KeyError                                  Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2896             try:
-> 2897                 return self._engine.get_loc(key)
   2898             except KeyError:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'distances'

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
9 frames
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'distances'

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/pandas/core/internals/blocks.py in __init__(self, values, placement, ndim)
    126             raise ValueError(
    127                 "Wrong number of items passed {val}, placement implies "
--> 128                 "{mgr}".format(val=len(self.values), mgr=len(self.mgr_locs))
    129             )
    130 

ValueError: Wrong number of items passed 8, placement implies 1

谢谢

Answer 1

您：我想应用k均值来消除任何异常。

实际上，KMeas将检测异常并将其包含在最近的群集中。损失函数是从每个点到其分配的聚类质心的最小平方距离之和。如果要消除异常值，请考虑使用z得分方法。

import numpy as np
import pandas as pd

# import your data
df = pd.read_csv('C:\\Users\\your_file.csv)

# get only numerics
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
newdf = df.select_dtypes(include=numerics)

df = newdf

# count rows in DF before kicking out records with z-score over 3
df.shape

# handle NANs
df = df.fillna(0)


from scipy import stats
df = df[(np.abs(stats.zscore(df)) < 3).all(axis=1)]
df.shape


df = pd.DataFrame(np.random.randn(100, 3))
from scipy import stats
df[(np.abs(stats.zscore(df)) < 3).all(axis=1)]

# count rows in DF before kicking out records with z-score over 3
df.shape

此外，在您有空闲时间时，请查看这些链接。

https://medium.com/analytics-vidhya/effect-of-outliers-on-k-means-algorithm-using-python-7ba85821ea23

https://statisticsbyjim.com/basics/outliers/

Answer 2

这是您最后一个问题的后续解答。

import seaborn as sns
import pandas as pd
titanic = sns.load_dataset('titanic')
titanic = titanic.copy()
titanic = titanic.dropna()
titanic['age'].plot.hist(
  bins = 50,
  title = "Histogram of the age variable"
)


from scipy.stats import zscore
titanic["age_zscore"] = zscore(titanic["age"])
titanic["is_outlier"] = titanic["age_zscore"].apply(
  lambda x: x <= -2.5 or x >= 2.5
)
titanic[titanic["is_outlier"]]


ageAndFare = titanic[["age", "fare"]]
ageAndFare.plot.scatter(x = "age", y = "fare")


from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
ageAndFare = scaler.fit_transform(ageAndFare)
ageAndFare = pd.DataFrame(ageAndFare, columns = ["age", "fare"])
ageAndFare.plot.scatter(x = "age", y = "fare")


from sklearn.cluster import DBSCAN
outlier_detection = DBSCAN(
  eps = 0.5,
  metric="euclidean",
  min_samples = 3,
  n_jobs = -1)
clusters = outlier_detection.fit_predict(ageAndFare)

clusters


from matplotlib import cm
cmap = cm.get_cmap('Accent')
ageAndFare.plot.scatter(
  x = "age",
  y = "fare",
  c = clusters,
  cmap = cmap,
  colorbar = False
)

有关所有详细信息，请参见此链接。

https://www.mikulskibartosz.name/outlier-detection-with-scikit-learn/

今天之前，我从未听说过“本地异常因素”。当我用Google搜索它时，我得到了一些信息，似乎表明它是DBSCAN的派生产品。最后，我认为我的第一个答案实际上是检测异常值的最佳方法。 DBSCAN正在聚类算法，以发现异常值，这些异常值实际上被认为是“噪声”。我认为DBSCAN的主要目的不是异常检测，而是集群。总之，正确选择超参数需要一些技巧。同样，DBSCAN在非常大的数据集上可能会变慢，因为它隐式需要计算每个采样点的经验密度，从而导致二次最坏情况下的时间复杂度，这在大型数据集上非常慢。

查找K均值距离

2 个答案: