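scikit-learn KMeans to OneHot in a pipeline?

Asked: 2017-03-24 13:38:00

Tags: scikit-learn k-means pipeline

I would like to use scikit-learn's KMeans to cluster a set of variables into K bins and then use OneHotEncoder to binarize the resulting column. I want to use this inside a pipeline, but I think I'm running into problems because KMeans returns the classes from its fit_predict() method rather than from fit_transform().

Here is some example code: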

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline

foo = np.random.randn(100, 50)

km = KMeans(3)
ohe = OneHotEncoder()

bar = km.fit_predict(foo)
ohe.fit_transform(bar.reshape(-1, 1))

This returns the expected 100x3 matrix:

<100x3 sparse matrix of type '<class 'numpy.float64'>'
    with 100 stored elements in Compressed Sparse Row format>

If I put KMeans in a pipeline:

pipeline = Pipeline([
    ('kmeans', KMeans(3))
])

pipeline.fit_predict(foo)

it returns the non-binarized classes:

array([1, 2, 2, 0, ... , 1])

However, if I use both KMeans and OneHotEncoder, KMeans hands the output of its fit_transform() method to OneHotEncoder, and that method "transforms X to cluster-distance space":

pipeline = Pipeline([
    ('cluster', KMeans(5)),
    ('one_hot', OneHotEncoder())
])

pipeline.fit_transform(foo)

It returns all of the linear distances one-hot encoded, as a 100x25 array:

<100x25 sparse matrix of type '<class 'numpy.float64'>'
    with 500 stored elements in Compressed Sparse Row format>
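
This is consistent with how Pipeline chains its steps: every step except the last is asked to transform the data, and for KMeans that means distances to the cluster centres, so the encoder receives five distance columns instead of a single label column. A minimal sketch comparing the two outputs (the seeded random data and the variable names are illustrative assumptions, not part of the original code):

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative data only: 100 samples, 50 features.
rng = np.random.RandomState(0)
foo = rng.randn(100, 50)

km = KMeans(n_clusters=5, n_init=10, random_state=0)

# fit_transform: one column per cluster, holding the distance to that centre.
distances = km.fit_transform(foo)
print(distances.shape)  # (100, 5)

# predict: one integer cluster label per sample.
labels = km.predict(foo)
print(labels.shape)  # (100,)

# Compare each label with the index of the smallest distance in its row.
print(np.array_equal(labels, distances.argmin(axis=1)))
```

The label column that the encoder is meant to see is the `predict` output, which a plain KMeans step never passes along inside a pipeline.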

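I then decided to try creating a sub-pipeline with KMeans, since my understanding is that you can't have a fit_predict() method in the middle of a pipeline. That didn't work either: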

pipeline = Pipeline([
    ('cluster', Pipeline([
        ('kmeans', KMeans(5))
    ])),
    ('one_hot', OneHotEncoder())
])

pipeline.fit_transform(foo)

It returns the same thing:

<100x25 sparse matrix of type '<class 'numpy.float64'>'
    with 500 stored elements in Compressed Sparse Row format>

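So now I'm not sure how to get this kind of program flow to work. Any suggestions?

EDIT:

I found a workaround by creating a new class derived from KMeans and redefining fit_transform(). I also found that I should be using LabelBinarizer() instead of OneHotEncoder().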

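from sklearn.preprocessing import LabelBinarizer

class KMeans_foo(KMeans):
    # Return cluster labels instead of cluster distances during fitting.
    # Only fit_transform() is overridden; transform() is still KMeans.transform().
    def fit_transform(self, X, y=None):
        return self.fit_predict(X)

pipeline = Pipeline([
    ('cluster', KMeans_foo(3)),
    ('binarize', LabelBinarizer())
])

pipeline.fit_transform(foo)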

This returns:

array([[0, 0, 1],
       [1, 0, 0],
       [0, 0, 1],
       ..., 
       [0, 1, 0]])
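
One caveat worth noting with this kind of subclass: when only fit_transform() is redefined, calling transform() on the fitted pipeline with new data still goes through KMeans.transform() and so yields distances again. A minimal sketch of a subclass that overrides both methods and returns the labels as a single column, which is the 2-D shape OneHotEncoder expects (the class name `KMeansLabeler` and the seeded data are hypothetical, introduced only for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder


class KMeansLabeler(KMeans):
    """Hypothetical KMeans subclass whose transform output is the cluster label."""

    def fit_transform(self, X, y=None):
        # Fit, then return labels as an (n_samples, 1) column.
        return self.fit_predict(X).reshape(-1, 1)

    def transform(self, X):
        # Same column shape for data seen after fitting.
        return self.predict(X).reshape(-1, 1)


foo = np.random.RandomState(0).randn(100, 50)

pipeline = Pipeline([
    ("cluster", KMeansLabeler(n_clusters=3, n_init=10, random_state=0)),
    ("one_hot", OneHotEncoder()),
])

encoded = pipeline.fit_transform(foo)
print(encoded.shape)  # (100, 3)

# transform() on already-fitted steps also yields one-hot rows.
new_rows = pipeline.transform(foo[:10])
print(new_rows.shape)  # (10, 3)
```

Overriding transform() changes what the subclass means by "transform", so the original distance output is no longer available from it; that trade-off is an assumption of this sketch.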

EDIT2:

Found a "cleaner" way: create a wrapper for any sklearn model whose predict output you want to use as an intermediate step:

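import pandas as pd
from sklearn.base import TransformerMixin

class ModelTransformer(TransformerMixin):
    # Wraps a model so that its predict() output is exposed as transform().
    def __init__(self, model):
        self.model = model

    def fit(self, *args, **kwargs):
        self.model.fit(*args, **kwargs)
        return self

    def transform(self, X, **transform_params):
        # One-column DataFrame holding the predicted label for each sample.
        return pd.DataFrame(self.model.predict(X))

pipeline = Pipeline([
    ('cluster', ModelTransformer(KMeans_foo(3))),
    ('binarize', LabelBinarizer())
])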
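
As far as I know, in more recent scikit-learn releases LabelBinarizer.fit_transform() accepts only the label array, so a Pipeline that passes it both X and y tends to raise a TypeError. Below is a sketch of the same wrapper idea paired with OneHotEncoder instead. The class name `PredictTransformer`, the use of clone(), and the seeded data are assumptions made for illustration, not something taken from the code above.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.cluster import KMeans
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder


class PredictTransformer(TransformerMixin, BaseEstimator):
    """Hypothetical wrapper that exposes model.predict() as transform()."""

    def __init__(self, model):
        self.model = model

    def fit(self, X, y=None):
        # Fit a copy so the estimator passed to __init__ stays unfitted.
        self.model_ = clone(self.model).fit(X, y)
        return self

    def transform(self, X):
        # Return an (n_samples, 1) column for the next pipeline step.
        return np.asarray(self.model_.predict(X)).reshape(-1, 1)


foo = np.random.RandomState(0).randn(100, 50)

pipeline = Pipeline([
    ("cluster", PredictTransformer(KMeans(n_clusters=3, n_init=10, random_state=0))),
    ("one_hot", OneHotEncoder()),
])

one_hot = pipeline.fit_transform(foo)
print(one_hot.shape)  # (100, 3)
```

Because the wrapper only calls fit() and predict(), a plain KMeans instance is enough here and no KMeans subclass is needed.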

0 answers:

No answers yet.