Question

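I am trying to build a recommender based on various features of an object (e.g. categories, tags, author, title, views, shares, etc.). As you can see, these features are of mixed types, and I don't have any user-specific data. After showing the details of one of the objects, I want to show 3 more similar objects. I am trying to use kNN with sklearn and found that one-hot encoding is useful in such cases, but I don't know how to use it together with KNN. Any help is welcome, even with a completely different library or approach. I am new to ML.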

Answer 1

I assume you have already cleaned your data and stored it in a pandas.DataFrame or another array-like structure. At this stage you would do:

import pandas as pd

# Retrieve and clean your data.
# Store it in an object df

df_OHE = pd.get_dummies(df)

# At this stage you will want to rescale your variables to bring them into a similar numeric range
# This is particularly important for KNN, as it uses a distance metric
from sklearn.preprocessing import StandardScaler
df_OHE_scaled = StandardScaler().fit_transform(df_OHE)

# Now you are all set to use these data to fit a KNN classifier.
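
Since the question has no labels and asks for 3 similar objects, one possible way to use the scaled matrix is an unsupervised neighbour lookup rather than a classifier. The following is only a sketch under that assumption: it swaps in sklearn's NearestNeighbors (not mentioned in the answer above), and the toy catalogue and its column values are made up for illustration.

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Hypothetical toy catalogue; replace with your own cleaned DataFrame.
df = pd.DataFrame({
    "category": ["news", "news", "sports", "sports", "news", "tech"],
    "author": ["ann", "bob", "ann", "cid", "ann", "bob"],
    "views": [120, 150, 3000, 2800, 130, 900],
    "shares": [4, 6, 90, 85, 5, 30],
})

# One-hot encode the string columns (numeric columns pass through unchanged),
# then bring every column to a comparable scale.
df_OHE = pd.get_dummies(df, dtype=float)
X = StandardScaler().fit_transform(df_OHE)

# Ask for 4 neighbours: the queried object itself plus 3 others.
nn = NearestNeighbors(n_neighbors=4).fit(X)

query_index = 0  # the object whose detail page is being shown
distances, indices = nn.kneighbors(X[[query_index]])

# Drop the queried object and keep the 3 closest remaining rows.
similar = [int(i) for i in indices[0] if i != query_index][:3]
print(df.iloc[similar])
```

Here `similar` holds the row positions of the 3 recommended objects, ordered from most to least similar.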

See the pd.get_dummies() doc. Also, this discussion explains why scaling is necessary for KNN. Note that you can try other types of scalers available in sklearn.
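
To illustrate why scaling matters for a distance-based method, here is a small sketch with made-up numbers (two hypothetical columns, views and shares). It compares which row is closest to the first row before and after scaling, using two of sklearn's scalers.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical rows: [views, shares]. Views span thousands, shares span tens.
X = np.array([
    [1000.0, 5.0],    # row 0: the query object
    [1100.0, 50.0],   # row 1: similar views, very different shares
    [1500.0, 6.0],    # row 2: somewhat different views, similar shares
    [10000.0, 0.0],   # row 3
    [0.0, 25.0],      # row 4
])

def nearest_to_first(data):
    """Return the index of the row closest to row 0, excluding row 0 itself."""
    nn = NearestNeighbors(n_neighbors=2).fit(data)
    _, idx = nn.kneighbors(data[[0]])
    return int(idx[0][1])

raw_nn = nearest_to_first(X)                                   # views dominate
std_nn = nearest_to_first(StandardScaler().fit_transform(X))   # both columns count
minmax_nn = nearest_to_first(MinMaxScaler().fit_transform(X))  # both columns count

print(raw_nn, std_nn, minmax_nn)
```

On the raw data the large-range views column decides the neighbour almost alone; after either scaler, the difference in shares carries comparable weight and the nearest row changes.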

P.S. I assume you are interested in a Python solution, since you mentioned those specific packages.

Answer 2

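Have a look at the Pipeline interface and this good introduction. Pipelines are a clean way to organize preprocessing together with model and hyperparameter selection.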

My basic setup looks like this:

from sklearn.pipeline import Pipeline, FeatureUnion, make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.neighbors import KNeighborsClassifier

class Columns(BaseEstimator, TransformerMixin):
    def __init__(self, names=None):
        self.names = names

    def fit(self, X, y=None, **fit_params):
        return self

    def transform(self, X):
        return X[self.names]

numeric = [...]      # list of numeric column names
categorical = [...]  # list of categorical column names

pipe = Pipeline([
    ("features", FeatureUnion([
        ('numeric', make_pipeline(Columns(names=numeric), StandardScaler())),
        # scikit-learn >= 1.2 names this parameter sparse_output; older versions use sparse=False
        ('categorical', make_pipeline(Columns(names=categorical), OneHotEncoder(sparse_output=False)))
    ])),
    ('model', KNeighborsClassifier())
])

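This lets you easily try out different classifiers and feature transformers (e.g. MinMaxScaler() instead of StandardScaler()), even as part of a large grid search together with the classifier's hyperparameters.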
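
As a rough sketch of what such a grid search might look like, the block below repeats the Columns helper so it runs on its own and searches over both the scaler and n_neighbors. The toy data, column names, and labels are invented for illustration; the parameter name features__numeric__standardscaler relies on make_pipeline naming each step after its lowercased class name.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import FeatureUnion, Pipeline, make_pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, StandardScaler

class Columns(BaseEstimator, TransformerMixin):
    """Select a subset of DataFrame columns (same helper as above)."""
    def __init__(self, names=None):
        self.names = names

    def fit(self, X, y=None, **fit_params):
        return self

    def transform(self, X):
        return X[self.names]

# Hypothetical toy data with a label column y, needed by the classifier.
df = pd.DataFrame({
    "views": [10, 200, 15, 220, 12, 210, 18, 190, 11, 205, 14, 215],
    "shares": [1, 30, 2, 28, 1, 33, 3, 27, 2, 31, 1, 29],
    "category": ["a", "b"] * 6,
})
y = [0, 1] * 6

pipe = Pipeline([
    ("features", FeatureUnion([
        ("numeric", make_pipeline(Columns(names=["views", "shares"]), StandardScaler())),
        ("categorical", make_pipeline(Columns(names=["category"]),
                                      OneHotEncoder(handle_unknown="ignore"))),
    ])),
    ("model", KNeighborsClassifier()),
])

# Swap the scaler step and tune the classifier in a single search.
param_grid = {
    "features__numeric__standardscaler": [StandardScaler(), MinMaxScaler()],
    "model__n_neighbors": [1, 3],
}

search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(df, y)
print(search.best_params_)
```

The same param_grid pattern extends to other steps: any step in the pipeline can be replaced or tuned by addressing it with the double-underscore path.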

How to apply KNN on a mixed dataset (numerical + categorical) after one-hot encoding with sklearn or pandas
