Question

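I am trying to build a binary classifier, and most of my variables are categorical. I therefore want to convert the categorical data into dummy variables. I have the following dataset: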

ruri                object
ruri_user           object
ruri_domain         object
from_user           object
from_domain         object
from_tag            object
to_user             object
contact_user        object
callid              object
content_type        object
user_agent          object
source_ip           object
source_port          int64
destination_port     int64
contact_ip          object
contact_port         int64
toll_fraud           int64

I select only 10 of the 16 features:

def select_features(self, data):
        """Selects the features that we'll use in the model. Drops unused features"""
        features = ['ruri', 
                    'ruri_user', 
                    'ruri_domain', 
                    'from_user', 
                    'from_domain', 
                    'from_tag', 
                    'to_user',
                    'contact_user', 
                    'callid', 
                    'content_type', 
                    'user_agent', 
                    'source_ip', 
                    'source_port',
                    'destination_port', 
                    'contact_ip', 
                    'contact_port']
        dropped_features = ['ruri', 'ruri_domain', 'callid', 'from_tag', 'content_type', 'from_user']
        target = ['toll_fraud']
        X = data[features].drop(dropped_features, axis=1)
        y = data[target]
        return X, y

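I split the dataset into training and test data. Initially both subsets have the same number of features, but after converting my categorical features into dummy variables the number of columns differs between the two subsets, so the model cannot process them.

Before create_dummies: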

1665 10
555 10

After create_dummies:

1665 1564
555 765

This is where I create the dummies:

def create_dummies(self, data, cat_vars, cat_types):
        """Processes categorical data into dummy vars."""

        cat_data = data[cat_vars].values
        for i in range(len(cat_vars)):
            bins = LabelBinarizer().fit_transform(cat_data[:, 0].astype(cat_types[i]))
            cat_data = np.delete(cat_data, 0, axis=1)
            cat_data = np.column_stack((cat_data, bins))
        return cat_data


def preproc(self):
        """Executes the full preprocessing pipeline."""

        # Import Data & Split.
        X_train_, y_train, X_valid_, y_valid = self.import_and_split_data()
        # Fill NAs.
        X_train, X_valid = self.fix_na(X_train_), self.fix_na(X_valid_)
        # Preproc Categorical Vars
        cat_vars = ['ruri_user',
                    'from_domain',
                    'to_user',
                    'contact_user',
                    'user_agent',
                    'source_ip',
                    'contact_ip']

        cat_types = ['str', 'str', 'str', 'str', 'str', 'str', 'str']
        print 'Before create_dummies'
        print X_train.shape[0], X_train.shape[1]
        print X_valid.shape[0], X_valid.shape[1]

        X_train_cat, X_valid_cat = self.create_dummies(X_train, cat_vars, cat_types), self.create_dummies(X_valid,
                                                                                                          cat_vars,
                                                                                                          cat_types)

        print 'After create_dummies'
        print X_train_cat.shape[0], X_train_cat.shape[1]
        print X_valid_cat.shape[0], X_valid_cat.shape[1]

        X_train, X_valid = X_train_cat, X_valid_cat
        print 'After assignment'
        print X_train.shape[0], X_train.shape[1]
        print X_valid.shape[0], X_valid.shape[1]

        return X_train.astype('float32'), y_train.values, X_valid.astype('float32'), y_valid.values

Full code here

Dataset here

Original code from here

Answer 1

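When you split the DataFrame into training and test sets, some categories end up in the training set but not in the test set (and vice versa). Each subset is then encoded with its own set of categories, which is why your training and test sets end up with different shapes!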
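The mismatch can be reproduced on a tiny example. This is a minimal sketch, assuming a single categorical column; the values in `train_col` and `test_col` are made up for illustration and are not taken from the asker's dataset. It fits a separate `LabelBinarizer` on each subset, which is what `create_dummies` does when it is called once for train and once for test.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

# Hypothetical toy values standing in for one categorical column
# (for example 'user_agent'); not taken from the real dataset.
train_col = np.array(["a", "b", "c", "d", "a"])  # 4 distinct categories
test_col = np.array(["a", "b", "e"])             # 3 distinct categories

# A fresh binarizer is fitted on each subset, so each one only
# learns the categories that happen to be present in that subset.
train_bins = LabelBinarizer().fit_transform(train_col)
test_bins = LabelBinarizer().fit_transform(test_col)

print(train_bins.shape)  # (5, 4)
print(test_bins.shape)   # (3, 3)
```

The two arrays have a different number of columns, and even the columns that do line up by position do not necessarily refer to the same category.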

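As suggested in the comments, you need to do all the preprocessing before splitting into training and test sets. There is no need to preprocess the training and test sets separately.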

That way every possible category gets encoded, and then you can split.
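A minimal sketch of the encode-then-split order described above. It uses `pandas.get_dummies` in place of the `LabelBinarizer` loop from the question, which is a substitution on my part and not the asker's code. The toy DataFrame is hypothetical; only the column names are borrowed from the question.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame; the values are invented for illustration.
data = pd.DataFrame({
    "user_agent": ["a", "b", "c", "d", "a", "b", "e", "c"],
    "source_ip": ["1.1.1.1", "2.2.2.2", "1.1.1.1", "3.3.3.3",
                  "2.2.2.2", "1.1.1.1", "4.4.4.4", "3.3.3.3"],
    "source_port": [5060, 5061, 5060, 5062, 5060, 5061, 5060, 5062],
    "toll_fraud": [0, 1, 0, 0, 1, 0, 1, 0],
})
cat_vars = ["user_agent", "source_ip"]

# Encode the categorical columns on the full dataset first, so every
# category that occurs anywhere gets its own dummy column.
X = pd.get_dummies(data.drop(columns=["toll_fraud"]), columns=cat_vars)
y = data["toll_fraud"]

# Split afterwards: both subsets inherit the same columns.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=0
)

print(X_train.shape, X_valid.shape)
```

Both subsets now have the same columns in the same order, whichever rows land in each split. If encoding before the split is not an option, scikit-learn's `OneHotEncoder` with `handle_unknown='ignore'`, fitted on the training set only, is another way to keep the column count fixed.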
