Predicting with categorical features

Asked: 2018-02-21 08:43:50

Tags: python pandas machine-learning keras

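I am building a binary classification model with Keras. The dataset contains many categorical features (IP address, destination number, destination address, user agent, etc.).

I cannot submit predictions: because the features are categorical, the one-hot encoded training and test data end up with a different number of columns than the data I want to predict on.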

  File "/Users/spicyramen/Documents/Development/Python/gl-env/lib/python2.7/site-packages/keras/models.py", line 1006, in predict
    return self.model.predict(x, batch_size=batch_size, verbose=verbose)
  File "/Users/spicyramen/Documents/Development/Python/gl-env/lib/python2.7/site-packages/keras/engine/training.py", line 1772, in predict
    check_batch_axis=False)
  File "/Users/spicyramen/Documents/Development/Python/gl-env/lib/python2.7/site-packages/keras/engine/training.py", line 153, in _standardize_input_data
    str(array.shape))
ValueError: Error when checking : expected dense_1_input to have shape (None, 2134) but got array with shape (34, 102)

I am able to split the data and train the model. The columns and their dtypes are:

ruri                object
ruri_user           object
ruri_domain         object
from_user           object
from_domain         object
from_tag            object
to_user             object
contact_user        object
callid              object
content_type        object
user_agent          object
source_ip           object
source_port          int64
destination_port     int64
contact_ip          object
contact_port         int64
toll_fraud           int64

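This is my logic:

  • Import the data from CSV
  • Drop the columns that are not needed
  • Generate dummy columns (encode_one_hot)
  • Split the dataset into training and testing data
  • Train the model
  • Evaluate
  • Submit predictions <- fails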
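The failure in the last step can be reproduced in a few lines. This is a minimal sketch with invented data, assuming `encode_one_hot` behaves like `pandas.get_dummies` (its implementation is not shown in the question). Encoding the training frame and the prediction frame separately creates one dummy column per category value seen in each frame, so the two results have different widths.

```python
import pandas as pd

# Hypothetical toy frames; the values are invented for illustration.
train = pd.DataFrame({
    "source_ip": ["10.0.0.1", "10.0.0.2", "10.0.0.3"],
    "destination_port": [5060, 5060, 5061],
})
predict = pd.DataFrame({
    "source_ip": ["10.0.0.2", "10.0.0.9"],
    "destination_port": [5060, 5062],
})

# Encoding each frame on its own yields one dummy column
# per distinct value present in that particular frame.
train_encoded = pd.get_dummies(train, columns=["source_ip"])
predict_encoded = pd.get_dummies(predict, columns=["source_ip"])

# The column counts differ because the sets of IP addresses differ.
print(train_encoded.shape)
print(predict_encoded.shape)
```

A model whose input layer was sized from `train_encoded` would reject `predict_encoded`, which is the same kind of shape error as the `ValueError` in the traceback.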

This is my code.

Training and testing dimensions:

Samples Columns
1665 2134
555  2134

Functions:

def preproc_test(self):
        """Pre-process testing data."""

        #Import data
        test = self.import_data(self.test_fn, drop=True)
        # Extract labels.
        labels = test.user_agent.values
        # Fix NA values.
        test = self.fix_na(test)

        # Feature Engineering
        #test = self.engineer_features(test)

        # Create dummy variables.
        test = encode_one_hot(test, 'ruri_user')
        test = encode_one_hot(test, 'from_user')
        test = encode_one_hot(test, 'from_domain')
        test = encode_one_hot(test, 'to_user')
        test = encode_one_hot(test, 'contact_user')
        test = encode_one_hot(test, 'user_agent')
        test = encode_one_hot(test, 'source_ip')
        test = encode_one_hot(test, 'contact_ip')
        return labels, test


def prepare_submission(self, name):
        labels, test_data = self.preproc_test()
        predictions = self.model.predict(test_data)
        subm = pd.DataFrame(np.column_stack([labels, np.around(predictions[:, 1])]).astype('int32'),
                            columns=['user_agent', 'toll_fraud'])
        subm.to_csv('%s.csv' % name, index=False)
        return subm

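Original issue

I am not sure whether I should adjust my prediction data to the same number of features/columns as the original data. If so, what is the best way to do it?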
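One common approach, sketched here under assumptions rather than taken from the question, is to save the column list produced when encoding the training data and reindex the encoded prediction data to that list. Categories never seen in training are dropped, and training categories absent from the prediction file become all-zero columns. The helper `align_to_training` and the toy data are hypothetical; `DataFrame.reindex` with `columns` and `fill_value` is standard pandas.

```python
import pandas as pd


def align_to_training(encoded, train_columns):
    """Return `encoded` with exactly the training columns, in training order.

    Columns missing from `encoded` are added and filled with 0;
    columns not present in `train_columns` are dropped.
    """
    return encoded.reindex(columns=train_columns, fill_value=0)


# Hypothetical toy frames; the values are invented for illustration.
train = pd.DataFrame({
    "source_ip": ["10.0.0.1", "10.0.0.2", "10.0.0.3"],
    "destination_port": [5060, 5060, 5061],
})
predict = pd.DataFrame({
    "source_ip": ["10.0.0.2", "10.0.0.9"],
    "destination_port": [5060, 5062],
})

# Encode the training data and remember its column layout.
train_encoded = pd.get_dummies(train, columns=["source_ip"])
train_columns = list(train_encoded.columns)

# Encode the prediction data, then force it into the training layout.
predict_encoded = pd.get_dummies(predict, columns=["source_ip"])
predict_aligned = align_to_training(predict_encoded, train_columns)
```

With this layout, `predict_aligned.values` has the same number of columns as the training matrix, so it would match the input shape the model expects. Whether dropping unseen categories is acceptable depends on the data.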

0 answers:

No answers yet.