Using `transform` vs. `fit_transform` with CountVectorizer

Asked: 2018-08-24 13:05:21

Tags: python python-3.x scikit-learn countvectorizer

I have successfully trained and tested a logistic regression model using CountVectorizer():

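from sklearn import linear_model, metrics, model_selection
from sklearn.feature_extraction.text import CountVectorizer

def train_model(classifier, feature_vector_train, label):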
    # fit the training dataset on the classifier
    classifier.fit(feature_vector_train, label)

    return classifier

def getPredictions(classifier, feature_vector_valid):
    # predict the labels on validation dataset
    predict = classifier.predict(feature_vector_valid)

    # accuracy_score expects the true labels first, then the predictions
    return metrics.accuracy_score(valid_y, predict)

def createTrainingAndValidation(column):
    global train_x, valid_x, train_y, valid_y
    train_x, valid_x, train_y, valid_y = model_selection.train_test_split(finalDF[column], finalDF['DeedType1'])

def createCountVectorizer(column):
    global xtrain_count, xvalid_count
    # create a count vectorizer object 
    count_vect = CountVectorizer()
    count_vect.fit(finalDF[column])

    # transform the training and validation data using count vectorizer object
    xtrain_count =  count_vect.transform(train_x)
    xvalid_count =  count_vect.transform(valid_x)

createTrainingAndValidation('Test')
createCountVectorizer('Test')
# train_model takes three arguments: the classifier, the training features, and the labels
classifier = train_model(linear_model.LogisticRegression(), xtrain_count, train_y)
predictions = getPredictions(classifier, xvalid_count)

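I am using a DataFrame called finalDF that holds all of the labeled text. Since this model reaches an accuracy of 0.68, I want to run it on a subset of a DataFrame with unknown labels. That subset was not included in the training and testing stages. I saved the trained model as bestClassifier.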

Now I take the subset of unknown text and try the following:

count_vect = CountVectorizer()
count_vect.fit(unknownDf['Text'])
text = unknownDf['Text']
xvalid_count =  count_vect.transform(text)

bestClassifier.predict(xvalid_count)

finalDF has 800 rows, while unknownDf has only 32 rows. How can I fix this?

1 Answer:

Answer 0 (score: 2)

I think I see what is happening. In this code:

def createCountVectorizer(column):
    global xtrain_count, xvalid_count
    # create a count vectorizer object 
    count_vect = CountVectorizer()
    count_vect.fit(finalDF[column])

    # transform the training and validation data using count vectorizer object
    xtrain_count =  count_vect.transform(train_x)
    xvalid_count =  count_vect.transform(valid_x)

you create a CountVectorizer(), call fit on it, and then call transform. What you need to do is use that same CountVectorizer() to transform unknownDf['Text'].

When you do this:

count_vect = CountVectorizer()
count_vect.fit(unknownDf['Text'])
text = unknownDf['Text']
xvalid_count =  count_vect.transform(text)

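you are creating a brand-new CountVectorizer(), which builds a new bag of words from unknownDf['Text'] alone. Its vocabulary, and therefore its number of feature columns, will not match what the classifier was trained on. What you should do instead is remove these two lines: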

count_vect = CountVectorizer()
count_vect.fit(unknownDf['Text'])

Then use the existing CountVectorizer(), the one that was fit on finalDF[column], to transform unknownDf['Text'].
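As a minimal sketch of why this matters, assume a small made-up corpus (the strings and the names labeled_text, unknown_text, and new_vect below are illustrative stand-ins, not the real finalDF or unknownDf data). A vectorizer that is refit on different text learns a different vocabulary, so its output has a different number of columns than the matrix the classifier saw during training:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-ins for finalDF[column] and unknownDf['Text'].
labeled_text = ["warranty deed of trust", "quit claim deed", "grant deed transfer"]
unknown_text = ["deed of release", "unknown transfer"]

# Vectorizer fitted on the labeled text, as in createCountVectorizer().
count_vect = CountVectorizer()
count_vect.fit(labeled_text)

# Refitting a new vectorizer on the unknown text builds a different vocabulary.
new_vect = CountVectorizer()
new_vect.fit(unknown_text)

train_shape = count_vect.transform(labeled_text).shape
reused_shape = count_vect.transform(unknown_text).shape
refit_shape = new_vect.transform(unknown_text).shape

# Column counts: the reused vectorizer matches training, the refit one does not.
print("train:", train_shape, "reused:", reused_shape, "refit:", refit_shape)
```

Only the matrix produced by the reused vectorizer has the column layout that a classifier trained on the first matrix can accept.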

In other words, find a way to call transform on unknownDf['Text'] using the CountVectorizer() that is declared as count_vect inside createCountVectorizer(column).
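One possible way to do that, sketched here under assumptions, is to have the helper return the fitted vectorizer instead of keeping it as a local variable. The function name create_count_vectorizer, the toy strings, and the labels are hypothetical and only stand in for the real data:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression


def create_count_vectorizer(fit_text):
    """Fit a CountVectorizer and return it so it can be reused later."""
    count_vect = CountVectorizer()
    count_vect.fit(fit_text)
    return count_vect


# Hypothetical toy data standing in for finalDF and unknownDf.
train_text = ["warranty deed", "quit claim deed", "warranty deed of trust", "quit claim"]
train_labels = ["warranty", "quitclaim", "warranty", "quitclaim"]
unknown_text = ["special warranty deed", "quit claim release"]

# Fit the vectorizer once and keep a reference to it.
count_vect = create_count_vectorizer(train_text)
xtrain_count = count_vect.transform(train_text)

classifier = LogisticRegression()
classifier.fit(xtrain_count, train_labels)

# Reuse the SAME fitted vectorizer: transform only, no second fit.
xunknown_count = count_vect.transform(unknown_text)
predictions = classifier.predict(xunknown_count)
print(predictions)
```

Words in the unknown text that the vectorizer never saw during fit (here "special" and "release") are simply ignored by transform, which is what keeps the column count stable.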