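I have a dataset like the one shown below (only 6 rows are shown here). It has 700 rows, and I want to classify each row into a category such as 9887, 8413, and so on. The problem is that the model also assigns completely unrelated inputs, such as "christmas tree", "new year", or "Nike is a good sports brand", to one of the given categories (9887, 8413, etc.). When the input is completely unrelated or is the empty string "", I want it to be classified as 0000.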
personInfo                               personID
alicia is from unitedStates              9887
alicia likes to do Yoga                  9887
cooking is one of the hobby of alicia    9887
sam is from Brazil                       8413
sam father is a doctor                   8413
In free time, sam prefers hiking         8413
My code:
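import pickle
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# df is a pandas DataFrame with the columns personInfo and personID
X_train, X_test, y_train, y_test = train_test_split(df['personInfo'], df['personID'], random_state=0)
count_vect = CountVectorizer().fit(X_train)
X_train_counts = count_vect.transform(X_train)
tfidf_transformer = TfidfTransformer().fit(X_train_counts)
X_train_tfidf = tfidf_transformer.transform(X_train_counts)
classificationModel = LinearSVC().fit(X_train_tfidf, y_train)

# save the trained model
filename = 'finalized_model.sav'
with open(filename, 'wb') as f:
    pickle.dump(classificationModel, f)

# load the model and predict, applying the same count + tf-idf transforms used in training
data_to_be_predicted = "alicia has a sister in texas"
with open(filename, 'rb') as f:
    loaded_model = pickle.load(f)
features = tfidf_transformer.transform(count_vect.transform([data_to_be_predicted]))
result = loaded_model.predict(features)
print(result)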
Here is a sample input and output:
input: "alicia has a sister in texas"
output: 9887
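For the inputs below, I want the model to return 0000 because they are unrelated, but it classifies them as 9887, 8413, or one of the other given categories: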
input: "christmas tree"
expected output: 0000
input: " "
expected output: 0000
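
To make the expected behaviour concrete, here is a minimal sketch of the kind of wrapper I have in mind. It is only an illustration under assumptions: `predict_or_reject`, `build_vocabulary`, `STOP_WORDS` and `dummy_predict` are hypothetical names that are not part of my code above, and `dummy_predict` merely stands in for the trained classifier. The idea is to return 0000 whenever the input shares no meaningful word with the training data, and to call the model only otherwise.

```python
import re

UNKNOWN_LABEL = "0000"  # label wanted for unrelated or empty input

# Hypothetical stop-word list (an assumption; it would need tuning on real data).
STOP_WORDS = {"a", "an", "the", "is", "in", "of", "to", "from"}


def tokenize(text):
    """Lowercase the text and return its word tokens of two or more characters."""
    return re.findall(r"\b\w\w+\b", text.lower())


def build_vocabulary(texts):
    """Collect every non-stop-word token seen in the training texts."""
    vocab = set()
    for text in texts:
        vocab.update(tokenize(text))
    return vocab - STOP_WORDS


def predict_or_reject(text, vocabulary, predict_fn):
    """Return UNKNOWN_LABEL if no token of `text` is in `vocabulary`,
    otherwise delegate to `predict_fn` (the real classifier)."""
    tokens = [t for t in tokenize(text) if t not in STOP_WORDS]
    if not any(t in vocabulary for t in tokens):
        return UNKNOWN_LABEL
    return predict_fn(text)


training_texts = [
    "alicia is from unitedStates",
    "alicia likes to do Yoga",
    "cooking is one of the hobby of alicia",
    "sam is from Brazil",
    "sam father is a doctor",
    "In free time, sam prefers hiking",
]
vocabulary = build_vocabulary(training_texts)


def dummy_predict(text):
    # Stand-in for the trained model; NOT the real LinearSVC.
    return "9887" if "alicia" in text.lower() else "8413"


for sample in ["alicia has a sister in texas", "christmas tree", " "]:
    print(repr(sample), "->", predict_or_reject(sample, vocabulary, dummy_predict))
```

This check is crude: an unrelated sentence that happens to contain a single training word such as "free" or "time" would still be passed on to the model, so it may need to be combined with some kind of confidence threshold on the classifier's scores.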