I trained a model and cannot explain its answer in some cases.
I created a toy training sample:
CourtIDGAS Addr_upd
03MS0001 usa, new-york, times square, 1
03MS0001 usa, new-york, times square, 3
03MS0001 usa, new-york, times square, 5
03MS0001 usa, new-york, times square, 7
03MS0001 usa, new-york, times square, 9
03MS0001 usa, new-york, times square, 2
03MS0001 usa, new-york, times square, 4
03MS0001 usa, new-york, times square, 6
03MS0001 usa, new-york, times square, 8
03MS0001 usa, new-york, times square, 10
03MS0001 usa, new-york, times square, 12
03MS0002 usa, new-york, times square, 11
03MS0002 usa, new-york, times square, 13
03MS0002 usa, new-york, times square, 14
03MS0002 usa, new-york, times square, 16
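I convert the text to vectors with CountVectorizer and use RidgeClassifier to predict the class of an address:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import RidgeClassifier

# Tokens are runs of Cyrillic letters, digits, '/' and '-'.
vec = CountVectorizer(token_pattern=r'(?u)\b[а-яё0-9\/\-]+\b', min_df=1)
X = vec.fit_transform(df.Addr_upd)
y = df["CourtIDGAS"]
clf = RidgeClassifier(random_state=42)
clf.fit(X, y)

When I predict on the training sample, I get the right answers.
But when I predict on other data (for example usa, new-york, times square, 18), I get class 03MS0001.
I can't explain this. The largest number in the vocabulary is 16, but it seems to me that this example is closer to 03MS0002.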
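To make the case easier to reproduce, here is a minimal self-contained sketch. It assumes the toy sample is loaded into a pandas DataFrame named df; the DataFrame construction and the names houses, query and known_tokens are assumptions added for illustration and are not part of the code above. It prints the fitted vocabulary, the tokens of the new address that the vectorizer actually knows, and the classifier's decision value for that address.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import RidgeClassifier

# Rebuild the toy sample: house numbers per class, as listed in the table.
houses = {
    "03MS0001": [1, 3, 5, 7, 9, 2, 4, 6, 8, 10, 12],
    "03MS0002": [11, 13, 14, 16],
}
rows = [(court, f"usa, new-york, times square, {n}")
        for court, nums in houses.items() for n in nums]
df = pd.DataFrame(rows, columns=["CourtIDGAS", "Addr_upd"])

# Same vectorizer and classifier settings as in the question.
vec = CountVectorizer(token_pattern=r"(?u)\b[а-яё0-9\/\-]+\b", min_df=1)
X = vec.fit_transform(df.Addr_upd)
y = df["CourtIDGAS"]
clf = RidgeClassifier(random_state=42)
clf.fit(X, y)

# Address that is not in the training sample.
query = "usa, new-york, times square, 18"
x_new = vec.transform([query])

# Tokens of the query that exist in the fitted vocabulary;
# everything else is silently dropped by transform().
known_tokens = [t for t in vec.build_analyzer()(query) if t in vec.vocabulary_]

# For two classes, a positive score means clf.classes_[1], a negative one clf.classes_[0].
score = clf.decision_function(x_new)[0]
prediction = clf.predict(x_new)[0]

print("vocabulary:", sorted(vec.vocabulary_))
print("known tokens in query:", known_tokens)
print("decision value:", score, "->", prediction)
```

If the house number is missing from the list of known tokens, the prediction cannot depend on it, because the vectorizer only counts tokens it saw during fit.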
How can this classifier's answer be explained? What is the right way to work with data like this?