Suppose I have a list of feature counts for n classes/labels, for example:
feature 1,label1 = 10 # word, label = frequency count
feature 1,label2 = 0
feature 2,label1 = 3
feature 2,label2 = 0
As JSON, for the words "bad" and "good", it would look like this:
{
"bad": {"pos": 1, "neg": 15, "neu": 2},
"good": {"pos": 13, "neg": 3, "neu": 2}
}
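To make the structure concrete, here is a rough sketch of how I imagine these counts could be pivoted into a label-by-word matrix. It assumes the archive really is JSON shaped like the example above; the names `raw`, `label_docs`, and `X_legacy` are made up for illustration, and using `DictVectorizer` here is only my guess at a reasonable tool, not something the old application did.

```python
import json

from sklearn.feature_extraction import DictVectorizer

# Assumption: the archived counts are stored as JSON, word -> {label: count}.
raw = """
{
  "bad":  {"pos": 1,  "neg": 15, "neu": 2},
  "good": {"pos": 13, "neg": 3,  "neu": 2}
}
"""
word_to_label_counts = json.loads(raw)

# Collect every label that appears, in a stable (sorted) order.
labels = sorted({label
                 for counts in word_to_label_counts.values()
                 for label in counts})

# Pivot to one {word: count} dict per label, i.e. one "pseudo-document" per label.
label_docs = [
    {word: counts.get(label, 0) for word, counts in word_to_label_counts.items()}
    for label in labels
]

# DictVectorizer turns the list of dicts into a sparse matrix:
# one row per label, one column per word.
dict_vect = DictVectorizer(dtype=int)
X_legacy = dict_vect.fit_transform(label_docs)

print(labels)                   # row order
print(dict_vect.feature_names_)  # column order
print(X_legacy.toarray())
```

Each row of `X_legacy` is then the total word counts for one label, which is the same kind of object as a row of a `CountVectorizer` output, just with a different vocabulary.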
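These counts are archived and were inherited from an old application (I don't have access to the original documents, long story), but they are relevant and I would like to use them. That application was a sentiment classifier that took newspaper reviews and classified them, which is the same thing I want to build.
So, how can I feed these counts to a TfidfVectorizer or CountVectorizer, or merge them with the result of running the vectorizer, i.e. with X_train_count in the following code: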
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> count_vect = CountVectorizer()
>>> data_train = {"data": ["ola good", "hey good", "good", "good", "bad", "bad", "bad"], "target":[1,1,1,1,0,0,0]}
>>> X_train_count = count_vect.fit_transform(data_train["data"])
>>> count_vect.get_feature_names()
[u'bad', u'good']
>>> print X_train_count
(0, 1) 1
(1, 1) 1
(2, 1) 1
(3, 1) 1
(4, 0) 1
(5, 0) 1
(6, 0) 1
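One direction I have been considering, as a sketch only: treat each legacy label as a single extra pseudo-document whose row holds that label's word counts, aligned to the columns of the fitted `CountVectorizer`, and stack those rows under `X_train_count`. The mapping `label_to_target` (pos to 1, neg to 0, "neu" ignored) is purely my assumption, as is the choice of `MultinomialNB`, which I picked because it only depends on per-class feature counts. Legacy words that the vectorizer never saw are dropped here.

```python
import numpy as np
from scipy.sparse import csr_matrix, vstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

legacy_counts = {
    "bad": {"pos": 1, "neg": 15, "neu": 2},
    "good": {"pos": 13, "neg": 3, "neu": 2},
}
# Assumption: how the legacy labels map onto my 0/1 targets ("neu" is skipped).
label_to_target = {"neg": 0, "pos": 1}

data_train = {
    "data": ["ola good", "hey good", "good", "good", "bad", "bad", "bad"],
    "target": [1, 1, 1, 1, 0, 0, 0],
}

count_vect = CountVectorizer()
X_train_count = count_vect.fit_transform(data_train["data"])
vocab = count_vect.vocabulary_  # word -> column index

# Build one row per legacy label, using the vectorizer's column layout.
labels = sorted(label_to_target)
legacy_rows = np.zeros((len(labels), len(vocab)), dtype=np.int64)
for word, by_label in legacy_counts.items():
    col = vocab.get(word)
    if col is None:
        continue  # legacy word not in the new vocabulary
    for row, label in enumerate(labels):
        legacy_rows[row, col] = by_label.get(label, 0)

# Stack the pseudo-documents under the real training rows.
X_combined = vstack([X_train_count, csr_matrix(legacy_rows)]).tocsr()
y_combined = np.concatenate(
    [data_train["target"], [label_to_target[label] for label in labels]]
)

clf = MultinomialNB().fit(X_combined, y_combined)
prediction = clf.predict(count_vect.transform(["good"]))
print(X_combined.shape, prediction)
```

I realise the two extra rows also nudge the class priors, and that this would not carry over directly to tf-idf weights, so I am not sure it is the right way to do it.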
Thanks for your help!