I have a vocabulary list ['Human', 'interface', 'machine', 'binary', 'minors', 'ESP', 'system', 'Graph']
and a list of sentences ["Human machine interface for lab abc computer applications", "A survey of user opinion of computer system response time", "The EPS user interface management system", "Relation of user perceived response time to error measurement", "The generation of random binary unordered trees", "The intersection graph of paths in trees", "Graph minors IV Widths of trees and well quasi ordering", "Graph minors A survey"].
I used CountVectorizer from sklearn to fit the sentences into a sparse matrix based on the eight words. I got the output below.
[[0 0 0 0 0 1 0 1]
[0 0 0 0 1 0 0 0]
[0 0 0 0 1 0 0 1]
[0 0 0 0 1 0 0 0]
[0 0 0 0 0 0 0 0]
[1 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0]
[0 1 0 0 0 0 0 0]
[0 1 0 0 0 0 0 0]]
Now I am trying to find out the order of these 8 words in the matrix. Any help would be greatly appreciated.
Answer 0: (score: 0)
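CountVectorizer lowercases the text by default, so the capitalized vocabulary entries "Human", "Graph", and "ESP" do not match anything. (Note that "ESP" would not match in any case, because the third sentence contains "EPS".) It also looks like the vocabulary vector in your result has been sorted in some way.
You can set lowercase=False.
lowercase : boolean, True by default. Convert all characters to lowercase before tokenizing. (sklearn doc)
I did it like this.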
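To make the effect of lowercasing concrete, here is a minimal sketch on a shortened corpus. The two-sentence `docs` list and the four-word `voc` are cut-down stand-ins chosen for illustration, not the data from the question.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Shortened, illustrative data (an assumption, not the full corpus).
docs = ["Human machine interface", "Graph minors A survey"]
voc = ["Human", "interface", "Graph", "minors"]

# Default lowercase=True: the documents are lowercased before tokenizing,
# but the supplied vocabulary is used as given, so "Human" and "Graph"
# can never match.
default_counts = CountVectorizer(vocabulary=voc).fit_transform(docs).toarray()

# lowercase=False: the documents keep their casing, so every entry can match.
cased_counts = CountVectorizer(vocabulary=voc, lowercase=False).fit_transform(docs).toarray()

print(default_counts)
print(cased_counts)
```

With the default setting only the already-lowercase entries ("interface", "minors") are counted; with lowercase=False the capitalized entries are counted as well.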
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]
voc = ['Human', 'interface', 'machine', 'binary', 'minors', 'ESP', 'system', 'Graph']

# lowercase=False keeps the original casing so that 'Human' and 'Graph' can match.
vectorizer = CountVectorizer(vocabulary=voc, lowercase=False)
X = vectorizer.fit_transform(corpus)

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
print(vectorizer.get_feature_names_out().tolist())
print(X.toarray())
# ['Human', 'interface', 'machine', 'binary', 'minors', 'ESP', 'system', 'Graph']
# [[1 1 1 0 0 0 0 0]
# [0 0 0 0 0 0 1 0]
# [0 1 0 0 0 0 1 0]
# [0 0 0 0 0 0 0 0]
# [0 0 0 1 0 0 0 0]
# [0 0 0 0 0 0 0 0]
# [0 0 0 0 1 0 0 1]
# [0 0 0 0 1 0 0 1]]
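In the matrix, each row holds the vocabulary matches for one sentence, and the columns follow the order of the feature names printed above. So "Human", "interface", and "machine" are matched in the first row (the first sentence).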
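If you want to look up the column order directly, one option is the fitted vectorizer's `vocabulary_` attribute, which maps each word to its column index. The sketch below is only an illustration: the variable names `column_of`, `words_by_column`, and `matched` are made up here, and it uses just the first sentence.

```python
from sklearn.feature_extraction.text import CountVectorizer

voc = ['Human', 'interface', 'machine', 'binary', 'minors', 'ESP', 'system', 'Graph']
sentence = "Human machine interface for lab abc computer applications"

vectorizer = CountVectorizer(vocabulary=voc, lowercase=False)
row = vectorizer.fit_transform([sentence]).toarray()[0]

# vocabulary_ is a dict mapping each word to its column index in the matrix.
column_of = vectorizer.vocabulary_

# Sort the words by column index to recover the column order.
words_by_column = sorted(column_of, key=column_of.get)

# Pair each column's word with its count to see which words were matched.
matched = [word for word, count in zip(words_by_column, row) if count > 0]

print(words_by_column)
print(matched)
```

When the vocabulary is passed in as a list, the columns should come out in the same order as that list, which is what `words_by_column` lets you check.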