Question

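I'm using CountVectorizer() from sklearn to help me build a model that predicts job categories.

The goal of the project is to take a job description and classify it into a category. I want to process each row so that the model is fed a fixed number of "tokens" (words), so that it can predict the category accurately.

For example: if row 1 has 20 words in its string, then I want every row to contain 20 words, so I would need to either append 0s to the end of the array or truncate the array if there are too many words. I would like to define a Max_Length in Python to make this easier.

How should I approach this?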

Answer 1

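CountVectorizer already does what the question asks for; it is just not obvious when looking at the sparse matrix output. Converting the output to a dense matrix makes it clearer: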

from sklearn.feature_extraction.text import CountVectorizer

X_raw = ["every word in this sentence is unique"]

vectorizer = CountVectorizer()
print(vectorizer.fit_transform(X_raw).todense())

Output, with comments:

# Every word is used once
[[1 1 1 1 1 1 1]]

If a word appears in one string but not in another, it is represented by 0:

X_raw = [
    "every word in this sentence is unique",
    "every word in this sentence is unique too too",
]

print(vectorizer.fit_transform(X_raw).todense())

Output, with comments:

#             ----- The word 'too' was not used in the first sentence,
#            /      but it was used twice in the second sentence.
#           v
[[1 1 1 1 1 0 1 1]
 [1 1 1 1 1 2 1 1]]
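
Because the number of columns equals the vocabulary size, every row has the same length no matter how many words the description contains, and new text passed to transform() gets that same width. If the goal is to cap the length, the max_features parameter keeps only the most frequent terms. The sketch below is an illustration only: MAX_LENGTH, the extra sentence in X_new, and the cap of 5 are assumptions, not part of the original question.

```python
from sklearn.feature_extraction.text import CountVectorizer

X_train = [
    "every word in this sentence is unique",
    "every word in this sentence is unique too too",
]
X_new = ["a short unseen sentence"]  # hypothetical new job description

MAX_LENGTH = 5  # hypothetical cap on the number of columns

# Keep only the MAX_LENGTH most frequent terms seen during fit.
capped = CountVectorizer(max_features=MAX_LENGTH)
train_counts = capped.fit_transform(X_train).toarray()

# transform() reuses the fitted vocabulary, so the width stays the same;
# words not in the vocabulary are simply ignored.
new_counts = capped.transform(X_new).toarray()

print(train_counts.shape)
print(new_counts.shape)
```

Both arrays have MAX_LENGTH columns, so no manual padding or truncation is needed for a bag-of-words model.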
