Question

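I understand how to use a vectorizer to get the idf values and the vocabulary. In the vocabulary, the words are the dictionary keys and the values are the word frequencies; however, the values I want are the idf values.

I haven't been able to try anything because I don't know how to use sklearn.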

from sklearn.feature_extraction.text import TfidfVectorizer
# list of text documents
text = ["The quick brown fox jumped over the lazy dog.",
        "The dog.",
        "The fox"]
# create the transform
vectorizer = TfidfVectorizer()
# tokenize and build vocab
vectorizer.fit(text)
# summarize
print(vectorizer.vocabulary_)
print(vectorizer.idf_)
# encode document
vector = vectorizer.transform([text[0]])
# summarize encoded vector
print(vector.shape)
print(vector.toarray())

The code above is what I originally tried to use.

Since then, I have come up with a new solution that does not use scikit:

import math

# text_array is assumed to be a list of document strings (e.g. the `text` list above)
total_dict = {}
for string in text_array:
    # split into words (naive whitespace split) and count each word once per document,
    # since idf needs the number of documents containing the word, not the raw count
    for word in set(string.lower().split()):
        total_dict[word] = total_dict.get(word, 0) + 1
for word in total_dict:  # calculate the idf of each word in the dictionary using this url: https://nlpforhackers.io/tf-idf/
    total_dict[word] = math.log(len(text_array) / float(1 + total_dict[word]))
    print("word", word, ":", total_dict[word])

Let me know whether the snippet above is enough to give a reasonable estimate of what is going on. I have included a link to the guide I used.

Answer 1

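You can call vectorizer.fit_transform(text) directly the first time. It builds the vocabulary from all the words/tokens in the text and returns the encoded documents.

You can then use vectorizer.transform(anothertext) to vectorize another text with the same mapping as the previous one.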

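More explanation:

fit() learns the vocabulary and idf from the training set. transform() converts documents using the vocabulary learned by the preceding fit().

So you should call fit() only once, and you can then call transform() as many times as you like.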
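As a minimal sketch of the fit-once, transform-many pattern described above: the training sentences come from the question, while `new_docs` and the variable names are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Training documents taken from the question
train_docs = ["The quick brown fox jumped over the lazy dog.",
              "The dog.",
              "The fox"]
# Hypothetical unseen documents, used only for illustration
new_docs = ["The lazy fox", "A completely unseen sentence"]

vectorizer = TfidfVectorizer()
# Learn the vocabulary and idf once, and encode the training documents
train_matrix = vectorizer.fit_transform(train_docs)
# Reuse the learned vocabulary and idf; words never seen during fit are ignored
new_matrix = vectorizer.transform(new_docs)

print(train_matrix.shape)
print(new_matrix.shape)
```

Both matrices have the same number of columns because transform() keeps the column mapping learned during fit. A document made up entirely of unseen words is encoded as an all-zero row.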

You need to build a dictionary of IDF values that maps each word to its IDF value.
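One way this could be done, as a sketch rather than the only approach: vocabulary_ maps each word to its column index, and idf_ holds the idf for each column, so the two can be joined. The name `idf_by_word` is just an illustrative choice.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

text = ["The quick brown fox jumped over the lazy dog.",
        "The dog.",
        "The fox"]

vectorizer = TfidfVectorizer()
vectorizer.fit(text)

# vocabulary_: word -> column index; idf_: array of idf values, one per column
idf_by_word = {word: float(vectorizer.idf_[index])
               for word, index in vectorizer.vocabulary_.items()}

for word, idf in sorted(idf_by_word.items()):
    print(f"{word}: {idf:.4f}")
```

With the default settings (smooth_idf=True), sklearn computes idf as ln((1 + n) / (1 + df)) + 1, so a word that appears in every document, such as "the" here, gets an idf of 1.0 rather than 0. These values will therefore differ from the hand-rolled log(N / (1 + df)) formula in the question.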
