Get top words (highest tf-idf value) from a tf-idf sparse matrix

Time: 2017-10-11 17:23:18

Tags: python feature-extraction tf-idf sklearn-pandas

I have a list of size 208 (a list of 208 sentence arrays) that looks like this:

all_words = [["this is a sentence ... "] , [" another one hello bob this is alice ... "] , ["..."] ...]

I want to get the words with the highest tf-idf values. I created a tf-idf matrix:

from sklearn.feature_extraction.text import TfidfVectorizer
tokenize = lambda doc: doc.split(" ")
sklearn_tfidf = TfidfVectorizer(norm='l2', tokenizer=tokenize, ngram_range=(1,2))
tfidf_matrix = sklearn_tfidf.fit_transform(all_words)
sentences = sklearn_tfidf.get_feature_names()
dense_tfidf = tfidf_matrix.todense()

Now I don't know how to get the words with the highest tf-idf values.

Each column of dense_tfidf represents a word or a two-word phrase (the matrix is 208x5481).

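Summing each column did not really help: it gave the same result as a simple top-words list (I guess because it is the same as a simple word count).

How can I get the words with the highest tf-idf values? Or how can I normalize it sensibly?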

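1 answer:

Answer 0: (score: 2)

I ran into a similar problem and found a solution at https://towardsdatascience.com/multi-class-text-classification-with-scikit-learn-12f1e60e0a9f; just change the X and y inputs to match your DataFrame. The code from the blog post is below. The sklearn documentation also helped me: http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.chi2.html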

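from sklearn.feature_selection import chi2
import numpy as np

# category_to_id, features, labels and tfidf come from the blog post's setup.
N = 2  # number of top terms to print per category
for Product, category_id in sorted(category_to_id.items()):
    # chi2 returns (scores, p-values); score each term against this category
    features_chi2 = chi2(features, labels == category_id)
    indices = np.argsort(features_chi2[0])  # ascending, so the best terms come last
    # Newer scikit-learn releases use get_feature_names_out() instead.
    feature_names = np.array(tfidf.get_feature_names())[indices]
    unigrams = [v for v in feature_names if len(v.split(' ')) == 1]
    bigrams = [v for v in feature_names if len(v.split(' ')) == 2]
    print("# '{}':".format(Product))
    print("  . Most correlated unigrams:\n. {}".format('\n. '.join(unigrams[-N:])))
    print("  . Most correlated bigrams:\n. {}".format('\n. '.join(bigrams[-N:])))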
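
The chi2 approach above needs class labels, which the question does not mention having. Without labels, one option is to rank terms directly by their tf-idf weights. The following is only a rough sketch under some assumptions: the toy `docs` list is made-up data standing in for the 208 sentences, the helper names `top_terms_overall` and `top_terms_for_doc` are hypothetical, and ranking by the per-term maximum is just one possible aggregation, not necessarily the best one for your data.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the 208 sentences in the question (made-up data).
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "alice sent bob a secret message",
]

vectorizer = TfidfVectorizer(norm="l2", ngram_range=(1, 2))
tfidf_matrix = vectorizer.fit_transform(docs)  # sparse, shape (n_docs, n_terms)

# get_feature_names_out() exists in scikit-learn >= 1.0; older releases
# only provide get_feature_names().
if hasattr(vectorizer, "get_feature_names_out"):
    terms = np.asarray(vectorizer.get_feature_names_out())
else:
    terms = np.asarray(vectorizer.get_feature_names())


def top_terms_overall(matrix, terms, n=5):
    """Rank terms by their maximum tf-idf weight in any single document."""
    scores = matrix.max(axis=0).toarray().ravel()
    order = np.argsort(scores)[::-1][:n]  # indices of the n largest scores
    return [(str(terms[i]), float(scores[i])) for i in order]


def top_terms_for_doc(matrix, terms, doc_index, n=5):
    """Return the n highest-weighted terms of one document (zero weights skipped)."""
    row = matrix[doc_index].toarray().ravel()
    order = np.argsort(row)[::-1][:n]
    return [(str(terms[i]), float(row[i])) for i in order if row[i] > 0]


print(top_terms_overall(tfidf_matrix, terms, n=3))
print(top_terms_for_doc(tfidf_matrix, terms, doc_index=0, n=3))
```

Summing a column adds up a term's weight over every document, so terms that occur in many documents still accumulate large totals, which may be why the result looked like a plain word count. Taking the maximum (or looking at one document at a time) tends to favour terms that are distinctive somewhere, though which aggregation is most useful depends on the data.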