TF-IDF representation

Time: 2016-05-31 18:22:59

Tags: python python-3.x tf-idf

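I have been working on an implementation of TF-IDF.

I am able to compute tf-idf for a set of sentences.

How can I make the representation of the matrix look nicer?

My script: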

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import csv     # imports the csv module
from nltk.stem.porter import PorterStemmer

models = []

# Raw string, so that "\t" in the Windows path is not read as a tab character.
with open(r'C:\docs\temp\sample.csv', "r") as f:
    reader = csv.reader(f)  # creates the reader object
    for row in reader:      # iterates over the rows of the file in order
        models.append(row)
print('models:' + str(models))

tm = models

print('tm' + str(tm))

token_dict = {}
stemmer = PorterStemmer()

def stem_tokens(tokens, stemmer):
    # Reduce every token to its Porter stem.
    stemmed = []
    for item in tokens:
        stemmed.append(stemmer.stem(item))
    return stemmed

def tokenize(text):
    tokens = nltk.word_tokenize(text)
    stems = stem_tokens(tokens, stemmer)
    return stems

token_text = [tokenize(str(i)) for i in models]
print(token_text)

tokenized_models = [word_tokenize(str(i)) for i in models]

stopset = set(stopwords.words('english'))

mydoclist = []
for m in tokenized_models:
    stop_m = [i for i in m if str(i) not in stopset]
    mydoclist.append(stop_m)
print('mydoclist'+str(mydoclist))

from sklearn.feature_extraction.text import TfidfVectorizer

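# The documents are already tokenized, so the tokenizer passes them through unchanged.
# min_df=1 is equivalent to the original min_df=0, which recent scikit-learn releases reject.
tf = TfidfVectorizer(tokenizer=lambda doc: doc, lowercase=False, analyzer='word', min_df=1, stop_words='english')

tfidf_matrix = tf.fit_transform(mydoclist)
feature_names = tf.get_feature_names_out()  # get_feature_names() on scikit-learn < 1.0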
tv = tf.vocabulary_
print('vocab'+str(tv))
print(tfidf_matrix.todense())

For example, I have a CSV like this:

id, text
1,jake loves me more than john loves me
2,july likes me more than robert loves me
3,He likes videogames more than baseball

Using the script above, I get this output:

[[ 0.2551054   0.          0.43193099  0.          0.2551054   0.2551054
   0.          0.43193099  0.          0.6569893   0.          0.        ]
 [ 0.28807865  0.          0.          0.48775955  0.28807865  0.28807865
   0.          0.          0.37095371  0.37095371  0.48775955  0.        ]
 [ 0.27463443  0.46499651  0.          0.          0.27463443  0.27463443
   0.46499651  0.          0.35364183  0.          0.          0.46499651]]

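I would like the result matrix presented as a table in which each row is a single sentence and each column is a term/word, so that each cell holds the tf-idf score of that term in that sentence.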

Doc_id    word1 word2 word3 word4 word5 word6
1         0.    0.23  0.    0.232 0.    0.22
2         0.    0.3   0.    0.    0.    0.
3         0     0.1   0.    0.22  0.    0.

1 answer:

Answer 0 (score: -1)

import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer

# mydoclist is the list of token lists built in the question's script.
tf = TfidfVectorizer(tokenizer=lambda doc: doc, lowercase=False, analyzer='word', min_df=1, stop_words='english')

tfidf_matrix = tf.fit_transform(mydoclist)
feature_names = tf.get_feature_names_out()  # get_feature_names() on scikit-learn < 1.0
# One row per document, one column per term.
df = pd.DataFrame(tfidf_matrix.toarray(), columns=feature_names)
print(df.head())