I have been working on a TF-IDF implementation. I am able to compute tf-idf for a set of sentences. How can I make the representation of the matrix look better?
My script:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import csv  # imports the csv module
from nltk.stem.porter import PorterStemmer

models = []
f = open(r'C:\docs\temp\sample.csv', 'r')  # raw string, so \d, \t, \s are not treated as escapes
reader = csv.reader(f)  # creates the reader object
for row in reader:  # iterates over the rows of the file in order
    models.append(row)
f.close()
print('models:' + str(models))

tm = models
print('tm' + str(tm))

token_dict = {}
stemmer = PorterStemmer()

def stem_tokens(tokens, stemmer):
    stemmed = []
    for item in tokens:
        stemmed.append(stemmer.stem(item))
    return stemmed

def tokenize(text):
    tokens = nltk.word_tokenize(text)
    stems = stem_tokens(tokens, stemmer)
    return stems

token_text = [tokenize(str(i)) for i in models]
print(token_text)

tokenized_models = [word_tokenize(str(i)) for i in models]
stopset = set(stopwords.words('english'))
mydoclist = []
for m in tokenized_models:
    stop_m = [i for i in m if str(i) not in stopset]
    mydoclist.append(stop_m)
print('mydoclist' + str(mydoclist))

from sklearn.feature_extraction.text import TfidfVectorizer

# The documents are already tokenized lists, so the tokenizer just passes them through.
tf = TfidfVectorizer(tokenizer=lambda doc: doc, lowercase=False,
                     analyzer='word', min_df=1,  # min_df=0 is rejected by recent scikit-learn
                     stop_words='english')
tfidf_matrix = tf.fit_transform(mydoclist)
feature_names = tf.get_feature_names_out()  # get_feature_names() in scikit-learn < 1.0
tv = tf.vocabulary_
print('vocab' + str(tv))
print(tfidf_matrix.todense())
For example, I have a CSV like the following:
id, text
1,jake loves me more than john loves me
2,july likes me more than robert loves me
3,He likes videogames more than baseball
Using the above, I am able to get this output:
[[ 0.2551054 0. 0.43193099 0. 0.2551054 0.2551054
0. 0.43193099 0. 0.6569893 0. 0. ]
[ 0.28807865 0. 0. 0.48775955 0.28807865 0.28807865
0. 0. 0.37095371 0.37095371 0.48775955 0. ]
[ 0.27463443 0.46499651 0. 0. 0.27463443 0.27463443
0.46499651 0. 0.35364183 0. 0. 0.46499651]]
I would like the resulting matrix to be presented as a table, where the rows are the individual sentences and the columns are the terms/words, so that each cell holds the tf-idf score of that term in that sentence:
Doc_id  word1  word2  word3  word4  word5  word6
1       0.     0.23   0.     0.232  0.     0.22
2       0.     0.3    0.     0.     0.     0.
3       0.     0.1    0.     0.22   0.     0.
Answer 0 (score: -1):
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

tf = TfidfVectorizer(tokenizer=lambda doc: doc, lowercase=False,
                     analyzer='word', min_df=1, stop_words='english')
tfidf_matrix = tf.fit_transform(mydoclist)
feature_names = tf.get_feature_names_out()  # get_feature_names() in scikit-learn < 1.0

# One row per document, one column per term.
df = pd.DataFrame(tfidf_matrix.toarray(), columns=feature_names)
print(df.head())
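If you also want the rows labelled with the Doc_id column from the CSV, you can set the DataFrame index explicitly. A minimal sketch, assuming models holds the parsed CSV rows with the id in the first column and that the header row was skipped when reading the file:

doc_ids = [row[0] for row in models]  # assumes column 0 of each row is the id
df.index = doc_ids
df.index.name = 'Doc_id'
print(df.round(3))  # rounding the scores makes the table easier to scan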