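I have a string named text that represents an article. I am trying to run TF-IDF on it and get a DataFrame back. The resulting DataFrame should have each word as a column name. Here is my attempt: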
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [text]
tfidf_transformer = TfidfVectorizer(min_df=1, ngram_range=(1,1), use_idf=True)
tfidf_df = tfidf_transformer.fit_transform(corpus)
tfidf_df = pd.DataFrame(tfidf_df.toarray())
print('tfidf_df: ', tfidf_df.head())
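After this code runs, my column names are numbers rather than the words that the TF-IDF features represent.
How can I get one column for each word in the text string?
Thanks!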
Answer 0 (score: 4)
You can use the vocabulary_ attribute of TfidfVectorizer.
Example:
# -*- coding: utf-8 -*-
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
import numpy as np
corpus = ["quick brown fox", "something else"]
tf_idf = TfidfVectorizer(min_df=1, ngram_range=(1,1), use_idf=True).fit(corpus)
vocab = tf_idf.vocabulary_
tf_idf_df = tf_idf.transform(corpus)
# vocabulary_ indices follow sorted term order, so sorted keys line up with the matrix columns
tf_idf_df = pd.DataFrame(tf_idf_df.toarray(), columns=sorted(vocab.keys()))
tf_idf_df
     brown      else      fox    quick  something
0  0.57735  0.000000  0.57735  0.57735   0.000000
1  0.00000  0.707107  0.00000  0.00000   0.707107
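Sorting the keys alphabetically works because the vectorizer assigns column indices in sorted term order. If you would rather not rely on that, a minimal sketch is to order the terms by the index stored in vocabulary_ itself. The variable names below (vectorizer, matrix, columns) are only illustrative, and the vectorizer is created with default settings as an assumption:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["quick brown fox", "something else"]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(corpus)

# vocabulary_ maps each term to its column index in the matrix,
# so sorting the terms by that index gives the column order directly.
columns = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

tfidf_df = pd.DataFrame(matrix.toarray(), columns=columns)
print(tfidf_df)
```

For the corpus above this should produce the same DataFrame as the example, with the column names taken from the index mapping rather than from an alphabetical sort.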