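Converting a count vectorizer to TF-IDF

Date: 2019-09-28 03:41:38

Tags: python pandas machine-learning

I have the following table, where each row is a document, each column is a word, and each value is the number of times that word occurs in the document.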

|doc|apple|banana|cat| 
|---|---|---|---| 
|1|2|0|0| 
|2|0|0|2| 
|3|0|2|0|

Is there any way to convert this count-vectorized table into tf-idf vectors?

Edit: My solution. Let me know if this is correct.

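import numpy as np

def get_tfidf(df_tfidf):
    # df_tfidf: rows are documents, columns are terms, values are raw counts.
    # Drop the 'doc' id column (or make it the index) before calling.
    total_docs = df_tfidf.shape[0]

    # Term frequency:
    # (number of times term w appears in a document) /
    # (total number of terms in the document)
    # Sum the counts themselves; astype(bool) here would count distinct terms instead.
    total_words_doc = df_tfidf.sum(axis=1)
    tf = df_tfidf.values / total_words_doc.values[:, None]

    # Inverse document frequency:
    # log_e(total number of documents / number of documents containing term w)
    words_in_doc = df_tfidf.astype(bool).sum(axis=0)
    idf = np.log(total_docs / words_in_doc)

    tf_idf = tf * idf.values[None, :]

    return tf_idf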
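
To sanity-check the formula in the comments above, here is a minimal sketch that applies it to the three-document table from the question. The variable names are illustrative, and it assumes the `doc` column is used as the index so that only count columns take part in the arithmetic.

```python
import numpy as np
import pandas as pd

# The table from the question, with 'doc' as the index rather than a data column.
df = pd.DataFrame(
    {"apple": [2, 0, 0], "banana": [0, 0, 2], "cat": [0, 2, 0]},
    index=pd.Index([1, 2, 3], name="doc"),
)

counts = df.to_numpy(dtype=float)
n_docs = counts.shape[0]

# tf: count of the term divided by the total number of terms in the document.
tf = counts / counts.sum(axis=1, keepdims=True)

# idf: natural log of (number of documents / documents containing the term).
doc_freq = (counts > 0).sum(axis=0)
idf = np.log(n_docs / doc_freq)

tfidf = pd.DataFrame(tf * idf, index=df.index, columns=df.columns)
print(tfidf.round(4))
```

Each document here consists of a single term that appears in one of three documents, so every nonzero entry is 1 * ln(3), roughly 1.0986. A document with no terms at all would make the tf division produce NaN, so real data may need a guard for empty rows.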

1 Answer:

Answer 0: (score: 0)

Suppose you have the count vectorizer output as a pandas.DataFrame, like this:

import pandas as pd
data = [[1,2,0,0],[2,0,0,2],[3,0,2,0]]
df = pd.DataFrame(data,columns=['doc','apple','banana','cat'])
df

Output

   doc  apple  banana  cat
0    1      2       0    0
1    2      0       0    2
2    3      0       2    0

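Then, because the counts are already computed, you can use sklearn.feature_extraction.text.TfidfTransformer (rather than TfidfVectorizer, which expects raw text and would only tokenize the column names of the DataFrame) to get the tf-idf vectors, like this:

from sklearn.feature_extraction.text import TfidfTransformer
counts = df.drop(columns='doc')  # keep only the term-count columns
t = TfidfTransformer()
x = t.fit_transform(counts)
df1 = pd.DataFrame(x.toarray(), columns=counts.columns)
print(df1)

Output

   apple  banana  cat
0    1.0     0.0  0.0
1    0.0     0.0  1.0
2    0.0     1.0  0.0

Each document here contains a single term, so after the default L2 normalization every nonzero weight is 1.0.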
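
Numbers produced by scikit-learn will generally not match the hand-written formula in the question. By default, scikit-learn uses the raw count as tf, a smoothed idf of ln((1 + n) / (1 + df)) + 1, and then L2-normalizes each row. The following is a minimal sketch, assuming those documented defaults, that reproduces them with NumPy and compares the result against TfidfTransformer. The variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Count matrix from the question: rows are documents, columns are apple, banana, cat.
counts = np.array([[2, 0, 0],
                   [0, 0, 2],
                   [0, 2, 0]], dtype=float)

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)  # number of documents containing each term

# Smoothed idf, as used when smooth_idf=True (the default).
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Raw counts as tf, then L2-normalize each row (norm="l2", the default).
# Assumes no document is empty; an all-zero row would divide by zero here.
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

sklearn_result = TfidfTransformer().fit_transform(counts).toarray()
matches = np.allclose(manual, sklearn_result)
print(matches)
```

To get unnormalized weights closer to the textbook definition, TfidfTransformer accepts `norm=None` and `smooth_idf=False`. Even then its idf keeps the added 1, and tf stays a raw count rather than a per-document proportion.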