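I have the following table, where each row is a document, each column is a word, and each cell holds the number of times that word occurs in the document.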
|doc|apple|banana|cat|
|---|---|---|---|
|1|2|0|0|
|2|0|0|2|
|3|0|2|0|
Is there any way to convert this count-vectorized table into tf-idf vectors?
Edit: my solution. Let me know whether this is correct.
import numpy as np

def get_tfidf(df_tfidf):
    # df_tfidf: one row per document, one column per word, cells are raw counts
    total_docs = df_tfidf.shape[0]
    # Term frequency:
    # (number of times term w appears in a document) / (total number of
    # terms in the document)
    total_words_doc = df_tfidf.sum(axis=1)  # sum the counts, not the non-zero cells
    tf = df_tfidf.values / total_words_doc.values[:, None]
    # Inverse document frequency:
    # log_e(total number of documents / number of documents with term w in
    # it)
    words_in_doc = df_tfidf.astype(bool).sum(axis=0)
    idf = np.log(total_docs / words_in_doc)
    tf_idf = tf * idf.values[None, :]
    return tf_idf
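As a quick sanity check of the formulas described in the comments above, here is a minimal sketch that applies them to the table from the question. It assumes the `doc` column is used as the index so that only word columns take part in the arithmetic; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Counts from the table above; 'doc' becomes the index so only word columns remain.
counts = pd.DataFrame(
    {"apple": [2, 0, 0], "banana": [0, 0, 2], "cat": [0, 2, 0]},
    index=pd.Index([1, 2, 3], name="doc"),
)

# Term frequency: count of the term / total number of terms in the document.
tf = counts.div(counts.sum(axis=1), axis=0)

# Inverse document frequency: ln(number of documents / documents containing the term).
doc_freq = (counts > 0).sum(axis=0)
idf = np.log(len(counts) / doc_freq)

# Scale each column of tf by that term's idf.
tf_idf = tf.mul(idf, axis=1)
print(tf_idf.round(4))
```

Every word here makes up the whole of one document and appears in one of three documents, so each non-zero entry equals ln(3). This is the plain textbook weighting; scikit-learn's defaults add idf smoothing and L2 normalization, so its numbers will differ.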
Answer 0 (score: 0)
Assuming you have the count-vectorizer output as a pandas.DataFrame, like this:
import pandas as pd
data = [[1,2,0,0],[2,0,0,2],[3,0,2,0]]
df = pd.DataFrame(data,columns=['doc','apple','banana','cat'])
df
Output:
doc apple banana cat
0 1 2 0 0
1 2 0 0 2
2 3 0 2 0
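Note that TfidfVectorizer expects raw text documents, so passing the DataFrame to it only tokenizes the column names. Because the table already holds counts, you can use sklearn.feature_extraction.text.TfidfTransformer on the word columns to get the tf-idf vectors, as follows:
from sklearn.feature_extraction.text import TfidfTransformer
counts = df.drop(columns=['doc'])  # keep only the word-count columns
t = TfidfTransformer()
x = t.fit_transform(counts.values)
df1 = pd.DataFrame(x.toarray(), columns=counts.columns)
print(df1)
Output:
apple banana cat
0 1.0 0.0 0.0
1 0.0 0.0 1.0
2 0.0 1.0 0.0
Each document contains a single distinct word, so after the default L2 normalization every non-zero entry is 1.0.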
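For a table that already holds counts, scikit-learn's TfidfTransformer accepts the count matrix directly, whereas TfidfVectorizer tokenizes raw text itself. The sketch below is a hedged illustration rather than a definitive recipe: it runs the transformer on the counts from the question and reproduces the result by hand, assuming the library defaults `smooth_idf=True` and `norm='l2'`. The array and variable names are made up for this example.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Word counts from the table above (columns: apple, banana, cat), without 'doc'.
counts = np.array([[2, 0, 0],
                   [0, 0, 2],
                   [0, 2, 0]])

# TfidfTransformer re-weights an existing count matrix.
tfidf_sklearn = TfidfTransformer().fit_transform(counts).toarray()

# The same computation by hand, using the default settings:
# smoothed idf = ln((1 + n_docs) / (1 + doc_freq)) + 1, then L2-normalize each row.
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1
weighted = counts * idf
tfidf_manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

print(np.allclose(tfidf_sklearn, tfidf_manual))
```

With this data each row has one non-zero cell, so L2 normalization turns it into exactly 1.0 regardless of the idf value. A table with several words per document would show the weighting more clearly.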