Does gensim.models.TfidfModel save the term frequencies?

Time: 2018-02-12 09:46:18

Tags: python nlp counter gensim tf-idf

According to the docs, they use the formula:

weights_i_j = frequency_i_j * log_2(D / doc_freq_i)
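
To make the formula concrete, here is a rough plain-Python sketch that evaluates it by hand on a toy bag-of-words corpus. It is not gensim's implementation: the corpus is made up, the names `doc_freq` and `tfidf_weights` are only illustrative, and it skips the length normalization that `TfidfModel` applies by default, so the numbers will not match what the model returns.

```python
import math
from collections import Counter

# Toy bag-of-words corpus (made-up data, for illustration only):
# each document is a list of (token_id, frequency) pairs.
corpus = [
    [(0, 1), (1, 2)],
    [(1, 1), (2, 1)],
    [(2, 3)],
    [(3, 1)],
]

num_docs = len(corpus)  # D in the formula

# doc_freq_i: number of documents that contain token i
doc_freq = Counter(token_id for doc in corpus for token_id, _ in doc)


def tfidf_weights(doc):
    """Apply frequency_i_j * log_2(D / doc_freq_i) to one document."""
    return [
        (token_id, freq * math.log2(num_docs / doc_freq[token_id]))
        for token_id, freq in doc
    ]


weights = [tfidf_weights(doc) for doc in corpus]
for doc_weights in weights:
    print(doc_weights)
```

Note that the term frequency (`freq`) comes straight from the corpus passed in, while only the document frequencies have to be collected across all documents.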

When I inspect the attributes of the TfidfModel object with dir(model), using the following code:

>>> import gensim.downloader as api
>>> from gensim.models import TfidfModel
>>> from gensim.corpora import Dictionary
>>>
>>> dataset = api.load("text8")
>>> dct = Dictionary(dataset)  # fit dictionary
>>> corpus = [dct.doc2bow(line) for line in dataset]  # convert dataset to BoW format
>>>
>>> model = TfidfModel(corpus)  # fit model
>>> dir(model)

I get:

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_adapt_by_suffix',
 '_apply',
 '_load_specials',
 '_save_specials',
 '_smart_save',
 'dfs',
 'id2word',
 'idfs',
 'initialize',
 'load',
 'normalize',
 'num_docs',
 'num_nnz',
 'save',
 'wglobal',
 'wlocal']

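But I can't seem to find where the term frequencies are stored.

If the term frequencies are not saved, is there a reason why? They must have been available at some point in order to compute the weights.

Is there a way to retrieve the term frequencies somehow during the fitting process?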

1 answer:

Answer 0 (score: 1):

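Term frequency is the number of times a term occurs in a document.

A BOW corpus converts each term in each document into a (tokenId, frequency) pair, so the term frequencies are already in the corpus you pass to the model.

For example: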

from gensim.corpora import Dictionary
from nltk.tokenize import RegexpTokenizer

text_data = ['dog cat horse donkey', 'dog woof cat meow', 'horse horse horse', 'lion tiger wolf']

# Tokenise each string   
tokeniser = RegexpTokenizer(r'\w+')

# Create list of tokens
docs = list(map(tokeniser.tokenize, text_data))

# Create Dictionary mapping tokens to their ids
dct = Dictionary(docs) 

# Create BOW corpus: (tokenId, frequency) for each token in each doc
corpus = [dct.doc2bow(line) for line in docs]

Now inspect your BOW corpus to see the term frequencies:

# BOW Corpus
In [56]: corpus
Out[56]:
[[(0, 1), (1, 1), (2, 1), (3, 1)],
 [(0, 1), (1, 1), (4, 1), (5, 1)],
 [(3, 3)],
 [(6, 1), (7, 1), (8, 1)]]
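
For intuition, the counting behind those pairs can be imitated with `collections.Counter`. This is only a sketch of the idea, not what `doc2bow` does internally. It assumes whitespace tokenisation and assigns ids alphabetically over the whole vocabulary, so some ids differ from the ones gensim produced above; `to_bow` and `bow_corpus` are illustrative names.

```python
from collections import Counter

text_data = ['dog cat horse donkey', 'dog woof cat meow', 'horse horse horse', 'lion tiger wolf']

# Assumption: simple whitespace tokenisation instead of RegexpTokenizer
docs = [text.split() for text in text_data]

# Assumption: ids are assigned alphabetically over the whole vocabulary,
# which is not necessarily how gensim's Dictionary numbers tokens
vocabulary = sorted({token for doc in docs for token in doc})
token2id = {token: idx for idx, token in enumerate(vocabulary)}


def to_bow(doc):
    """Count each token in one document and return sorted (token_id, frequency) pairs."""
    counts = Counter(doc)
    return sorted((token2id[token], freq) for token, freq in counts.items())


bow_corpus = [to_bow(doc) for doc in docs]
print(bow_corpus[2])  # [(3, 3)]: 'horse' occurs three times in the third document
```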

Inspect the dictionary to see the mapping from term to tokenId:

# Dictionary
In [60]: dct.token2id
Out[60]:
{u'cat': 0,
 u'dog': 1,
 u'donkey': 2,
 u'horse': 3,
 u'lion': 6,
 u'meow': 4,
 u'tiger': 7,
 u'wolf': 8,
 u'woof': 5}
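
To read the two outputs together, a minimal plain-Python sketch (no gensim required) can invert the mapping and label each (tokenId, frequency) pair with its term. The `corpus` and `token2id` literals are copied from the outputs above; `id2token` and `readable` are just illustrative names.

```python
# Values copied from the BOW corpus and dictionary outputs shown above
corpus = [
    [(0, 1), (1, 1), (2, 1), (3, 1)],
    [(0, 1), (1, 1), (4, 1), (5, 1)],
    [(3, 3)],
    [(6, 1), (7, 1), (8, 1)],
]
token2id = {'cat': 0, 'dog': 1, 'donkey': 2, 'horse': 3, 'lion': 6,
            'meow': 4, 'tiger': 7, 'wolf': 8, 'woof': 5}

# Invert the mapping so that ids can be turned back into terms
id2token = {token_id: token for token, token_id in token2id.items()}

# Per-document term frequencies keyed by the term itself
readable = [
    {id2token[token_id]: freq for token_id, freq in doc}
    for doc in corpus
]

for doc in readable:
    print(doc)
```

The third document comes out as `{'horse': 3}`, which is the term frequency the question was looking for. It lives in the corpus rather than in the fitted model, whose attribute list above shows only corpus-level statistics such as `dfs` and `idfs`.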