Does gensim.models.TfidfModel store the term frequencies?
According to the docs, it uses the formula:
weights_i_j = frequency_i_j * log_2(D / doc_freq_i)
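To make the formula concrete, here is a minimal sketch in plain Python that applies it to a small hand-written BoW corpus. The toy corpus and the helper name `tfidf` are assumptions made up for illustration; this is not gensim's implementation, and it skips the unit-length normalization that gensim applies by default.

```python
import math
from collections import Counter

# Hypothetical toy BoW corpus: each document is a list of
# (token_id, term_frequency) pairs.
corpus = [
    [(0, 1), (1, 1), (2, 1), (3, 1)],
    [(0, 1), (1, 1), (4, 1), (5, 1)],
    [(3, 3)],
    [(6, 1), (7, 1), (8, 1)],
]

# D: total number of documents.
num_docs = len(corpus)

# doc_freq_i: number of documents that contain token i.
doc_freq = Counter(token_id for doc in corpus for token_id, _ in doc)


def tfidf(doc):
    """Apply weight_i_j = frequency_i_j * log_2(D / doc_freq_i), unnormalized."""
    return [
        (token_id, freq * math.log2(num_docs / doc_freq[token_id]))
        for token_id, freq in doc
    ]


weights = [tfidf(doc) for doc in corpus]
for doc_weights in weights:
    print(doc_weights)
```

Note that frequency_i_j is read straight from the document being weighted, while D and doc_freq_i are corpus-wide statistics.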
When I inspect the attributes of model (a TfidfModel object) with dir(model), using the following code:
>>> import gensim.downloader as api
>>> from gensim.models import TfidfModel
>>> from gensim.corpora import Dictionary
>>>
>>> dataset = api.load("text8")
>>> dct = Dictionary(dataset) # fit dictionary
>>> corpus = [dct.doc2bow(line) for line in dataset] # convert dataset to BoW format
>>>
>>> model = TfidfModel(corpus) # fit model
>>> dir(model)
I get:
['__class__',
'__delattr__',
'__dict__',
'__dir__',
'__doc__',
'__eq__',
'__format__',
'__ge__',
'__getattribute__',
'__getitem__',
'__gt__',
'__hash__',
'__init__',
'__init_subclass__',
'__le__',
'__lt__',
'__module__',
'__ne__',
'__new__',
'__reduce__',
'__reduce_ex__',
'__repr__',
'__setattr__',
'__sizeof__',
'__str__',
'__subclasshook__',
'__weakref__',
'_adapt_by_suffix',
'_apply',
'_load_specials',
'_save_specials',
'_smart_save',
'dfs',
'id2word',
'idfs',
'initialize',
'load',
'normalize',
'num_docs',
'num_nnz',
'save',
'wglobal',
'wlocal']
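But I can't seem to find where the term frequencies are stored.
If the term frequencies are not saved, is there a reason? They must already have been available in order to compute the weights.
Is there a way to retrieve the term frequencies somehow during fitting?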
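Answer 0 (score: 1)
Term frequency is how often a term occurs in a document. It is a per-document quantity, so it lives in the BoW corpus rather than in the fitted model.
The BoW corpus converts each term in each document into a (tokenId, frequency) pair.
For example: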
from gensim.corpora import Dictionary
from nltk.tokenize import RegexpTokenizer
text_data = ['dog cat horse donkey', 'dog woof cat meow', 'horse horse horse', 'lion tiger wolf']
# Tokenise each string
tokeniser = RegexpTokenizer(r'\w+')
# Create list of tokens
docs = list(map(tokeniser.tokenize, text_data))
# Create Dictionary mapping tokens to their ids
dct = Dictionary(docs)
# Create BOW corpus: (tokenId, frequency) for each token in each doc
corpus = [dct.doc2bow(line) for line in docs]
Now inspect your BoW corpus to see the term frequencies:
# BOW Corpus
In [56]: corpus
Out[56]:
[[(0, 1), (1, 1), (2, 1), (3, 1)],
[(0, 1), (1, 1), (4, 1), (5, 1)],
[(3, 3)],
[(6, 1), (7, 1), (8, 1)]]
Inspect the dictionary to see the mapping from term to tokenId:
# Dictionary
In [60]: dct.token2id
Out[60]:
{u'cat': 0,
u'dog': 1,
u'donkey': 2,
u'horse': 3,
u'lion': 6,
u'meow': 4,
u'tiger': 7,
u'wolf': 8,
u'woof': 5}
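To read the term frequencies by word rather than by tokenId, the two outputs above can be combined. The following is a minimal sketch in plain Python that copies the corpus and token2id values shown above as literals; the names `id2token` and `term_freqs` are chosen here for illustration only. With a real gensim Dictionary, `dct[token_id]` should give the same id-to-token lookup.

```python
# BoW corpus and token -> id mapping, copied from the output above.
corpus = [
    [(0, 1), (1, 1), (2, 1), (3, 1)],
    [(0, 1), (1, 1), (4, 1), (5, 1)],
    [(3, 3)],
    [(6, 1), (7, 1), (8, 1)],
]
token2id = {
    'cat': 0, 'dog': 1, 'donkey': 2, 'horse': 3, 'lion': 6,
    'meow': 4, 'tiger': 7, 'wolf': 8, 'woof': 5,
}

# Invert the mapping so that ids can be turned back into tokens.
id2token = {token_id: token for token, token_id in token2id.items()}

# One {token: term_frequency} dict per document.
term_freqs = [
    {id2token[token_id]: freq for token_id, freq in doc}
    for doc in corpus
]

for doc_freqs in term_freqs:
    print(doc_freqs)
```

The third document, 'horse horse horse', comes out as a single entry mapping 'horse' to 3, which is the term frequency the question was looking for.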