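I want to study how the TF-IDF score of particular words in a document depends on the number of documents the IDF is computed from. Unfortunately, the result lists I get back vary in length, even though the number of words in each document is fixed. How can I get TF-IDF results for all words in a document, regardless of how many documents the model was built on?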
from gensim import corpora
from gensim import models
from gensim.models import TfidfModel
# Suppose I have a list of tokenized documents whose words are randomly distributed:
docs = [
['dog', 'cat', 'panda', 'deer', 'dog', 'elephant', 'panda', 'mouse', 'dog', 'panda', 'dog', 'python', 'penguin', 'lion', 'mouse'],
['cat', 'panda', 'rhino', 'lynx', 'panda', 'panda', 'panda', 'koala', 'mammoth', 'hamster', 'cat', 'koala', 'bear', 'fright'],
['dog', 'cat', 'elephant', 'panda', 'deer', 'deer', 'baloonfish', 'pig', 'owl', 'dove', 'camel', 'camel', 'camel'],
['dog', 'panda', 'mammoth', 'snake', 'lizard', 'elephant', 'partridge', 'alpaca', 'dog', 'dog', 'lizard', 'dog'],
['dog', 'owl', 'ostrich', 'porcupine', 'mouse', 'baloonfish', 'croc', 'lion', 'chimp', 'camel', 'doe']
]
# Each document has a certain number of tokens and unique types:
print([len(doc) for doc in docs]) # [15, 14, 13, 12, 11]
print([len(set(doc)) for doc in docs]) # [9, 9, 10, 8, 11]
# I create a dictionary that has 30 unique tokens...
dictionary = corpora.Dictionary(docs) # ['cat', 'deer', 'dog', 'elephant', 'lion', ...]
# ...and a bag-of-words corpus with one entry for each of the 5 documents
corpus = [dictionary.doc2bow(doc) for doc in docs] # [[(0, 1), (1, 1), (2, 4), (3, 1), ...], ...]
# Now I train a TF-IDF model on the whole corpus, apply it to every document, and check the length of each result:
model = TfidfModel(corpus)
vector = model[corpus]
print([len(v) for v in vector]) # [9, 9, 10, 8, 11]
# Now I train the TF-IDF model on only the first four documents in the corpus:
new_model = TfidfModel(corpus[:4])
new_vector = new_model[corpus]
print([len(v) for v in new_vector]) # [8, 8, 9, 7, 6]
# based on first three:
new_model = TfidfModel(corpus[:3])
new_vector = new_model[corpus]
print([len(v) for v in new_vector]) # [7, 7, 8, 3, 6]
# based on first two:
new_model = TfidfModel(corpus[:2])
new_vector = new_model[corpus]
print([len(v) for v in new_vector]) # [7, 7, 3, 3, 3]
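
A minimal sketch of one possible workaround, assuming the entries missing from each result are terms whose weight came out as zero (for example a term that occurs in every training document, or one the model never saw during training). `pad_tfidf` is a hypothetical helper, not part of gensim. It only relies on the `(term_id, value)` pair format shown above, and the sample data is a stand-in so the sketch runs without gensim.

```python
def pad_tfidf(bow, tfidf_vec):
    """Return one (term_id, weight) pair for every term id in `bow`.

    Assumption: a term id that is absent from `tfidf_vec` has weight 0.0.
    """
    weights = dict(tfidf_vec)
    return [(term_id, weights.get(term_id, 0.0)) for term_id, _count in bow]


# Stand-in data in the same (id, value) shape as the corpus above.
bow = [(0, 1), (1, 1), (2, 4), (3, 1)]   # bag-of-words for one document
sparse = [(1, 0.5), (2, 0.8)]            # hypothetical sparse result: ids 0 and 3 are missing

padded = pad_tfidf(bow, sparse)
print(padded)       # [(0, 0.0), (1, 0.5), (2, 0.8), (3, 0.0)]
print(len(padded))  # 4, the same as len(bow)
```

Applied to the objects above, something like `[len(pad_tfidf(bow, new_model[bow])) for bow in corpus]` should then match the number of unique types per document, whichever slice of the corpus the model was trained on.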