I want to study how the TF-IDF score of a particular word in a document depends on the number of documents the IDF is based on. Unfortunately, the result lists I get vary in length, even though the number of words in each document is fixed. How can I obtain TF-IDF results for all words in a document, regardless of how many documents the model was trained on?
I compute TF-IDF scores with the Gensim library. Here is my approach:
from gensim import corpora
from gensim import models
from gensim.models import TfidfModel
# Suppose I have a list of words with a random distribution:
docs = [
['dog', 'cat', 'panda', 'deer', 'dog', 'elephant', 'panda', 'mouse', 'dog', 'panda', 'dog', 'python', 'penguin', 'lion', 'mouse'],
['cat', 'panda', 'rhino', 'lynx', 'panda', 'panda', 'panda', 'koala', 'mammoth', 'hamster', 'cat', 'koala', 'bear', 'fright'],
['dog', 'cat', 'elephant', 'panda', 'deer', 'deer', 'baloonfish', 'pig', 'owl', 'dove', 'camel', 'camel', 'camel'],
['dog', 'panda', 'mammoth', 'snake', 'lizard', 'elephant', 'partridge', 'alpaca', 'dog', 'dog', 'lizard', 'dog'],
['dog', 'owl', 'ostrich', 'porcupine', 'mouse', 'baloonfish', 'croc', 'lion', 'chimp', 'camel', 'doe']
]
# Each document has a certain number of tokens and unique types:
print([len(doc) for doc in docs]) # [15, 14, 13, 12, 11]
print([len(set(doc)) for doc in docs]) # [9, 9, 10, 8, 11]
# I create a dictionary that has 30 unique tokens...
dictionary = corpora.Dictionary(docs) # ['cat', 'deer', 'dog', 'elephant', 'lion', ...]
# ...and a corpus containing bag-of-words representations of the 5 documents
corpus = [dictionary.doc2bow(doc) for doc in docs] # [[(0, 1), (1, 1), (2, 4), (3, 1), ...], ...]
# now I train a TF-IDF model, apply it to every corpus document, and check the vector lengths:
model = TfidfModel(corpus)
vector = model[corpus]
print([len(v) for v in vector]) # [9, 9, 10, 8, 11]
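For what it's worth, these lengths can be reproduced without Gensim. The sketch below re-implements what I believe is Gensim's default IDF weighting (idf = log2(N/df), with zero-weight entries dropped from the result); the variable names (`df`, `lengths`) are my own:

```python
import math

docs = [
    ['dog', 'cat', 'panda', 'deer', 'dog', 'elephant', 'panda', 'mouse', 'dog', 'panda', 'dog', 'python', 'penguin', 'lion', 'mouse'],
    ['cat', 'panda', 'rhino', 'lynx', 'panda', 'panda', 'panda', 'koala', 'mammoth', 'hamster', 'cat', 'koala', 'bear', 'fright'],
    ['dog', 'cat', 'elephant', 'panda', 'deer', 'deer', 'baloonfish', 'pig', 'owl', 'dove', 'camel', 'camel', 'camel'],
    ['dog', 'panda', 'mammoth', 'snake', 'lizard', 'elephant', 'partridge', 'alpaca', 'dog', 'dog', 'lizard', 'dog'],
    ['dog', 'owl', 'ostrich', 'porcupine', 'mouse', 'baloonfish', 'croc', 'lion', 'chimp', 'camel', 'doe']
]

N = len(docs)
# document frequency: in how many documents each token occurs
df = {}
for doc in docs:
    for tok in set(doc):
        df[tok] = df.get(tok, 0) + 1

# a token keeps a nonzero weight only if idf = log2(N / df) != 0,
# i.e. only if it does NOT occur in every document
lengths = [len([t for t in set(doc) if math.log2(N / df[t]) != 0])
           for doc in docs]
print(lengths)  # [9, 9, 10, 8, 11]
```

Since no token here occurs in all five documents, every unique token keeps a nonzero weight and the lengths equal the unique-token counts.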
So far so good, but now I want to compare these results with those obtained when the model is built on fewer documents. To do that, I run the following:
# now I train a TF-IDF model on only the first four documents of the corpus:
new_model = TfidfModel(corpus[:4])
new_vector = new_model[corpus]
print([len(v) for v in new_vector]) # [8, 8, 9, 7, 6]
# based on first three:
new_model = TfidfModel(corpus[:3])
new_vector = new_model[corpus]
print([len(v) for v in new_vector]) # [7, 7, 8, 3, 6]
# based on first two:
new_model = TfidfModel(corpus[:2])
new_vector = new_model[corpus]
print([len(v) for v in new_vector]) # [7, 7, 3, 3, 3]
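For reference, the shrinking lengths can also be reproduced outside Gensim. The pure-Python sketch below (again assuming the default idf = log2(N/df); the helper `tfidf_lengths` is my own) counts, for each document, the tokens that would keep a nonzero weight: a token drops out either because it occurs in every training document (idf = 0) or because it never occurs in the training slice, so the model has no IDF entry for it at all:

```python
import math

docs = [
    ['dog', 'cat', 'panda', 'deer', 'dog', 'elephant', 'panda', 'mouse', 'dog', 'panda', 'dog', 'python', 'penguin', 'lion', 'mouse'],
    ['cat', 'panda', 'rhino', 'lynx', 'panda', 'panda', 'panda', 'koala', 'mammoth', 'hamster', 'cat', 'koala', 'bear', 'fright'],
    ['dog', 'cat', 'elephant', 'panda', 'deer', 'deer', 'baloonfish', 'pig', 'owl', 'dove', 'camel', 'camel', 'camel'],
    ['dog', 'panda', 'mammoth', 'snake', 'lizard', 'elephant', 'partridge', 'alpaca', 'dog', 'dog', 'lizard', 'dog'],
    ['dog', 'owl', 'ostrich', 'porcupine', 'mouse', 'baloonfish', 'croc', 'lion', 'chimp', 'camel', 'doe']
]

def tfidf_lengths(train, full):
    """For each document in `full`, count the tokens that would get a
    nonzero weight from a model trained on `train` (idf = log2(N/df))."""
    N = len(train)
    df = {}
    for doc in train:
        for tok in set(doc):
            df[tok] = df.get(tok, 0) + 1
    return [len([t for t in set(doc)
                 # token must be known to the model AND have nonzero idf
                 if t in df and math.log2(N / df[t]) != 0])
            for doc in full]

print(tfidf_lengths(docs[:4], docs))  # [8, 8, 9, 7, 6]
print(tfidf_lengths(docs[:3], docs))  # [7, 7, 8, 3, 6]
print(tfidf_lengths(docs[:2], docs))  # [7, 7, 3, 3, 3]
```

For example, with the first four documents as training data, 'panda' occurs in all four of them, so its idf is log2(4/4) = 0 and it vanishes from every vector.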
Can someone explain to me why the number of results keeps decreasing? For example, the first document has 9 unique tokens, yet when the model is trained on fewer documents that number suddenly drops to 8, 7, and so on, even though the document itself contains a fixed number of tokens. Why aren't all of them included in the results, and how can I include them? Maybe I'm doing something wrong... Thanks for your help.