gensim keeps trimming the vocabulary

Asked: 2017-07-24 20:13:05

Tags: python gensim

I have a large dataset that I'm trying to run a Word2Vec model on, but the vocabulary keeps getting trimmed down to 28.

>>> model = gensim.models.Word2Vec(sentences=sentences, window=5, min_count=1, trim_rule=None, workers=4, sg=0, hs=1)
>>> len(model.wv.vocab)
28

I have tried different constructor settings, but the result is always the same.

My dataset consists of machine logs:

wc eventlog_dataset
  4421775 124189284 978608310 eventlog_dataset

I previously ran a tf-idf model on the same dataset, and I'm certain it contains roughly 100k unique words.
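A rough count of the unique tokens, assuming the same lowercase/whitespace-split preprocessing that feeds the model, can verify this independently of gensim:

from collections import Counter

counts = Counter()
with open('/home/veselin/eventlog_dataset') as f:
    for line in f:
        counts.update(line.lower().split())

print(len(counts))  # number of unique tokens, should be ~100k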

When I use a different dataset with gensim I don't have this problem, so I know for certain the problem is my dataset, but I don't know why...

Here is a sample:

2017-05-16 10:55:58.91 CDT     3 61617032 Notification    Minor           Command error   sw_cli     {user super all {{0 8}} -1 10.0.188.216 3136} {Command: getfs  Error: Error: File Services is not configured on this array.} {}
2017-05-16 10:55:32.58 CDT     3 61616917 Notification    Minor           Command error   sw_cli     {user super all {{0 8}} -1 10.0.51.11 3727} {Command: getcage -e cage12 Error:    Opcode         = SCCMD_DOCDB    Node           = 253    Tpd error code = TE_INVALID          -- Invalid input parameter    Tpd error info = Cage (cage12) does not support this function } {}

According to the gensim documentation, trim_rule=None with min_count=1 should leave the vocabulary intact.

Has anyone encountered a problem like this with their dataset before?

EDIT

Here is the code:

import gensim

class FileToSent(object):
    def __init__(self, filename):
        self.filename = filename

    def __iter__(self):
        for line in open(self.filename, 'r'):
            ll = [i for i in unicode(line, 'utf-8').lower().split()]
            print ll
        yield ll

sentences = FileToSent('/home/veselin/eventlog_dataset')
model = gensim.models.Word2Vec(sentences=sentences, window=5, min_count=2, workers=4, hs=1)

Here is the output for the first line:

/usr/bin/python2.7 /home/veselin/PycharmProjects/test/word2vec.py
[u'2016-10-16', u'17:55:19.55', u'cest', u'1', u'1788217', u'notification', u'minor', u'cli', u'command', u'error', u'sw_cli', u'{3parsvc', u'super', u'all', u'{{0', u'8}}', u'-1', u'172.16.24.110', u'12539}', u'{command:', u'getsralertcrit', u'all', u'error:', u'this', u'system', u'is', u'not', u'licensed', u'for', u'system', u'reporter', u'features}', u'{}']

You can see that words like cli, system, or license are not included in the vocabulary.

INFO logging (on the full dataset):

/usr/bin/python2.7 /home/veselin/PycharmProjects/test/word2vec.py
2017-07-28 11:32:56,966 : INFO : collecting all words and their counts
2017-07-28 11:33:35,580 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2017-07-28 11:33:35,582 : INFO : collected 28 word types from a corpus of 29 raw words and 1 sentences
2017-07-28 11:33:35,582 : INFO : Loading a fresh vocabulary
2017-07-28 11:33:35,582 : INFO : min_count=2 retains 1 unique words (3% of original 28, drops 27)
2017-07-28 11:33:35,582 : INFO : min_count=2 leaves 2 word corpus (6% of original 29, drops 27)
2017-07-28 11:33:35,583 : INFO : deleting the raw counts dictionary of 28 items
2017-07-28 11:33:35,584 : INFO : sample=0.001 downsamples 1 most-common words
2017-07-28 11:33:35,584 : INFO : downsampling leaves estimated 0 word corpus (3.3% of prior 2)
2017-07-28 11:33:35,584 : INFO : estimated required memory for 1 words and 100 dimensions: 1900 bytes
2017-07-28 11:33:35,584 : INFO : constructing a huffman tree from 1 words
2017-07-28 11:33:35,585 : INFO : built huffman tree with maximum node depth 0
2017-07-28 11:33:35,585 : INFO : resetting layer weights
2017-07-28 11:33:35,585 : INFO : training model with 4 workers on 1 vocabulary and 100 features, using sg=0 hs=1 sample=0.001 negative=5 window=5
2017-07-28 11:36:43,871 : INFO : PROGRESS: at 100.00% examples, 0 words/s, in_qsize 2, out_qsize 2
2017-07-28 11:36:43,872 : INFO : worker thread finished; awaiting finish of 3 more threads
2017-07-28 11:36:43,873 : INFO : worker thread finished; awaiting finish of 2 more threads
2017-07-28 11:36:43,873 : INFO : worker thread finished; awaiting finish of 1 more threads
2017-07-28 11:36:43,873 : INFO : worker thread finished; awaiting finish of 0 more threads
2017-07-28 11:36:43,873 : INFO : training on 145 raw words (0 effective words) took 188.3s, 0 effective words/s
2017-07-28 11:36:43,873 : WARNING : under 10 jobs per worker: consider setting a smaller `batch_words' for smoother alpha decay

Process finished with exit code 0

2 Answers:

Answer 0 (score: 0)

Have you looked at the vocabulary to see which 'words' it has retained?

Try evaluating/printing:

model.wv.index2word

Once you see what all the 'words' have in common, go back and check the sentences corpus you are supplying.

Is each individual item (sentence) a list of tokens, or a list of characters? Word2Vec expects the former: text that has already been tokenized, not raw strings.
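A quick toy comparison (hypothetical sentences, not from your corpus) makes the difference visible:

from gensim.models import Word2Vec

# Word2Vec expects each sentence to be a list of string tokens.
tokenized = [["command", "error", "sw_cli"], ["notification", "minor", "error"]]

# If raw strings are passed instead, gensim iterates each string character
# by character, so the "vocabulary" becomes the set of distinct characters.
raw_strings = ["command error sw_cli", "notification minor error"]

print(len(Word2Vec(tokenized, min_count=1).wv.vocab))    # 5 unique tokens
print(len(Word2Vec(raw_strings, min_count=1).wv.vocab))  # distinct characters only

A suspiciously small vocabulary, such as a few dozen entries, is the typical symptom of the character case.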

Answer 1 (score: 0)

It's been a while since this was asked, but I hope this can still help someone.

If I understand your question correctly, you want all the words in your file to be included in the w2v vocabulary. If so, you should define a trim_rule that keeps all words and pass it to the build_vocab function.

Here is an example:

from gensim.models import Word2Vec
from gensim.utils import RULE_KEEP

documents_list = [["first", "document"], ["second", "document"]]

def _rule(word, count, min_count):  # all three parameters are required by the trim_rule interface
    return RULE_KEEP

model = Word2Vec()
# model = Word2Vec.load("path_to_your_pretrained_model") # if you are using a pre-trained w2v model

model.build_vocab(documents_list, trim_rule=_rule) # use update=True if the model has already been trained e.g. pre-trained models

print(model.wv.vocab) 
# should print something like 
# {'first':<gensim.models.keyedvectors.Vocab at 0x17a34271b88>,
#  'document':<gensim.models.keyedvectors.Vocab at 0x17a31737e88>,
#  'second': <gensim.models.keyedvectors.Vocab at 0x17a31737ec8>}

model.train(documents_list, total_examples=len(documents_list), epochs=model.epochs)
print(model.most_similar("first"))
# should give
# [('second', 0.026407353579998016), ('document', -0.04318903386592865)]
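Note that because _rule returns RULE_KEEP unconditionally, it overrides min_count entirely, so every word the corpus iterator produces is retained; on a corpus with ~100k unique words the resulting model will be correspondingly large.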

Documentation for the build_vocab function (see the trim_rule and update parameters): here