Text processing - Word2Vec training after phrase detection (bigram model)

Time: 2017-09-26 08:47:19

Tags: python text-processing gensim word2vec python-textprocessing

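I want to build a word2vec model that contains n-grams, not just single words. As far as I can tell, the Phrases class in gensim.models.phrases can find the phrases I want, and it is possible to apply Phrases to a corpus and pass the resulting output to word2vec's training function.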

First, I did something like the following, based on the example code in the gensim documentation.

import os

import gensim
from nltk.tokenize import word_tokenize  # assumed source of word_tokenize

class MySentences(object):
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for fname in os.listdir(self.dirname):
            for line in open(os.path.join(self.dirname, fname)):
                yield word_tokenize(line)

sentences = MySentences('sentences_directory')

bigram = gensim.models.Phrases(sentences)

model = gensim.models.Word2Vec(bigram['sentences'], size=300, window=5, workers=8)
The model is created, but evaluation gives no good results, and I get this warning:

WARNING : train() called with an empty iterator (if not intended, be sure to provide a corpus that offers restartable iteration = an iterable)
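
For context on the warning's wording: a generator can be walked only once, while an object with an `__iter__()` method hands out a fresh iterator on every pass. Word2Vec goes over the corpus more than once (a vocabulary scan, then the training passes), so it needs the second kind. A minimal sketch in plain Python, where the in-memory `lines` list is a hypothetical stand-in for the files on disk:

```python
def sentence_generator(lines):
    """Generator: can be consumed exactly once."""
    for line in lines:
        yield line.split()


class RestartableSentences(object):
    """Iterable: every call to __iter__() starts a fresh pass."""

    def __init__(self, lines):
        self.lines = lines

    def __iter__(self):
        for line in self.lines:
            yield line.split()


# Hypothetical toy corpus, used only for illustration.
lines = ["new york is big", "i love new york"]

gen = sentence_generator(lines)
gen_first_pass = list(gen)
gen_second_pass = list(gen)      # the generator is already exhausted

restartable = RestartableSentences(lines)
first_pass = list(restartable)
second_pass = list(restartable)  # a fresh iterator, so the data is seen again

print(len(gen_first_pass), len(gen_second_pass))
print(len(first_pass), len(second_pass))
```

The second pass over the generator comes back empty, which is the situation the warning describes.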

I searched for it, found https://groups.google.com/forum/#!topic/gensim/XWQ8fPMFSi0 and changed my code:

import os

import gensim
from nltk.tokenize import word_tokenize  # assumed source of word_tokenize

class MySentences(object):
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for fname in os.listdir(self.dirname):
            for line in open(os.path.join(self.dirname, fname)):
                yield word_tokenize(line)

class PhraseItertor(object):
    def __init__(self, my_phraser, data):
        self.my_phraser, self.data = my_phraser, data

    def __iter__(self):
        yield self.my_phraser[self.data]


sentences = MySentences('sentences_directory')

bigram_transformer = gensim.models.Phrases(sentences)

bigram = gensim.models.phrases.Phraser(bigram_transformer)

corpus = PhraseItertor(bigram, sentences)

model = gensim.models.Word2Vec(corpus, size=300, window=5, workers=8)

I get this error:

Traceback (most recent call last):
  File "/home/fatemeh/Desktop/Thesis/bigramModeler.py", line 36, in <module>
    model = gensim.models.Word2Vec(corpus, size=300, window=5, workers=8)
  File "/home/fatemeh/.local/lib/python3.4/site-packages/gensim/models/word2vec.py", line 478, in __init__
    self.build_vocab(sentences, trim_rule=trim_rule)
  File "/home/fatemeh/.local/lib/python3.4/site-packages/gensim/models/word2vec.py", line 553, in build_vocab
    self.scan_vocab(sentences, progress_per=progress_per, trim_rule=trim_rule)  # initial survey
  File "/home/fatemeh/.local/lib/python3.4/site-packages/gensim/models/word2vec.py", line 575, in scan_vocab
    vocab[word] += 1
TypeError: unhashable type: 'list'

Now I would like to know what is wrong with my code.

1 answer:

Answer 0: (score: 0)

I asked my question in the Gensim Google Group, and Mr. Gordon Mohr answered me:

  

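You generally would not want an __iter__() method to do a single yield. It should return an iterator object (one that is ready to return multiple objects via next(), or to raise a StopIteration exception). One way to implement an iterator is to use yield so that the method is treated as a 'generator', but that usually requires the yield to be inside a loop.

But I now see that my example code in the thread you reference gets its __iter__() return line wrong: it should not return the raw phrasifier, but one that has already been started as an iterator, using the iter() built-in function. That is, the example should have been: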

class PhrasingIterable(object):
    def __init__(self, phrasifier, texts):
        self.phrasifier, self.texts = phrasifier, texts

    def __iter__(self):
        # Hand out a fresh iterator over the phrase-transformed texts on every pass.
        return iter(self.phrasifier[self.texts])
     

Making a similar change to your variation may resolve the TypeError: iter() returned non-iterator of type 'TransformedCorpus' error.
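
To illustrate the difference the answer describes, here is a minimal sketch that does not use gensim. `ToyPhraser` is a hypothetical stand-in for a trained phraser (it only joins "new york"), and `count_vocab` is a simplified imitation of a vocabulary scan, not gensim's actual implementation. With a single `yield`, the only "sentence" produced is the entire transformed corpus, so each "word" is a list of tokens and cannot be used as a dictionary key.

```python
from collections import defaultdict


class ToyPhraser(object):
    """Hypothetical stand-in for a trained phraser: joins 'new york'."""

    def __getitem__(self, sentences):
        # Applied to a whole corpus, return all transformed sentences.
        return [self._join(tokens) for tokens in sentences]

    @staticmethod
    def _join(tokens):
        out, i = [], 0
        while i < len(tokens):
            if tokens[i:i + 2] == ["new", "york"]:
                out.append("new_york")
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        return out


class SingleYield(object):
    """The problematic variant: yields the whole corpus as one item."""

    def __init__(self, phraser, texts):
        self.phraser, self.texts = phraser, texts

    def __iter__(self):
        yield self.phraser[self.texts]


class PhrasingIterable(object):
    """The corrected variant: returns an iterator over the sentences."""

    def __init__(self, phraser, texts):
        self.phraser, self.texts = phraser, texts

    def __iter__(self):
        return iter(self.phraser[self.texts])


def count_vocab(corpus):
    """Simplified vocabulary scan: count every word of every sentence."""
    vocab = defaultdict(int)
    for sentence in corpus:
        for word in sentence:
            vocab[word] += 1
    return dict(vocab)


texts = [["new", "york", "is", "big"], ["i", "love", "new", "york"]]
phraser = ToyPhraser()

try:
    count_vocab(SingleYield(phraser, texts))
    single_yield_error = None
except TypeError as err:
    single_yield_error = str(err)  # each "word" was a list of tokens

counts = count_vocab(PhrasingIterable(phraser, texts))
print(single_yield_error)
print(counts)
```

Because `PhrasingIterable.__iter__()` builds a new iterator on each call, the corpus can also be walked repeatedly, which is what training needs.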