Trying to use DeepDist to run gensim word2vec with pyspark

Time: 2016-07-13 05:42:34

Tags: python pyspark gensim word2vec

from deepdist import DeepDist
from gensim.models.word2vec import Word2Vec
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
     .setAppName("Work2Vec")
)

sc = SparkContext(conf=conf)
corpus = sc.textFile('AllText.txt').map(lambda s: s.split())

def gradient(model, sentences):
    syn0, syn1 = model.syn0.copy(), model.syn1.copy()   # previous weights
    model.train(sentences)
    return {'syn0': model.syn0 - syn01, 'syn1': model.syn1 - syn1}

def descent(model, update):
    model.syn0 += update['syn0']
    model.syn1 += update['syn1']

with DeepDist(Word2Vec(corpus.collect())) as dd:
    dd.train(corpus, gradient, descent)
    dd.model.save("Model")

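Please help. I have 56 GB of text and want to build a word2vec model, but using gensim alone is very slow, so I tried DeepDist with the example code I found online. Has anyone seen this kind of error before?

Output when running this script: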

script output

1 answer:

Answer 0: (score: 0)

Note that the code you copied and pasted contains a typo, which is corrected by this pull request: https://github.com/dirkneumann/deepdist/pull/1
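For illustration: the only name in the question's code that is never defined is syn01 in the return line of gradient, so calling that function raises a NameError. Assuming this is the typo the pull request addresses (not verified here), the sketch below shows the function with syn0 in its place. Vec and ToyModel are hypothetical stand-ins for a NumPy weight array and gensim's Word2Vec, so the snippet runs without Spark, gensim, or DeepDist; only gradient and descent mirror the question.

```python
class Vec(list):
    """Hypothetical stand-in for a NumPy weight array (copy, -, += only)."""
    def copy(self):
        return Vec(self)

    def __sub__(self, other):
        return Vec(a - b for a, b in zip(self, other))

    def __iadd__(self, other):
        for i, b in enumerate(other):
            self[i] += b
        return self


class ToyModel:
    """Hypothetical stand-in for Word2Vec; train() just nudges the weights."""
    def __init__(self):
        self.syn0 = Vec([0.0, 0.0])
        self.syn1 = Vec([1.0, 1.0])

    def train(self, sentences):
        n = sum(len(s) for s in sentences)   # number of tokens seen
        self.syn0 += Vec([0.5 * n] * 2)
        self.syn1 += Vec([0.25 * n] * 2)


def gradient(model, sentences):
    syn0, syn1 = model.syn0.copy(), model.syn1.copy()   # previous weights
    model.train(sentences)
    # syn0 here, not the undefined name syn01
    return {'syn0': model.syn0 - syn0, 'syn1': model.syn1 - syn1}


def descent(model, update):
    model.syn0 += update['syn0']
    model.syn1 += update['syn1']


# One worker computes a weight delta; the master copy applies it.
worker, master = ToyModel(), ToyModel()
update = gradient(worker, [["a", "b"], ["c"]])
descent(master, update)
```

With the real libraries, the same two functions would be passed to dd.train(corpus, gradient, descent) exactly as in the question.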