Question

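I'm still a beginner with neural networks and NLP. In this code I'm training skip-gram on cleaned text (some tweets), but I don't know whether I'm doing it correctly. Can anyone tell me whether this skip-gram training is correct? Any help is appreciated.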

Here is my code:

import numpy as np
import pandas as pd
from nltk import word_tokenize

from gensim.models.phrases import Phrases, Phraser

sent = [row.split() for row in X['clean_text']]

phrases = Phrases(sent, max_vocab_size = 50, progress_per=10000)

bigram = Phraser(phrases)

sentences = bigram[sent]

from gensim.models import Word2Vec

w2v_model = Word2Vec(window=5,
                     size = 300,
                     sg=1)

w2v_model.build_vocab(sentences)


w2v_model.train(sentences, total_examples=w2v_model.corpus_count, epochs=25)


del sentences #to reduce memory usage


def get_mat(model, corpus, size):

    vecs = np.zeros((len(corpus), size))

    n = 0

    for i in corpus.index:
        vecs[i] = np.zeros(size).reshape((1, size))
        for word in str(corpus.iloc[i,0]).split():
            try:
                vecs[i] += model[word]
                #n += 1
            except KeyError:
                continue

    return vecs

X_sg = get_mat(w2v_model, X, 300)

del X

X_sg=pd.DataFrame(X_sg)
X_sg.head()
from sklearn import preprocessing
scale = preprocessing.normalize
X_sg=scale(X_sg)

for i in range(len(X_sg)):
    X_sg[i]+=1 #I did this because some weights were negative! So could not 
               #apply LSTM on them later

Answer 1

You haven't mentioned whether you're getting any errors or unsatisfactory results, so it's hard to know what kind of help you need.

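The specific lines of your code that involve the Word2Vec model are roughly correct: plausibly useful parameters (if your dataset is large enough to train 300-dimensional vectors) and the proper steps. So the real proof is whether your results are acceptable.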
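To make the role of sg=1 and window concrete, here is a minimal pure-Python sketch of the (center, context) pairs that skip-gram training draws from one tokenized sentence. The helper name skipgram_pairs and the example sentence are hypothetical, and this is a simplification for illustration, not gensim's implementation (gensim, for instance, also shrinks the effective window at random during training).

```python
def skipgram_pairs(tokens, window=5):
    """Return (center, context) pairs for one tokenized sentence.

    Illustrative only: each word is paired with every other word
    that lies at most `window` positions away from it.
    """
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


# Hypothetical example sentence, already split into tokens.
example = "the cat sat on the mat".split()
pairs = skipgram_pairs(example, window=2)
for center, context in pairs[:5]:
    print(center, "->", context)
```

Skip-gram learns vectors by predicting each context word from its center word, so a larger window yields more pairs per sentence and broader, more topical contexts.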

Regarding your attempt to use Phrases bigram creation beforehand:

You should get everything working, with promising results, before adding this extra preprocessing complexity.
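The parameter max_vocab_size=50 is seriously misguided and likely makes the phrases step pointless. max_vocab_size is a hard cap on how many words/bigrams the class tallies, as a way to limit its memory usage. (Whenever the number of known words/bigrams reaches this cap, many lower-frequency words/bigrams are pruned; in practice, a majority of all words/bigrams are discarded at each pruning, giving up a lot of accuracy in exchange for capped memory usage.) The max_vocab_size default in gensim is 40,000,000, but the default in Google's word2phrase.c source, on which the gensim method is based, is 500,000,000. With a value of just 50, it isn't really going to learn anything useful about whichever 50 words/bigrams survive the many prunings.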
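The effect of a tiny cap can be sketched with a small pure-Python simulation. This is not gensim's code: the helper count_with_cap, the toy sentences, and the exact pruning rule (drop every entry whose count is at or below a threshold that rises after each pruning) are assumptions chosen to mirror the behaviour described above.

```python
from collections import Counter


def count_with_cap(sentences, max_vocab_size):
    """Tally words and adjacent-word bigrams, pruning when over the cap.

    Simplified stand-in for a capped phrase counter (assumed rule):
    whenever the table grows past max_vocab_size, every entry with
    count <= min_reduce is dropped and min_reduce is raised by one.
    """
    vocab = Counter()
    min_reduce = 1
    for sent in sentences:
        for i, word in enumerate(sent):
            vocab[word] += 1
            if i + 1 < len(sent):
                vocab[word + "_" + sent[i + 1]] += 1
        if len(vocab) > max_vocab_size:
            for key in [k for k, c in vocab.items() if c <= min_reduce]:
                del vocab[key]
            min_reduce += 1
    return vocab


# Toy corpus (hypothetical): "new york" occurs in every sentence.
toy_sentences = [
    ["new", "york", "is", "big"],
    ["i", "love", "new", "york"],
    ["new", "york", "never", "sleeps"],
    ["we", "visited", "new", "york"],
]

capped = count_with_cap(toy_sentences, max_vocab_size=5)
uncapped = count_with_cap(toy_sentences, max_vocab_size=1000)
print("capped:  ", capped.get("new_york", 0))
print("uncapped:", uncapped.get("new_york", 0))
```

In this toy run the capped counter keeps wiping its table before "new_york" can accumulate a meaningful count, while the uncapped counter sees all four occurrences. With real data, leaving max_vocab_size at its default avoids the problem.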

Regarding your get_mat() function and the later DataFrame code, I have no idea what you intend to do with it, so I can't offer any opinion on it.
