Question

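I'm new to Word2Vec and I'm trying to cluster words based on their similarity. First, I use nltk to split the text into sentences, and then I use the resulting list of sentences as the input to Word2Vec. However, when I print the vocabulary, it is just a bunch of letters, numbers, and symbols rather than words. To be specific, one of the entries looks like ", 'L':"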

# imports needed and logging
import gensim
from gensim.models import word2vec
import logging

import nltk
#nltk.download('punkt')
#nltk.download('averaged_perceptron_tagger')
with open('C:\\Users\\Freddy\\Desktop\\Thesis\\Descriptions.txt','r') as f_open:
    text = f_open.read()
arr = []

sentences = nltk.sent_tokenize(text) # this gives a list of sentences

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',level=logging.INFO)

model = word2vec.Word2Vec(sentences, size = 300)

print(model.wv.vocab)

Answer 1

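As the tutorial and the documentation for the Word2Vec class suggest, the constructor of the class expects a list of lists of words as its first parameter (or, more generally, an iterable of iterables of words):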

sentences (iterable of iterables, optional) – The sentences iterable can be simply a list of lists of tokens, but for larger corpora, ...

I believe that before feeding sentences into Word2Vec you need to call word_tokenize on each sentence, changing the crucial line to:

sentences = [nltk.word_tokenize(sent) for sent in nltk.sent_tokenize(text)]
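A small guard can catch this mistake before any training happens. The sketch below is only an illustration: ensure_tokenized is a hypothetical helper (it is not part of gensim or nltk), and the sample sentences are made up. It rejects a corpus whose sentences are still bare strings and passes lists of tokens through.

```python
def ensure_tokenized(sentences):
    """Return the corpus as a list of token lists.

    Hypothetical helper, not part of gensim or nltk. Raises TypeError if any
    sentence is still a bare string, because iterating over a string yields
    single characters rather than words.
    """
    checked = []
    for i, sent in enumerate(sentences):
        if isinstance(sent, str):
            raise TypeError(
                f"sentence {i} is a string; tokenize it first, "
                "e.g. with nltk.word_tokenize"
            )
        checked.append(list(sent))
    return checked


# Raw sentence strings (the shape nltk.sent_tokenize returns) are rejected.
try:
    ensure_tokenized(["Lions roar loudly.", "Cats purr."])
    rejected = False
except TypeError as err:
    rejected = True
    print(err)

# Lists of tokens pass through unchanged.
corpus = ensure_tokenized([["Lions", "roar", "loudly", "."], ["Cats", "purr", "."]])
print(corpus)
```

The result of such a check is what you would then hand to Word2Vec in place of the raw list of sentence strings.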

TL;DR

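You get letters as your "words" because Word2Vec treats each string that corresponds to a sentence as an iterable containing words. Iterating over a string yields a sequence of characters, so those characters (instead of the words you intended) become the vocabulary the model learns from.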
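The effect can be reproduced without Word2Vec at all. The following sketch uses plain Python and made-up sentences, with str.split standing in for a real tokenizer (an assumption made only to keep the example dependency-free). It collects the "tokens" that any consumer expecting lists of words would see in each case.

```python
# Two "sentences" as plain strings, the shape nltk.sent_tokenize returns.
sentences_as_strings = ["Lions roar loudly.", "Cats purr."]

# Iterating over a string yields its characters, so the "vocabulary"
# built from this input consists of single letters, spaces, and symbols.
tokens_seen = sorted({tok for sent in sentences_as_strings for tok in sent})

# The same sentences as lists of tokens. str.split is only a stand-in
# for a real tokenizer such as nltk.word_tokenize.
sentences_as_tokens = [sent.split() for sent in sentences_as_strings]
words_seen = sorted({tok for sent in sentences_as_tokens for tok in sent})

print(tokens_seen)  # single characters, including 'L'
print(words_seen)   # whole words
```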

As the old saying goes: garbage in, garbage out.

Word2Vec vocab results in just letters and symbols
