My goal is to use word2vec to find the words most relevant to a given set of keywords. For example, if I have the set of words [girl, kite, beach], I would like word2vec to output related words such as [flying, swimming, swimsuit, ...].
From what I understand, word2vec vectorizes a word based on the context of its surrounding words. So what I did was use the following function:
most_similar_cosmul([girl, kite, beach])
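Concretely, the call looks roughly like this (gensim 3.x API, matching the training code below; the checkpoint path is a placeholder):

from gensim.models import word2vec as w2v

# Load a previously trained model (path is illustrative)
model = w2v.Word2Vec.load("checkpoints/data.wv.ckpt")

# Rank vocabulary words by their combined similarity to the query words
print(model.wv.most_similar_cosmul(positive=["girl", "kite", "beach"], topn=10))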
However, it seems to give words that are not very relevant to the keyword set:
['charade', 0.30288437008857727]
['kinetic', 0.3002534508705139]
['shells', 0.29911646246910095]
['kites', 0.2987399995326996]
['7-9', 0.2962781488895416]
['showering', 0.2953910827636719]
['caribbean', 0.294752299785614]
['hide-and-go-seek', 0.2939240336418152]
['turbine', 0.2933803200721741]
['teenybopper', 0.29288050532341003]
['rock-paper-scissors', 0.2928623557090759]
['noisemaker', 0.2927709221839905]
['scuba-diving', 0.29180505871772766]
['yachting', 0.2907838821411133]
['cherub', 0.2905363440513611]
['swimmingpool', 0.290039986371994]
['coastline', 0.28998953104019165]
['Dinosaur', 0.2893030643463135]
['flip-flops', 0.28784963488578796]
['guardsman', 0.28728148341178894]
['frisbee', 0.28687697649002075]
['baltic', 0.28405341506004333]
['deprive', 0.28401875495910645]
['surfs', 0.2839275300502777]
['outwear', 0.28376665711402893]
['diverstiy', 0.28341981768608093]
['mid-air', 0.2829524278640747]
['kickboard', 0.28234976530075073]
['tanning', 0.281939834356308]
['admiration', 0.28123530745506287]
['Mediterranean', 0.281186580657959]
['cycles', 0.2807052433490753]
['teepee', 0.28070521354675293]
['progeny', 0.2775532305240631]
['starfish', 0.2775339186191559]
['romp', 0.27724218368530273]
['pebbles', 0.2771730124950409]
['waterpark', 0.27666303515434265]
['tarzan', 0.276429146528244]
['lighthouse', 0.2756190896034241]
['captain', 0.2755546569824219]
['popsicle', 0.2753356397151947]
['Pohoda', 0.2751699686050415]
['angelic', 0.27499720454216003]
['african-american', 0.27493417263031006]
['dam', 0.2747344970703125]
['aura', 0.2740659713745117]
['Caribbean', 0.2739778757095337]
['necking', 0.27346789836883545]
['sleight', 0.2733519673347473]
Here is the code I used to train word2vec:
import ast
import codecs
import csv
import logging
import multiprocessing
import os

import nltk
from gensim.models import word2vec as w2v


def train(data_filepath, epochs=300, num_features=300, min_word_count=2, context_size=7,
          downsampling=1e-3, seed=1, ckpt_filename=None):
    """
    Train a word2vec model.

    :param data_filepath: path of the data file in csv format
    :param epochs: number of training passes over the data
    :param num_features: vector size; increase to improve generality, more computationally expensive to train
    :param min_word_count: minimum frequency of a word; words with lower frequency are excluded from training
    :param context_size: context window length
    :param downsampling: downsample the most frequent words
    :param seed: makes results reproducible; the same seed produces the same results after training
    :returns: path of the checkpoint after training
    """
    if ckpt_filename is None:
        data_base_filename = os.path.basename(data_filepath)
        data_filename = os.path.splitext(data_base_filename)[0]
        ckpt_filename = data_filename + ".wv.ckpt"
    num_workers = multiprocessing.cpu_count()
    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
    nltk.download("punkt")
    nltk.download("stopwords")
    print("Training %s ..." % data_filepath)
    sentences = _get_sentences(data_filepath)
    word2vec = w2v.Word2Vec(
        sg=1,  # skip-gram
        seed=seed,
        workers=num_workers,
        size=num_features,
        min_count=min_word_count,
        window=context_size,
        sample=downsampling
    )
    word2vec.build_vocab(sentences)
    print("Word2vec vocab length: %d" % len(word2vec.wv.vocab))
    word2vec.train(sentences, total_examples=len(sentences), epochs=epochs)
    return _save_ckpt(word2vec, ckpt_filename)


def _save_ckpt(model, ckpt_filename):
    if not os.path.exists("checkpoints"):
        os.makedirs("checkpoints")
    ckpt_filepath = os.path.join("checkpoints", ckpt_filename)
    model.save(ckpt_filepath)
    return ckpt_filepath


def _get_sentences(data_filename):
    print("Found Data:")
    sentences = []
    print("Reading '{0}'...".format(data_filename))
    with codecs.open(data_filename, "r") as data_file:
        reader = csv.DictReader(data_file)
        for row in reader:
            sentences.append(ast.literal_eval(row["highscores"]))
    print("There are {0} sentences".format(len(sentences)))
    return sentences


if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser(description='Train Word2vec model')
    parser.add_argument('data_filepath', help='path to training CSV file.')
    args = parser.parse_args()
    train(args.data_filepath)
Here is a sample of the training data used for word2vec:
22751473,"[""lover"", ""sweetheart"", ""couple"", ""dietary"", ""meal""]"
28738542,"[""mallotus"", ""villosus"", ""shishamo"", ""smelt"", ""dried"", ""fish"", ""spirinchus"", ""lanceolatus""]"
25163686,"[""Snow"", ""Removal"", ""snow"", ""clearing"", ""female"", ""females"", ""woman"", ""women"", ""blower"", ""snowy"", ""road"", ""operate""]"
32837025,"[""milk"", ""breakfast"", ""drink"", ""cereal"", ""eating""]"
23828321,"[""jogging"", ""female"", ""females"", ""lady"", ""woman"", ""women"", ""running"", ""person""]"
22874156,"[""lover"", ""sweetheart"", ""heterosexual"", ""couple"", ""man"", ""and"", ""woman"", ""consulting"", ""hear"", ""listening""]"
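For reference, a small sketch of how _get_sentences turns one of these rows into a token list (this assumes the CSV has a header row with a highscores column, as the code implies; csv.DictReader unescapes the doubled quotes):

import ast

# The field as it looks after csv.DictReader has parsed the row
row = {"highscores": '["milk", "breakfast", "drink", "cereal", "eating"]'}
sentence = ast.literal_eval(row["highscores"])
print(sentence)  # ['milk', 'breakfast', 'drink', 'cereal', 'eating']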
For prediction, I simply apply the following function to a set of keywords:
most_similar_cosmul
I would like to know whether it is possible to find related keywords with word2vec. If not, what machine learning model would be better suited for this? Any insight would be very helpful.
Answer 0 (score: 1)
When you supply multiple positive-word examples, such as ['girl', 'kite', 'beach'], to most_similar() / most_similar_cosmul(), the vectors for those words are first averaged together, and then a list of the words most similar to that average is returned. Those words may not be as obviously related to any single one of the inputs as they would be in a simple check of one word alone. So:
When you try most_similar() (or most_similar_cosmul()) on a single word, what kind of results do you get? Do they seem related to the input word in the way you care about?
If not, there are deeper problems with your setup that should be fixed before you try multi-word similarities.
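A minimal sketch of that single-word sanity check, using the same gensim 3.x API as the question's code (the checkpoint path is hypothetical):

from gensim.models import word2vec as w2v

model = w2v.Word2Vec.load("checkpoints/data.wv.ckpt")

# Check each keyword on its own before averaging them together
for word in ["girl", "kite", "beach"]:
    if word in model.wv.vocab:
        print(word, model.wv.most_similar(positive=[word], topn=5))
    else:
        print(word, "is not in the vocabulary")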
Word2Vec gets its usual results from (1) lots of training data and (2) natural-language sentences. With enough data, the typical number of epochs training passes (and thus the default) is just 5. You can sometimes compensate for having less data by using more epoch iterations or a smaller vector size, but not always.
It is not clear how much data you have. Also, your example rows are not real natural-language sentences: they appear to have had some other preprocessing/reordering applied. That can hurt rather than help.
Word vectors typically improve when you throw away more low-frequency words (by increasing min_count above the default of 5, rather than reducing it to 2). Low-frequency words don't have enough examples to get good vectors, and the few examples they do have, even when repeated over many iterations, tend to be idiosyncratic examples of those words' usage rather than the broadly generalizable representations you would get from many varied examples. And by keeping such doomed words in the training data, you also interfere with the training of the other, more frequent words. (When you get a word that you don't think belongs in a most-similar ranking, it may be a rare word that, given its few occurrence contexts, found its way to those coordinates as the least-bad location among many other unhelpful coordinates.)
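A small sketch of that change, again via the question's train() function (min_count=5 matches gensim's default; the data path is a placeholder):

from gensim.models import word2vec as w2v

# Discard words seen fewer than 5 times instead of keeping near-singletons
ckpt = train("data.csv", min_word_count=5)

# Confirm how much the vocabulary shrank
model = w2v.Word2Vec.load(ckpt)
print("vocab size with min_count=5:", len(model.wv.vocab))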
If you do get good results from single-word checks but not from the average of multiple words, the results might improve with more and better data or with adjusted training parameters, but to achieve that you would need a more rigorous definition of what you consider good results. (Your existing list doesn't look that bad to me: it includes many words related to sun/sand/beach activities.)
On the other hand, your expectations of Word2Vec may simply be too high: the average of ['girl', 'kite', 'beach'] may not necessarily be closer to those desired words than the single words themselves, or may only become so with a lot of dataset/parameter tweaking.
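To make the averaging behavior concrete, here is a rough sketch of what most_similar() does with multiple positive words (gensim 3.x API; gensim's internals differ in minor details, and the checkpoint path is hypothetical):

import numpy as np
from gensim.models import word2vec as w2v

model = w2v.Word2Vec.load("checkpoints/data.wv.ckpt")

# Average the unit-normalized vectors of the query words...
words = ["girl", "kite", "beach"]
mean = np.mean([model.wv.word_vec(w, use_norm=True) for w in words], axis=0)

# ...then rank the vocabulary by cosine similarity to that average.
# A result can be close to the average without being close to any single word.
print(model.wv.similar_by_vector(mean, topn=10))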