How can I access/use Google's pre-trained Word2Vec model without downloading it manually?

Time: 2019-09-18 03:05:58

Tags: python google-cloud-platform nlp google-compute-engine word2vec

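I want to analyze some text with a Word2Vec model on a Google Compute Engine server on Google Cloud Platform (GCP).

However, the uncompressed word2vec model from https://mccormickml.com/2016/04/12/googles-pretrained-word2vec-model-in-python/ is over 3.5 GB, and downloading it manually and then uploading it to a cloud instance would take a while.

Is there a way to access this (or any other) pre-trained Word2Vec model on a Google Compute server without uploading it myself?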

3 answers:

Answer 0 (score: 4)

You can also use Gensim to download them through its downloader API:

import gensim.downloader as api
path = api.load("word2vec-google-news-300", return_path=True)
print(path)

Or from the command line:

python -m gensim.downloader --download <dataname> # same as api.load(dataname, return_path=True)

For the list of available datasets, see: https://github.com/RaRe-Technologies/gensim-data
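
Once the download has finished, the vectors still have to be loaded into memory. A minimal sketch, assuming gensim is installed: calling `api.load` without `return_path=True` returns the loaded vectors instead of a file path. The helper names `load_google_news_vectors` and `nearest_words` are made up for illustration and are not part of gensim.

```python
def load_google_news_vectors(name="word2vec-google-news-300"):
    """Download the model on first use and return the loaded vectors."""
    # Imported lazily so that defining this helper does not require gensim.
    import gensim.downloader as api
    return api.load(name)


def nearest_words(vectors, word, topn=5):
    """Return the `topn` words closest to `word`, without the similarity scores."""
    # most_similar yields (word, similarity) pairs, best match first.
    return [neighbor for neighbor, _score in vectors.most_similar(word, topn=topn)]
```

Typical use would be `vectors = load_google_news_vectors()` followed by `nearest_words(vectors, "computer")`. The first call fetches the archive straight onto the instance, so nothing has to be uploaded by hand.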

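Answer 1 (score: 1)

As an alternative to downloading things manually, you can use a pre-packaged version on Kaggle Datasets (a third-party copy, not published by Google).

First, sign up for Kaggle and get your API credentials: https://github.com/Kaggle/kaggle-api#api-credentials

Then, on the command line: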

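pip3 install kaggle
# Keep the credentials under $HOME/.kaggle and the data under $HOME/content.
mkdir -p "$HOME/.kaggle" "$HOME/content"
echo '{"username":"****","key":"****"}' > "$HOME/.kaggle/kaggle.json"
chmod 600 "$HOME/.kaggle/kaggle.json"
kaggle datasets download -p "$HOME/content" alvations/vegetables-google-word2vec
unzip "$HOME/content/vegetables-google-word2vec.zip" -d "$HOME/content"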
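
The credential step can also be done from Python, for example from a notebook cell or a provisioning script. This is only a sketch: the function name and its `home` parameter are hypothetical, while the `.kaggle/kaggle.json` location and the 600 permission mirror the shell commands above.

```python
import json
import os


def write_kaggle_credentials(username, key, home=None):
    """Write kaggle.json under <home>/.kaggle and restrict it to the owner."""
    home = home or os.path.expanduser("~")
    config_dir = os.path.join(home, ".kaggle")
    os.makedirs(config_dir, exist_ok=True)
    path = os.path.join(config_dir, "kaggle.json")
    with open(path, "w") as fp:
        json.dump({"username": username, "key": key}, fp)
    # Same effect as `chmod 600`: readable and writable by the owner only.
    os.chmod(path, 0o600)
    return path
```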

Finally, in Python:

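import os

import numpy as np

home = os.environ["HOME"]
# Embedding matrix: one row of 300 values per token.
embeddings = np.load(os.path.join(home, 'content/word2vec.news.negative-sample.300d.npy'))
# Vocabulary file: one token per line, indexed in step with the matrix rows.
with open(os.path.join(home, 'content/word2vec.news.negative-sample.300d.txt')) as fp:
    tokens = [line.strip() for line in fp]
# Look up the vector for 'hello' through its position in the vocabulary.
embeddings[tokens.index('hello')]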
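
Note that `tokens.index('hello')` scans the whole list on every call, which becomes slow with a large vocabulary when many words are looked up. A small sketch of a dictionary-based lookup plus a cosine similarity follows; `build_index` and `cosine` are illustrative names, and plain Python is used so the snippet runs without NumPy (rows of the `embeddings` array can still be passed in directly).

```python
import math


def build_index(tokens):
    """Map each token to the position of its first occurrence."""
    index = {}
    for position, token in enumerate(tokens):
        # setdefault keeps the first position, matching list.index().
        index.setdefault(token, position)
    return index


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

With the arrays loaded above, `index = build_index(tokens)` is built once, after which `cosine(embeddings[index['hello']], embeddings[index['hi']])` compares two words, assuming both are in the vocabulary.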

Full example on Colab: https://colab.research.google.com/drive/178WunB1413VE2SHe5d5gc0pqAd5v6Cpl


P.S.: To access other pre-packaged word embeddings, see https://github.com/alvations/vegetables

Answer 2 (score: 0)

The following code does the job in about 10 seconds on Colab (or any other Jupyter notebook):

result = !wget --save-cookies cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=0B7XkCwpI5KDYNlNUTTlSS21pQmM' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1/p'
code = result[-1]
arg =' --load-cookies cookies.txt "https://docs.google.com/uc?export=download&confirm=%s&id=0B7XkCwpI5KDYNlNUTTlSS21pQmM" -O GoogleNews-vectors-negative300.bin.gz' % code
!wget $arg

If you need this inside a Python script, replace the wget calls with the requests library:

import re
import shutil

import requests

# First request: the response is expected to contain a confirm=... token.
url1 = 'https://docs.google.com/uc?export=download&id=0B7XkCwpI5KDYNlNUTTlSS21pQmM'
resp = requests.get(url1)
code = re.findall(r'.*confirm=([0-9A-Za-z_]+).*', str(resp.content))
# Second request: repeat the download with the token and the cookies from the first response.
url2 = "https://docs.google.com/uc?export=download&confirm=%s&id=0B7XkCwpI5KDYNlNUTTlSS21pQmM" % code[0]
with requests.get(url2, stream=True, cookies=resp.cookies) as r:
    with open('GoogleNews-vectors-negative300.bin.gz', 'wb') as f:
        # Stream the body to disk instead of holding it in memory.
        shutil.copyfileobj(r.raw, f)
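
If Google Drive does not hand back the archive (for example because the confirmation step behaves differently), the saved file may be an HTML page rather than the model. The following sketch, using only the standard library, checks for the gzip signature before decompressing; the function names are hypothetical.

```python
import gzip
import shutil

GZIP_MAGIC = b"\x1f\x8b"  # the first two bytes of every gzip file


def looks_like_gzip(path):
    """Return True if the file starts with the gzip signature."""
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC


def gunzip(src, dst):
    """Decompress `src` into `dst` in a streaming fashion."""
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
```

A possible follow-up, assuming gensim is installed, is to load the decompressed file with `gensim.models.KeyedVectors.load_word2vec_format(path, binary=True)`.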