I want to use a Word2Vec model to analyze some text on a Google Compute Engine instance on Google Cloud Platform (GCP).
However, the uncompressed word2vec model from https://mccormickml.com/2016/04/12/googles-pretrained-word2vec-model-in-python/ is over 3.5GB, and downloading it manually and then uploading it to a cloud instance takes a while.
Is there a way to access this (or any other) pre-trained Word2Vec model on a Google Compute instance without uploading it myself?
Answer 0 (score: 4)
You can also download it with Gensim through its downloader API:
import gensim.downloader as api

# Downloads the vectors on first use (cached afterwards) and returns the local path.
path = api.load("word2vec-google-news-300", return_path=True)
print(path)
Or from the command line:
python -m gensim.downloader --download <dataname> # same as api.load(dataname, return_path=True)
For the list of available datasets, check: https://github.com/RaRe-Technologies/gensim-data
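If you want the vectors loaded straight into memory rather than just the file path, a minimal sketch (assuming gensim is installed; the query word is an arbitrary example):

import gensim.downloader as api

# Without return_path=True, api.load returns the model itself (a KeyedVectors instance).
model = api.load("word2vec-google-news-300")
print(model.most_similar("coffee", topn=3))  # nearest neighbours by cosine similarity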
Answer 1 (score: 1)
Other than downloading it manually, you can use a pre-packaged version from Kaggle Datasets (a third-party copy, not from Google).
First, register on Kaggle and get your API credentials: https://github.com/Kaggle/kaggle-api#api-credentials
Then, on the command line:
pip3 install kaggle
mkdir -p $HOME/.kaggle/
echo '{"username":"****","key":"****"}' > $HOME/.kaggle/kaggle.json
chmod 600 $HOME/.kaggle/kaggle.json
kaggle datasets download alvations/vegetables-google-word2vec -p $HOME/content/
unzip $HOME/content/vegetables-google-word2vec.zip -d $HOME/content/
Finally, in Python:
import os

import numpy as np

home = os.environ["HOME"]

# The .npy file holds the embedding matrix; the .txt file holds one token per line,
# in the same row order as the matrix.
embeddings = np.load(os.path.join(home, 'content/word2vec.news.negative-sample.300d.npy'))
with open(os.path.join(home, 'content/word2vec.news.negative-sample.300d.txt')) as fp:
    tokens = [line.strip() for line in fp]

# Row lookup: the vector for "hello".
embeddings[tokens.index('hello')]
Full example on Colab: https://colab.research.google.com/drive/178WunB1413VE2SHe5d5gc0pqAd5v6Cpl
P/S: For other pre-packaged word embeddings, see https://github.com/alvations/vegetables
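Note that tokens.index() scans the whole list on every call. If you will look up many words, a small sketch that builds the index once (the helper names here are my own, not part of the dataset):

# Hypothetical helper: map each token to its row so lookups are O(1).
token2idx = {tok: i for i, tok in enumerate(tokens)}

def vector(word):
    """Return the 300-d vector for `word`, or None if out of vocabulary."""
    idx = token2idx.get(word)
    return embeddings[idx] if idx is not None else None

print(vector('hello')[:5])  # first five dimensions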
Answer 2 (score: 0)
The following code will do the job on Colab (or any other Jupyter notebook) in about 10 seconds:
# Step 1: hit the Drive URL once to obtain the large-file confirmation token.
result = !wget --save-cookies cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=0B7XkCwpI5KDYNlNUTTlSS21pQmM' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1/p'
code = result[-1]
# Step 2: re-request with the token and the saved cookies to start the real download.
arg = ' --load-cookies cookies.txt "https://docs.google.com/uc?export=download&confirm=%s&id=0B7XkCwpI5KDYNlNUTTlSS21pQmM" -O GoogleNews-vectors-negative300.bin.gz' % code
!wget $arg
If you need this inside a Python script, replace wget with the requests library:
import re
import shutil

import requests

# First request: grab the confirmation token that Google Drive issues for files
# too large to virus-scan.
url1 = 'https://docs.google.com/uc?export=download&id=0B7XkCwpI5KDYNlNUTTlSS21pQmM'
resp = requests.get(url1)
code = re.findall('.*confirm=([0-9A-Za-z_]+).*', str(resp.content))

# Second request: stream the actual file, reusing the cookies from the first call.
url2 = "https://docs.google.com/uc?export=download&confirm=%s&id=0B7XkCwpI5KDYNlNUTTlSS21pQmM" % code[0]
with requests.get(url2, stream=True, cookies=resp.cookies) as r:
    with open('GoogleNews-vectors-negative300.bin.gz', 'wb') as f:
        shutil.copyfileobj(r.raw, f)
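Whichever route you take, once GoogleNews-vectors-negative300.bin.gz is on the instance you can load it with gensim; a minimal sketch (assuming gensim is installed; it reads the .gz file directly without unpacking):

from gensim.models import KeyedVectors

# binary=True because the Google News vectors are in word2vec's binary format.
model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True)
print(model['hello'][:5])  # first five dimensions of the 300-d vector for "hello"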