GCP: Cannot SSH into a new GPU deep learning VM instance

Time: 2019-12-23 10:40:44

Tags: google-cloud-platform google-compute-engine

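When I create a brand-new GCE VM instance with a GPU and the GPU-optimized Debian image, I cannot SSH into it, either through the browser SSH window or with a third-party SSH client (after uploading the public key).

I tried the suggestions here, but they did not help.

If I create an instance without a GPU and with a standard Ubuntu image, SSH works out of the box.

Am I missing something about GPU deep learning instances?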

Edit:

gcloud command to reproduce:

gcloud beta compute --project=avid-compound-233309 instances create instance-1 --zone=us-central1-a --machine-type=n1-standard-1 --subnet=default --network-tier=PREMIUM --maintenance-policy=TERMINATE --service-account=105060870131-compute@developer.gserviceaccount.com --scopes=https://www.googleapis.com/auth/devstorage.read_only,https://www.googleapis.com/auth/logging.write,https://www.googleapis.com/auth/monitoring.write,https://www.googleapis.com/auth/servicecontrol,https://www.googleapis.com/auth/service.management.readonly,https://www.googleapis.com/auth/trace.append --accelerator=type=nvidia-tesla-k80,count=1 --image=c0-common-gce-gpu-image-20191213 --image-project=ml-images --boot-disk-size=50GB --boot-disk-type=pd-standard --boot-disk-device-name=instance-1 --reservation-affinity=any

Yes, it happens immediately after the VM is created, and the serial port 1 log contains a large number of errors. A short excerpt:

[    9.393769] google_accounts_daemon[692]:     File "<frozen importlib._bootstrap>", line 574, in module_from_spec
[    9.394022] google_accounts_daemon[692]:   AttributeError: 'NoneType' object has no attribute 'loader'
[    9.394250] google_accounts_daemon[692]: Remainder of file ignored
[    9.394504] google_accounts_daemon[692]: Traceback (most recent call last):
[    9.394767] google_accounts_daemon[692]:   File "/usr/bin/google_accounts_daemon", line 6, in <module>
[    9.395108] google_accounts_daemon[692]:     from pkg_resources import load_entry_point
[    9.395344] google_accounts_daemon[692]:   File "/usr/local/lib/python3.5/dist-packages/pkg_resources/__init__.py", line 57, in <module>
[    9.395502] google_accounts_daemon[692]:     from pkg_resources.extern import six
[    9.395719] google_accounts_daemon[692]: ImportError: No module named 'pkg_resources.extern'
Dec 23 19:40:05 localhost google_accounts_daemon[692]:     File "/usr/lib/python3.5/site.py", line 173, in addpackage
Dec 23 19:40:05 localhost google_accounts_daemon[692]:       exec(line)
Dec 23 19:40:05 localhost google_accounts_daemon[692]:     File "<string>", line 1, in <module>
Dec 23 19:40:05 localhost google_accounts_daemon[692]:     File "<frozen importlib._bootstrap>", line 574, in module_from_spec
Dec 23 19:40:05 localhost google_accounts_daemon[692]:   AttributeError: 'NoneType' object has no attribute 'loader'
Dec 23 19:40:05 localhost google_accounts_daemon[692]: Remainder of file ignored
Dec 23 19:40:05 localhost google_accounts_daemon[692]: Traceback (most recent call last):
Dec 23 19:40:05 localhost google_accounts_daemon[692]:   File "/usr/bin/google_accounts_daemon", line 6, in <module>
Dec 23 19:40:05 localhost google_accounts_daemon[692]:     from pkg_resources import load_entry_point
Dec 23 19:40:05 localhost google_accounts_daemon[692]:   File "/usr/local/lib/python3.5/dist-packages/pkg_resources/__init__.py", line 57, in <module>
Dec 23 19:40:05 localhost google_accounts_daemon[692]:     from pkg_resources.extern import six
Dec 23 19:40:05 localhost google_accounts_daemon[692]: ImportError: No module named 'pkg_resources.extern'
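The excerpt above shows the same traceback twice, once with a kernel-style timestamp prefix and once with a syslog prefix. As a rough reading aid, the following sketch pulls the distinct final exception lines for one daemon out of saved serial console text. The function name `final_exceptions` and the regular expression are illustrative assumptions and are not part of any Google tooling.

```python
import re

# Matches both prefixes seen in the excerpt, for example:
#   "[    9.395719] google_accounts_daemon[692]: ImportError: ..."
#   "Dec 23 19:40:05 localhost google_accounts_daemon[692]: ImportError: ..."
_EXC_LINE = re.compile(
    r"(?P<unit>[\w.-]+)\[\d+\]:\s*(?P<msg>\w+(?:Error|Exception): .*)$"
)


def final_exceptions(log_text, unit="google_accounts_daemon"):
    """Return the distinct 'SomeError: message' lines logged by `unit`.

    Lines are returned in first-seen order; repeats are dropped.
    """
    seen = []
    for line in log_text.splitlines():
        match = _EXC_LINE.search(line)
        if match and match.group("unit") == unit:
            msg = match.group("msg")
            if msg not in seen:
                seen.append(msg)
    return seen
```

Applied to the log in this question, this leaves two messages: the AttributeError raised while processing a `.pth` file in `site.py`, and the ImportError for `pkg_resources.extern`. Both point at a broken Python package installation on the image rather than at the SSH keys.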

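1 Answer:

Answer 0 (score: 1)

It appears that the newly released image "GPU Optimized Debian m32 (with CUDA 10.0) (c0-common-gce-gpu-image-20191213)" contains a corrupted EXT4 file system. Directories, configuration files, and scripts contain garbage, so the initial configuration on first boot fails.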

Started Flush Journal to Persistent Storage.
Starting Create Volatile Files and Directories...
[ 4.880071] EXT4-fs error (device sda1): ext4_validate_inode_bitmap:98: comm systemd-tmpfile: Corrupt inode bitmap - block_group = 144, inode_bitmap = 4718608
[ 4.883559] EXT4-fs error (device sda1): ext4_validate_inode_bitmap:98: comm systemd-tmpfile: Corrupt inode bitmap - block_group = 145, inode_bitmap = 4718609
[ 4.887054] EXT4-fs error (device sda1): ext4_validate_inode_bitmap:98: comm systemd-tmpfile: Corrupt inode bitmap - block_group = 146, inode_bitmap = 4718610
...
localhost ssh-generate-hostkeys[485]: /etc/ssh/ssh_host_ecdsa_key.pub is not a public key file.
localhost dhclient[516]: 
localhost ssh-generate-hostkeys[485]: /etc/ssh/ssh_host_ed25519_key.pub is not a public key file.
localhost ssh-generate-hostk[  OK  ] Started Getty on tty1.
...
keys[485]: /etc/ssh/ssh_host_rsa_key.pub is not a public key file.
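To check whether another instance shows the same symptom, one option is to save its serial port 1 output to a file (for example with `gcloud compute instances get-serial-port-output`) and scan it for the EXT4 messages quoted above. The sketch below is only an illustration: the function name `corrupt_block_groups` is hypothetical, and the pattern assumes the kernel message format matches the lines shown in this answer.

```python
import re
from collections import defaultdict

# Assumed format, taken from the log lines above:
#   "EXT4-fs error (device sda1): ...: Corrupt inode bitmap - block_group = 144, ..."
_EXT4_ERROR = re.compile(
    r"EXT4-fs error \(device (?P<dev>[^)]+)\):.*?block_group = (?P<group>\d+)"
)


def corrupt_block_groups(log_text):
    """Map each device name to the sorted block groups reported in EXT4-fs errors."""
    groups = defaultdict(set)
    for match in _EXT4_ERROR.finditer(log_text):
        groups[match.group("dev")].add(int(match.group("group")))
    return {dev: sorted(found) for dev, found in groups.items()}
```

An empty result means no such messages were found in the text that was scanned. It does not prove that the disk is healthy.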

A public issue was recently opened on the Public Issue Tracker: https://issuetracker.google.com/146807209

It should be fixed soon.