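If I create a brand-new GCE VM instance with a GPU and the GPU-optimized Debian image, I cannot SSH into it, either through the browser SSH window or with a third-party SSH client (after uploading my public key).
I tried the suggestions here, but they did not help.
If I create an instance without a GPU and with the standard Ubuntu image, SSH works out of the box.
Am I missing something about GPU deep learning instances?
Edit:
The gcloud command to recreate the instance: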
gcloud beta compute --project=avid-compound-233309 instances create instance-1 --zone=us-central1-a --machine-type=n1-standard-1 --subnet=default --network-tier=PREMIUM --maintenance-policy=TERMINATE --service-account=105060870131-compute@developer.gserviceaccount.com --scopes=https://www.googleapis.com/auth/devstorage.read_only,https://www.googleapis.com/auth/logging.write,https://www.googleapis.com/auth/monitoring.write,https://www.googleapis.com/auth/servicecontrol,https://www.googleapis.com/auth/service.management.readonly,https://www.googleapis.com/auth/trace.append --accelerator=type=nvidia-tesla-k80,count=1 --image=c0-common-gce-gpu-image-20191213 --image-project=ml-images --boot-disk-size=50GB --boot-disk-type=pd-standard --boot-disk-device-name=instance-1 --reservation-affinity=any
Yes, it happens immediately after the VM is created, and the serial port 1 log contains a large number of errors. A short sample:
[ 9.393769] google_accounts_daemon[692]: File "<frozen importlib._bootstrap>", line 574, in module_from_spec
[ 9.394022] google_accounts_daemon[692]: AttributeError: 'NoneType' object has no attribute 'loader'
[ 9.394250] google_accounts_daemon[692]: Remainder of file ignored
[ 9.394504] google_accounts_daemon[692]: Traceback (most recent call last):
[ 9.394767] google_accounts_daemon[692]: File "/usr/bin/google_accounts_daemon", line 6, in <module>
[ 9.395108] google_accounts_daemon[692]: from pkg_resources import load_entry_point
[ 9.395344] google_accounts_daemon[692]: File "/usr/local/lib/python3.5/dist-packages/pkg_resources/__init__.py", line 57, in <module>
[ 9.395502] google_accounts_daemon[692]: from pkg_resources.extern import six
[ 9.395719] google_accounts_daemon[692]: ImportError: No module named 'pkg_resources.extern'
Dec 23 19:40:05 localhost google_accounts_daemon[692]: File "/usr/lib/python3.5/site.py", line 173, in addpackage
Dec 23 19:40:05 localhost google_accounts_daemon[692]: exec(line)
Dec 23 19:40:05 localhost google_accounts_daemon[692]: File "<string>", line 1, in <module>
Dec 23 19:40:05 localhost google_accounts_daemon[692]: File "<frozen importlib._bootstrap>", line 574, in module_from_spec
Dec 23 19:40:05 localhost google_accounts_daemon[692]: AttributeError: 'NoneType' object has no attribute 'loader'
Dec 23 19:40:05 localhost google_accounts_daemon[692]: Remainder of file ignored
Dec 23 19:40:05 localhost google_accounts_daemon[692]: Traceback (most recent call last):
Dec 23 19:40:05 localhost google_accounts_daemon[692]: File "/usr/bin/google_accounts_daemon", line 6, in <module>
Dec 23 19:40:05 localhost google_accounts_daemon[692]: from pkg_resources import load_entry_point
Dec 23 19:40:05 localhost google_accounts_daemon[692]: File "/usr/local/lib/python3.5/dist-packages/pkg_resources/__init__.py", line 57, in <module>
Dec 23 19:40:05 localhost google_accounts_daemon[692]: from pkg_resources.extern import six
Dec 23 19:40:05 localhost google_accounts_daemon[692]: ImportError: No module named 'pkg_resources.extern'
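For anyone reproducing this, the serial port 1 output quoted above can also be pulled from the command line rather than the Cloud Console. The following is a minimal sketch, assuming the gcloud CLI is installed and authenticated; the helper names `serial_output_command` and `fetch_serial_output` are made up for illustration, and only the `gcloud compute instances get-serial-port-output` subcommand itself comes from the CLI.

```python
import subprocess


def serial_output_command(instance, zone, project=None, port=1):
    """Build the gcloud argv that prints an instance's serial console output.

    Hypothetical helper: it only assembles the argument list.
    """
    cmd = [
        "gcloud", "compute", "instances", "get-serial-port-output",
        instance,
        f"--zone={zone}",
        f"--port={port}",
    ]
    if project:
        cmd.append(f"--project={project}")
    return cmd


def fetch_serial_output(instance, zone, project=None, port=1):
    """Run gcloud and return the serial log as text.

    Assumes gcloud is on PATH and already authenticated; raises
    subprocess.CalledProcessError if the command fails.
    """
    result = subprocess.run(
        serial_output_command(instance, zone, project, port),
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```

Calling `fetch_serial_output("instance-1", "us-central1-a")` would return the same kind of log text shown in this question, which can then be searched for the `google_accounts_daemon` tracebacks.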
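Answer 0 (score: 1)
It seems that the newly released image "GPU Optimized Debian m32 (with CUDA 10.0)" (c0-common-gce-gpu-image-20191213) contains a corrupted EXT4 file system. Directories, configuration files, and scripts contain garbage, so the initial configuration on first boot fails.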
Started Flush Journal to Persistent Storage.
Starting Create Volatile Files and Directories...
[ 4.880071] EXT4-fs error (device sda1): ext4_validate_inode_bitmap:98: comm systemd-tmpfile: Corrupt inode bitmap - block_group = 144, inode_bitmap = 4718608
[ 4.883559] EXT4-fs error (device sda1): ext4_validate_inode_bitmap:98: comm systemd-tmpfile: Corrupt inode bitmap - block_group = 145, inode_bitmap = 4718609
[ 4.887054] EXT4-fs error (device sda1): ext4_validate_inode_bitmap:98: comm systemd-tmpfile: Corrupt inode bitmap - block_group = 146, inode_bitmap = 4718610
...
localhost ssh-generate-hostkeys[485]: /etc/ssh/ssh_host_ecdsa_key.pub is not a public key file.
localhost dhclient[516]:
localhost ssh-generate-hostkeys[485]: /etc/ssh/ssh_host_ed25519_key.pub is not a public key file.
localhost ssh-generate-hostk[ [0;32m OK [0m] Started Getty on tty1.
...
keys[485]: /etc/ssh/ssh_host_rsa_key.pub is not a public key file.
There is a recently created public issue on the Public Issue Tracker: https://issuetracker.google.com/146807209
It should be fixed soon.
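To check whether another instance booted from this image shows the same symptoms, the serial log can be scanned for the two patterns quoted in this answer: EXT4-fs errors and host key files that are "not a public key file". This is only a rough sketch; the function name `summarize_serial_log` and the regular expressions are my own assumptions based on the log lines above, not an official diagnostic.

```python
import re
from collections import Counter

# Patterns modeled on the log lines quoted in this answer (assumed format).
EXT4_ERROR = re.compile(
    r"EXT4-fs error \(device (?P<device>\w+)\):.*?block_group = (?P<group>\d+)"
)
BAD_HOST_KEY = re.compile(
    r"(?P<path>/etc/ssh/\S+\.pub) is not a public key file"
)


def summarize_serial_log(text):
    """Count EXT4-fs errors per device and list unreadable SSH host key files."""
    errors_per_device = Counter()
    bad_keys = []
    for line in text.splitlines():
        ext4 = EXT4_ERROR.search(line)
        if ext4:
            errors_per_device[ext4.group("device")] += 1
        key = BAD_HOST_KEY.search(line)
        if key and key.group("path") not in bad_keys:
            bad_keys.append(key.group("path"))
    return {
        "ext4_errors": dict(errors_per_device),
        "bad_host_keys": bad_keys,
    }
```

A non-empty result for either key suggests the boot disk is affected in the same way as described here; an empty result does not prove the disk is healthy.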