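I am spawning a Spark cluster using Azure Functions inside an Azure VNet. Each Spark node runs in its own Azure Container Instances (ACI) container group. Here is how it works: I spawn a master node ACI group and obtain its IP address, then spawn a worker node ACI group, passing the master node's IP address when I create the worker. The problem I am facing is that when I submit a job with spark-submit, the job never completes and I get the following errors: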
19/05/08 13:35:26 INFO BlockManagerMaster: Removal of executor 1 requested
19/05/08 13:35:26 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 1
19/05/08 13:35:26 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20190508133523-0015/2 on worker-20190508124850-10.0.0.6-33015 (10.0.0.6:33015) with 1 core(s)
19/05/08 13:35:26 INFO StandaloneSchedulerBackend: Granted executor ID app-20190508133523-0015/2 on hostPort 10.0.0.6:33015 with 1 core(s), 1024.0 MB RAM
19/05/08 13:35:26 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20190508133523-0015/2 is now RUNNING
19/05/08 13:35:28 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20190508133523-0015/2 is now EXITED (Command exited with code 1)
19/05/08 13:35:28 INFO StandaloneSchedulerBackend: Executor app-20190508133523-0015/2 removed: Command exited with code 1
19/05/08 13:35:28 INFO BlockManagerMasterEndpoint: Trying to remove executor 2 from BlockManagerMaster.
19/05/08 13:35:28 INFO BlockManagerMaster: Removal of executor 2 requested
19/05/08 13:35:28 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 2
19/05/08 13:35:28 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20190508133523-0015/3 on worker-20190508124850-10.0.0.6-33015 (10.0.0.6:33015) with 1 core(s)
19/05/08 13:35:28 INFO StandaloneSchedulerBackend: Granted executor ID app-20190508133523-0015/3 on hostPort 10.0.0.6:33015 with 1 core(s), 1024.0 MB RAM
19/05/08 13:35:28 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20190508133523-0015/3 is now RUNNING
19/05/08 13:35:30 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20190508133523-0015/3 is now EXITED (Command exited with code 1)
19/05/08 13:35:30 INFO StandaloneSchedulerBackend: Executor app-20190508133523-0015/3 removed: Command exited with code 1
19/05/08 13:35:30 INFO BlockManagerMasterEndpoint: Trying to remove executor 3 from BlockManagerMaster.
19/05/08 13:35:30 INFO BlockManagerMaster: Removal of executor 3 requested
19/05/08 13:35:30 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 3
19/05/08 13:35:30 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20190508133523-0015/4 on worker-20190508124850-10.0.0.6-33015 (10.0.0.6:33015) with 1 core(s)
19/05/08 13:35:30 INFO StandaloneSchedulerBackend: Granted executor ID app-20190508133523-0015/4 on hostPort 10.0.0.6:33015 with 1 core(s), 1024.0 MB RAM
19/05/08 13:35:30 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20190508133523-0015/4 is now RUNNING
19/05/08 13:35:32 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20190508133523-0015/4 is now EXITED (Command exited with code 1)
19/05/08 13:35:32 INFO StandaloneSchedulerBackend: Executor app-20190508133523-0015/4 removed: Command exited with code 1
19/05/08 13:35:32 INFO BlockManagerMasterEndpoint: Trying to remove executor 4 from BlockManagerMaster.
19/05/08 13:35:32 INFO BlockManagerMaster: Removal of executor 4 requested
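After a lot of research, I found that I have to add entries to the /etc/hosts file on every node of the cluster, one entry per node, similar to the following: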
10.0.0.4 spark-master
10.0.0.5 spark-worker-1
10.0.0.6 spark-worker-2
10.0.0.7 spark-driver
I made the above entries manually, and the job then ran successfully.
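However, how can I make these entries programmatically (from the Azure Function itself)? That is, once the Azure Container Instances are running, how do I get the IP address and hostname of every node in the cluster and add an entry for each of the other nodes to every node's /etc/hosts file?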
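
For illustration only, the hosts-file step on its own could look roughly like the sketch below. It assumes the hostname-to-IP mapping has already been collected somehow (how to obtain it from ACI is the open part of the question) and that the code runs inside each container with permission to write /etc/hosts. The helper names `render_hosts_entries` and `append_hosts_entries` are hypothetical, not part of any Azure or Spark API.

```python
from pathlib import Path


def render_hosts_entries(nodes):
    """Return /etc/hosts text with one "IP hostname" line per node."""
    return "".join(f"{ip} {name}\n" for name, ip in nodes.items())


def append_hosts_entries(nodes, hosts_path="/etc/hosts"):
    """Append entries that are not already in the hosts file.

    Returns the list of lines that were actually added, so calling it
    twice with the same mapping does not create duplicates.
    """
    path = Path(hosts_path)
    existing = path.read_text() if path.exists() else ""
    existing_lines = {line.strip() for line in existing.splitlines()}
    new_lines = [
        f"{ip} {name}"
        for name, ip in nodes.items()
        if f"{ip} {name}" not in existing_lines
    ]
    if new_lines:
        with path.open("a") as fh:
            # Make sure the appended block starts on its own line.
            if existing and not existing.endswith("\n"):
                fh.write("\n")
            fh.write("\n".join(new_lines) + "\n")
    return new_lines


# Hypothetical mapping that mirrors the manual entries shown above.
nodes = {
    "spark-master": "10.0.0.4",
    "spark-worker-1": "10.0.0.5",
    "spark-worker-2": "10.0.0.6",
    "spark-driver": "10.0.0.7",
}
print(render_hosts_entries(nodes), end="")
```

In this sketch each container would call `append_hosts_entries(nodes)` at startup, before the Spark master or worker process is launched. The mapping would have to be passed in from outside, for example through an environment variable set by the Azure Function, which is itself an assumption about the setup.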