Socket between submit and execute hosts closed unexpectedly

Time: 2014-11-24 15:02:37

Tags: sockets sas cluster-computing winscp

I am trying to run a SAS file on a cluster. The contents of the SAS file, myprogram.sas, are shown below:

data a;
   input myvar1;
   myvar2 = myvar1 + 100 ;
   datalines;
       0
       1
       2
       3
       4
       5
;
proc print;
run;

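I created a Condor file to execute the SAS file on the cluster. The contents of the Condor file, mycondorcode.condor, are shown below, except that I have changed the email address: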

####################
#
# Submit SAS code to Condor cluster
#
# Submit this to run on the cluster with condor_submit THIS-FILENAME.condor
#
####################

UNIVERSE                = vanilla
NOTIFICATION            = Complete
NOTIFY_USER             = mark.miller@zzz.org

REQUIREMENTS            = (OpSys == "LINUX" && HAS_SAS )
GETENV                  = TRUE

EXECUTABLE              = /usr/local/bin/sas
ARGUMENTS               = -nodms -noterminal
INPUT                   = myprogram.sas
OUTPUT                  = $(INPUT).out
ERROR                   = $(INPUT).err
LOG                     = $(INPUT).log

QUEUE

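I used an application called WinSCP.exe to copy the SAS file and the Condor file to the cluster. I wanted to convert the SAS file to a format the cluster can understand, I guess with the dos2unix command.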
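
For reference, here is a rough sketch of how the line endings could be checked and converted on the cluster, assuming a POSIX shell is available there. The function names has_crlf and strip_cr are made up for illustration, and strip_cr uses tr as a stand-in for dos2unix in case that tool is not installed. I have not confirmed that line endings are related to the problem.

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative and not part of Condor or SAS.

# Succeed (exit status 0) if the file contains at least one carriage return,
# which is what files saved with DOS/Windows line endings contain.
has_crlf() {
    grep -q "$(printf '\r')" "$1"
}

# Write a copy of file $1 to $2 with every carriage return removed.
# This is a stand-in for what dos2unix does to line endings.
strip_cr() {
    tr -d '\r' < "$1" > "$2"
}

# Example use (the output file name is an assumption):
#   if has_crlf myprogram.sas; then
#       strip_cr myprogram.sas myprogram.unix.sas
#   fi
```

The same check could be applied to mycondorcode.condor, since both files were copied from a Windows desktop.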

Then I used PuTTY to submit the Condor file to the cluster by typing:

condor_submit mycondorcode.condor

When I type:

condor_q

I get:

 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
58683.0   markm          11/24 14:41   0+00:00:00 I  0   0.0  sas -nodms -noterm

The status (ST) stays at I no matter how long I wait.

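I can see a text file named myprogram.sas.log in my directory with the following contents (except that I have changed the email address and the numbers that look like they might be IP addresses):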

000 (58683.000.000) 11/24 14:41:55 Job submitted from host: <14.4.104.1:42259>
...
022 (58683.000.000) 11/24 14:42:56 Job disconnected, attempting to reconnect
    Socket between submit and execute hosts closed unexpectedly
    Trying to reconnect to slot1@node13.hpc.zzz.org <14.4.104.23:50176>
...
024 (58683.000.000) 11/24 14:42:56 Job reconnection failed
    Job not found at execution machine
    Can not reconnect to slot1@node13.hpc.zzz.org, rescheduling job
...
022 (58683.000.000) 11/24 14:43:56 Job disconnected, attempting to reconnect
    Socket between submit and execute hosts closed unexpectedly
    Trying to reconnect to slot1@node13.hpc.zzz.org <14.4.104.23:50176>
...
024 (58683.000.000) 11/24 14:43:56 Job reconnection failed
    Job not found at execution machine
    Can not reconnect to slot1@node13.hpc.zzz.org, rescheduling job
...
022 (58683.000.000) 11/24 14:44:56 Job disconnected, attempting to reconnect
    Socket between submit and execute hosts closed unexpectedly
    Trying to reconnect to slot1@node13.hpc.zzz.org <14.4.104.23:50176>
...
024 (58683.000.000) 11/24 14:44:56 Job reconnection failed
    Job not found at execution machine
    Can not reconnect to slot1@node13.hpc.zzz.org, rescheduling job
...
022 (58683.000.000) 11/24 14:45:57 Job disconnected, attempting to reconnect
    Socket between submit and execute hosts closed unexpectedly
    Trying to reconnect to slot1@node13.hpc.zzz.org <14.4.104.23:50176>
...
024 (58683.000.000) 11/24 14:45:57 Job reconnection failed
    Job not found at execution machine
    Can not reconnect to slot1@node13.hpc.zzz.org, rescheduling job
...
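
To make the repetition in that log easier to see, the events can be tallied by code. This is only a sketch: summarize_events is a hypothetical helper name, and it assumes each event header line starts with a three-digit code followed by the job id in parentheses, a date, and a time, as in the listing above.

```shell
#!/bin/sh
# Hypothetical helper: count Condor user-log events by code and description.
# Assumed header format (taken from the log above):
#   022 (58683.000.000) 11/24 14:42:56 Job disconnected, attempting to reconnect
summarize_events() {
    awk '/^[0-9][0-9][0-9] \(/ {
             code = $1
             desc = $0
             # Drop the code, job id, date and time; the rest is the description.
             sub(/^[0-9]+ \([0-9.]+\) [^ ]+ [^ ]+ /, "", desc)
             count[code " " desc]++
         }
         END { for (k in count) print count[k], k }' "$1" | sort -k2
}

# Example use (log file name taken from the LOG line of the submit file):
#   summarize_events myprogram.sas.log
```

Each output line is a count followed by the event code and its description, so a disconnect and failed-reconnect pair that repeats every minute shows up as two lines with equal counts.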

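I have never successfully used this cluster, although I have run R on a different cluster. I know almost nothing about the current cluster. Based on what I have provided above, does it look like I am doing something wrong, or does it look like there is a connection problem that must be fixed by the IT department that operates the cluster?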

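Thank you for any suggestions. I am trying to work through this from a Windows desktop and am almost completely unfamiliar with Unix and clusters in general. Perhaps I did something wrong with WinSCP.exe or with the dos2unix conversion.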

0 answers:

No answers