I am trying to run a SAS file on a cluster. The contents of the SAS file, myprogram.sas, are shown below:
data a;
input myvar1;
myvar2 = myvar1 + 100 ;
datalines;
0
1
2
3
4
5
;
proc print;
run;
I created a Condor file to execute the SAS file on the cluster. The contents of the Condor file, mycondorcode.condor, are shown below, except that I changed the email address:
####################
#
# Submit SAS code to Condor cluster
#
# Submit this to run on the cluster with condor_submit THIS-FILENAME.condor
#
####################
UNIVERSE = vanilla
NOTIFICATION = Complete
NOTIFY_USER = mark.miller@zzz.org
REQUIREMENTS = (OpSys == "LINUX" && HAS_SAS )
GETENV = TRUE
EXECUTABLE = /usr/local/bin/sas
ARGUMENTS = -nodms -noterminal
INPUT = myprogram.sas
OUTPUT = $(INPUT).out
ERROR = $(INPUT).err
LOG = $(INPUT).log
QUEUE
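I copied the SAS file and the Condor file to the cluster using an application called WinSCP.exe, which I assume converts the files to a format the cluster can understand, much as I guess the dos2unix command would.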
I then used PuTTY to submit the Condor file to the cluster by typing:
condor_submit mycondorcode.condor
When I type:
condor_q
I see:
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
58683.0 markm 11/24 14:41 0+00:00:00 I 0 0.0 sas -nodms -noterm
The status (ST) stays at I no matter how long I wait.
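I can see a text file in my directory named myprogram.sas.log with the following contents (except that I have changed the email address and altered the numbers that look like they might be IP addresses):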
000 (58683.000.000) 11/24 14:41:55 Job submitted from host: <14.4.104.1:42259>
...
022 (58683.000.000) 11/24 14:42:56 Job disconnected, attempting to reconnect
Socket between submit and execute hosts closed unexpectedly
Trying to reconnect to slot1@node13.hpc.zzz.org <14.4.104.23:50176>
...
024 (58683.000.000) 11/24 14:42:56 Job reconnection failed
Job not found at execution machine
Can not reconnect to slot1@node13.hpc.zzz.org, rescheduling job
...
022 (58683.000.000) 11/24 14:43:56 Job disconnected, attempting to reconnect
Socket between submit and execute hosts closed unexpectedly
Trying to reconnect to slot1@node13.hpc.zzz.org <14.4.104.23:50176>
...
024 (58683.000.000) 11/24 14:43:56 Job reconnection failed
Job not found at execution machine
Can not reconnect to slot1@node13.hpc.zzz.org, rescheduling job
...
022 (58683.000.000) 11/24 14:44:56 Job disconnected, attempting to reconnect
Socket between submit and execute hosts closed unexpectedly
Trying to reconnect to slot1@node13.hpc.zzz.org <14.4.104.23:50176>
...
024 (58683.000.000) 11/24 14:44:56 Job reconnection failed
Job not found at execution machine
Can not reconnect to slot1@node13.hpc.zzz.org, rescheduling job
...
022 (58683.000.000) 11/24 14:45:57 Job disconnected, attempting to reconnect
Socket between submit and execute hosts closed unexpectedly
Trying to reconnect to slot1@node13.hpc.zzz.org <14.4.104.23:50176>
...
024 (58683.000.000) 11/24 14:45:57 Job reconnection failed
Job not found at execution machine
Can not reconnect to slot1@node13.hpc.zzz.org, rescheduling job
...
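I have never used this cluster successfully, although I have run R on a different cluster. I know next to nothing about the current cluster. Based on what I have provided above, does it look like I am doing something wrong, or does it look like there is a connection problem that must be addressed by the IT department that operates the cluster?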
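Thank you for any suggestions. I may be trying to solve this from the Windows desktop end, and I am almost completely unfamiliar with Unix or clusters in general. Perhaps I am doing something wrong with WinSCP.exe. Perhaps I should try running dos2unix on the files instead of relying on WinSCP?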
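For reference, here is a minimal sketch of how the line endings of the uploaded files could be checked, and stripped if necessary, from the PuTTY session without depending on dos2unix being installed. It assumes a POSIX shell with grep, tr, and mktemp available. The helper names count_crlf_lines and strip_cr and the throwaway demo file are hypothetical and not part of the cluster's setup.

```shell
# Sketch: detect and strip Windows (CRLF) line endings using standard tools.

# A literal carriage-return character, built portably.
cr=$(printf '\r')

# Print how many lines of the file contain a carriage return.
# grep exits non-zero when the count is 0, so "|| true" keeps the status clean.
count_crlf_lines() {
    grep -c "$cr" "$1" || true
}

# Write a copy of the file ($1) to a new path ($2) with carriage returns removed.
strip_cr() {
    tr -d '\r' < "$1" > "$2"
}

# Demo on a throwaway file written with CRLF endings.
demo=$(mktemp)
printf 'data a;\r\ninput myvar1;\r\n' > "$demo"
count_crlf_lines "$demo"        # prints 2
strip_cr "$demo" "$demo.unix"
count_crlf_lines "$demo.unix"   # prints 0
```

Running count_crlf_lines on myprogram.sas and mycondorcode.condor would show whether the transfer left Windows line endings in place. A result of 0 for both would suggest that line endings are not the cause of the job staying idle.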