When I run a Python script in HTCondor, the job terminates with the following error in its .log file:
006 (4069.000.000) 02/19 15:02:29 Image size of job updated: 1393668
1362 - MemoryUsage of job (MB)
1393668 - ResidentSetSize of job (KB)
...
006 (4069.000.000) 02/19 15:03:12 Image size of job updated: 33197416
1430 - MemoryUsage of job (MB)
1463300 - ResidentSetSize of job (KB)
...
005 (4069.000.000) 02/19 15:03:12 Job terminated.
(0) Abnormal termination (signal 11)
(0) No core file
Usr 0 00:00:09, Sys 0 00:00:40 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:09, Sys 0 00:00:40 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
0 - Run Bytes Sent By Job
4477484 - Run Bytes Received By Job
0 - Total Bytes Sent By Job
4477484 - Total Bytes Received By Job
Partitionable Resources : Usage Request Allocated
Cpus : 1 1
Disk (KB) : 4500 4500 1699801
Gpus : 0
Memory (MB) : 1430 5 5
...
What could cause this error, and how can I fix it?
After searching on Google, I found a mailing list post suggesting that I add the line
getenv=true
to the submit file. I did that, but it did not solve the problem; I still get the same error.
Thanks for any help or suggestions.
Answer 0: (score: 0)
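Signal 11 is SIGSEGV, a segmentation fault. The log shows that your script was terminated because it segfaulted, and that is not something Condor can do anything about. You need to debug the script and make sure it does not perform invalid memory accesses or anything else that triggers a segfault.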
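One possible way to start debugging is Python's standard-library faulthandler module, which prints the Python traceback when the interpreter receives a fatal signal such as SIGSEGV. The sketch below assumes the crash happens inside the Python process itself (usually in a compiled extension module) and that the submit file sets `error` so the job's stderr is kept. The names `enable_crash_traceback` and `main` are hypothetical placeholders, not part of the original script.

```python
import faulthandler
import sys


def enable_crash_traceback(stream=None):
    """Dump the Python traceback to `stream` on a fatal signal.

    faulthandler handles SIGSEGV, SIGFPE, SIGABRT, SIGBUS and SIGILL.
    The stream must stay open for the life of the process, because
    faulthandler keeps its file descriptor.
    """
    if stream is None:
        stream = sys.stderr
    faulthandler.enable(file=stream, all_threads=True)
    return faulthandler.is_enabled()


def main():
    # Hypothetical placeholder for the real work of the script.
    pass


if __name__ == "__main__":
    # Enable the handler as early as possible, before any heavy imports.
    enable_crash_traceback()
    main()
```

The same effect is available without editing the script, by running it as `python -X faulthandler script.py` or by setting the environment variable `PYTHONFAULTHANDLER=1`. The traceback only shows which Python line was executing at the time of the crash; it does not by itself explain the underlying invalid memory access.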
If Condor is set up correctly, I would also add notification settings to the job description file:
notification = Error
notify_user = my@email.com
With these settings, Condor will notify you when your job terminates abnormally.