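I am a physics student trying to run a research-related simulation that has a stochastic element. The simulation can be split into several non-interacting parts, each of which evolves randomly, so no interaction between runs is needed.
I use a different code/file to analyze the results returned by all the jobs later on (this is unrelated to the problem and is mentioned only to give a clear picture of what is going on).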
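I use the institute's HPC (which I'll call the "cluster") to run multiple copies of my code, which is a single .py file that does not read anything from any other file (but does create output files). Each copy/realization of the code is supposed to create a unique working directory for itself, using os.makedirs(path, exist_ok=True) followed by os.chdir(path).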
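I have made many attempts to run this, and they end with the following types of behavior:
These behaviors seem completely random to me, since I cannot tell in advance which array jobs will work perfectly and which will show behavior 2, behavior 3, or both (for a large job array it may be that some jobs run fine while others show behaviors 2 and 3 together, or only 2, or only 3).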
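I have tried everything I could find online. For example, I read somewhere that a common problem with os.makedirs is a umask issue, and that calling os.umask(0) before it is good practice, so I added that. I also read that the cluster can sometimes hang, so calling time.sleep for a few seconds and then trying again might work, so I did that as well. Nothing has solved the problem...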
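I am attaching the part of the code that may be the culprit, for inspection. N, L, T and DT are numbers that I set in the code, where I also import the libraries and so on (note that my office computer runs Windows while the cluster runs Linux, so I simply use os.name to set my directories according to the operating system I am running on, so that the code can run on both systems without modification):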
import datetime
import os
import time

# N, L, T and DT are the simulation parameters set earlier in the script.
when = datetime.datetime.now()
date = when.date()
hour = when.time()  # assumed definition: `hour` is used below but was not shown in the snippet
worker_num = os.environ['LSB_JOBINDEX']  # index of this job within the LSF job array
pid = os.environ['LSB_JOBID']  # LSF job ID
work = 'worker' + worker_num
txt_file = 'N{}_L{}_T{}_DT{}'.format(N, L, T, DT)
if os.name == 'nt':
    path = 'D:/My files/Python Scripts/Cluster/{}/{}/{}'.format(date, txt_file, work)
else:
    path = '/home/labs/{}/{}/{}'.format(date, txt_file, work)
os.umask(0)
try:
    os.makedirs(path, exist_ok=True)
    os.chdir(path)
except OSError:
    # Wait, record the failure in an error file, then retry once.
    time.sleep(10)
    with open('/home/labs/error_{}_{}.txt'.format(txt_file, work), 'a+') as f:
        f.write('In {}, at time {}, job ID: {}, which was sent to queue: {}, working on host: {}, failed to create path: {} '.format(date, hour, pid, os.environ['LSB_QUEUE'], os.environ['LSB_HOSTS'], path))
    os.makedirs(path, exist_ok=True)
    os.chdir(path)
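The cluster runs an LSF environment. To run multiple realizations of my code, I use the "arrayjob" command, i.e. I use LSF to send multiple instances of the same code (100 in this case) to several different CPUs on different (or the same) hosts in the cluster.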
I am also attaching examples that show the errors described above. An example of behavior 2 is the following output file:
Stst progress = 10.0% after 37 seconds
Stst progress = 10.0% after 42 seconds
Stst progress = 20.0% after 64 seconds
Stst progress = 20.0% after 75 seconds
Stst progress = 30.0% after 109 seconds
Stst progress = 40.0% after 139 seconds
worker99 is 5.00% finished after 0.586 hours and will finish in approx 11.137 hours
worker99 is 5.00% finished after 0.691 hours and will finish in approx 13.130 hours
worker99 is 10.00% finished after 1.154 hours and will finish in approx 10.382 hours
worker99 is 10.00% finished after 1.340 hours and will finish in approx 12.062 hours
worker99 is 15.00% finished after 1.721 hours and will finish in approx 9.753 hours
worker99 is 15.00% finished after 1.990 hours and will finish in approx 11.275 hours
worker99 is 20.00% finished after 2.287 hours and will finish in approx 9.148 hours
worker99 is 20.00% finished after 2.633 hours and will finish in approx 10.532 hours
worker99 is 25.00% finished after 2.878 hours and will finish in approx 8.633 hours
worker99 is 25.00% finished after 3.275 hours and will finish in approx 9.826 hours
worker99 is 30.00% finished after 3.443 hours and will finish in approx 8.033 hours
worker99 is 30.00% finished after 3.921 hours and will finish in approx 9.149 hours
worker99 is 35.00% finished after 4.015 hours and will finish in approx 7.456 hours
worker99 is 35.00% finished after 4.566 hours and will finish in approx 8.480 hours
worker99 is 40.00% finished after 4.616 hours and will finish in approx 6.924 hours
worker99 is 45.00% finished after 5.182 hours and will finish in approx 6.334 hours
worker99 is 40.00% finished after 5.209 hours and will finish in approx 7.814 hours
worker99 is 50.00% finished after 5.750 hours and will finish in approx 5.750 hours
worker99 is 45.00% finished after 5.981 hours and will finish in approx 7.310 hours
worker99 is 55.00% finished after 6.322 hours and will finish in approx 5.173 hours
worker99 is 50.00% finished after 6.623 hours and will finish in approx 6.623 hours
worker99 is 60.00% finished after 6.927 hours and will finish in approx 4.618 hours
worker99 is 55.00% finished after 7.266 hours and will finish in approx 5.945 hours
worker99 is 65.00% finished after 7.513 hours and will finish in approx 4.046 hours
worker99 is 60.00% finished after 7.928 hours and will finish in approx 5.285 hours
worker99 is 70.00% finished after 8.079 hours and will finish in approx 3.463 hours
worker99 is 65.00% finished after 8.580 hours and will finish in approx 4.620 hours
worker99 is 75.00% finished after 8.644 hours and will finish in approx 2.881 hours
worker99 is 80.00% finished after 9.212 hours and will finish in approx 2.303 hours
worker99 is 70.00% finished after 9.227 hours and will finish in approx 3.954 hours
worker99 is 85.00% finished after 9.778 hours and will finish in approx 1.726 hours
worker99 is 75.00% finished after 9.882 hours and will finish in approx 3.294 hours
worker99 is 90.00% finished after 10.344 hours and will finish in approx 1.149 hours
worker99 is 80.00% finished after 10.532 hours and will finish in approx 2.633 hours
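A .txt file like this, which tracks the progress of the code, is normally created by each job separately and stored in that job's own directory. In this case, for some reason, two different jobs are writing to the same file. This is confirmed by looking at the other .txt file, which is written right after the directory is created and set as the working directory: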
In 2016-04-01, at time 02:11:51.851948, job ID: 373244, which was sent to
queue: new-short, working on host: cn129.wexac.weizmann.ac.il, has created
path: /home/labs/2016-04-02/N800_L1600_T10_DT0.5/worker99
In 2016-04-01, at time 02:12:09.968549, job ID: 373245, which was sent to
queue: new-medium, working on host: cn293.wexac.weizmann.ac.il, has created
path: /home/labs/2016-04-02/N800_L1600_T10_DT0.5/worker99
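The interleaving described above can also be spotted automatically. The following is only a minimal sketch, assuming progress lines have the same wording as in the output file shown earlier; the function name find_regressions and the regular expression are illustrative and not part of the original code. It flags every progress line whose percentage is lower than that of the line before it, which a single writer would not be expected to produce.

```python
import re

# Matches lines such as:
# "worker99 is 45.00% finished after 5.182 hours and will finish in approx ..."
PROGRESS_RE = re.compile(r"is (\d+(?:\.\d+)?)% finished after (\d+(?:\.\d+)?) hours")


def find_regressions(lines):
    """Return (previous_percent, percent, hours) for every progress line
    whose percentage is lower than that of the previous progress line."""
    regressions = []
    previous = None
    for line in lines:
        match = PROGRESS_RE.search(line)
        if match is None:
            continue  # not a progress line, skip it
        percent = float(match.group(1))
        hours = float(match.group(2))
        if previous is not None and percent < previous:
            regressions.append((previous, percent, hours))
        previous = percent
    return regressions


# Four consecutive lines taken from the output file above.
sample = [
    "worker99 is 40.00% finished after 4.616 hours and will finish in approx 6.924 hours",
    "worker99 is 45.00% finished after 5.182 hours and will finish in approx 6.334 hours",
    "worker99 is 40.00% finished after 5.209 hours and will finish in approx 7.814 hours",
    "worker99 is 50.00% finished after 5.750 hours and will finish in approx 5.750 hours",
]
print(find_regressions(sample))
```

This check only catches a percentage that goes backwards; repeated values (such as the two 10.00% lines) are not flagged, although they point to the same problem.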
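I would greatly appreciate any help in solving this problem, as it is holding up our research. If any additional details are needed to solve it, I will be happy to provide them. Thanks!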
Answer 0: (score: 0)
Looking at the error log you provided, it shows that two jobs (373244 and 373245) were sent to two different queues:
In 2016-04-01, at time 02:11:51.851948, job ID: 373244, which was sent to queue: new-short, ...
In 2016-04-01, at time 02:12:09.968549, job ID: 373245, which was sent to queue: new-medium, ...
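This suggests that the array job was submitted twice, to two separate queues. You may want to look at the code that submits the array job to make sure it runs only once and sends the jobs to a single queue.
Submitting the array job more than once would cause the behavior I believe you are seeing.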
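As a defensive measure on the Python side, the working directory could also be made unique per submission. This is only a sketch under stated assumptions: the helper claim_work_dir and the job<ID>/worker<index> layout are hypothetical, and on the cluster the two values would come from os.environ['LSB_JOBID'] and os.environ['LSB_JOBINDEX'] as in the question's code. Since the two submissions in the log carry different job IDs (373244 and 373245), putting the job ID in the path keeps them apart, and leaving out exist_ok=True makes a true duplicate fail loudly instead of silently sharing a directory.

```python
import os
import tempfile


def claim_work_dir(base, job_id, job_index):
    """Create and return a directory unique to (job_id, job_index).

    os.makedirs is called without exist_ok=True, so a second process that
    computes the same path gets FileExistsError instead of reusing it.
    """
    path = os.path.join(base, "job{}".format(job_id), "worker{}".format(job_index))
    os.makedirs(path)
    return path


# Demonstration in a temporary directory, using the two job IDs from the log.
base = tempfile.mkdtemp()
first = claim_work_dir(base, "373244", "99")
second = claim_work_dir(base, "373245", "99")
print(first != second)  # the two submissions get separate directories

try:
    claim_work_dir(base, "373244", "99")  # same job ID and index again
    duplicate_detected = False
except FileExistsError:
    duplicate_detected = True
print(duplicate_detected)
```

This does not replace fixing the double submission; it only prevents two submissions from writing into the same directory and makes a repeated claim visible as an error.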