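I can't get HTCondor to run my jobs. I have been hacking at this for a while and have started trying random things, so I figured I should ask for guidance.
I installed HTCondor 8.2.9 from the website on Ubuntu 15.04. Here is the relevant information about my system.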
$ cat /etc/condor/condor_config.local
#
# Local Condor Config
#
CONDOR_HOST = aidan-laptop
DAEMON_LIST = MASTER, STARTD, SCHEDD, COLLECTOR, NEGOTIATOR
#FLOCK_TO = aidan-laptop
FLOCK_FROM = aidan-laptop localhost
My current hostname:
$ hostname
aidan-laptop
My defined hosts:
$ cat /etc/hosts
127.0.0.1 localhost
127.0.1.1 aidan-laptop
# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
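To double-check how that file maps my hostname, here is a minimal sketch of a plain hosts-file lookup. The helper `lookup_in_hosts` is my own hypothetical function, not part of HTCondor, and it only mimics a simple first-match lookup rather than the full system resolver.

```python
# The IPv4 part of the /etc/hosts content shown above, inlined so the
# sketch does not depend on the machine it runs on.
HOSTS_TEXT = """\
127.0.0.1 localhost
127.0.1.1 aidan-laptop
# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
"""


def lookup_in_hosts(hosts_text, name):
    """Return the first address whose entry lists `name`, or None."""
    for raw in hosts_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        address, *names = line.split()
        if name in names:
            return address
    return None


print(lookup_in_hosts(HOSTS_TEXT, "aidan-laptop"))
```

With the file above, the name aidan-laptop maps to the loopback address 127.0.1.1.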
My current status:
$ condor_status
Name OpSys Arch State Activity LoadAv Mem ActvtyTime
slot1@aidan-laptop LINUX X86_64 Unclaimed Idle 0.090 1976 0+00:04:39
slot2@aidan-laptop LINUX X86_64 Unclaimed Idle 0.000 1976 0+00:05:05
slot3@aidan-laptop LINUX X86_64 Unclaimed Idle 0.000 1976 0+00:05:06
slot4@aidan-laptop LINUX X86_64 Unclaimed Idle 0.000 1976 0+00:05:07
Total Owner Claimed Unclaimed Matched Preempting Backfill
X86_64/LINUX 4 0 0 4 0 0 0
Total 4 0 0 4 0 0 0
Looking at the queue:
$ condor_q
-- Submitter: aidan-laptop : <192.168.1.151:39444> : aidan-laptop
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
1.0 aidan 8/26 09:27 0+00:00:00 I 0 0.0 hello.sh
1.1 aidan 8/26 09:27 0+00:00:00 I 0 0.0 hello.sh
1.2 aidan 8/26 09:27 0+00:00:00 I 0 0.0 hello.sh
3 jobs; 0 completed, 0 removed, 3 idle, 0 running, 0 held, 0 suspended
$ date
Wed Aug 26 09:52:33 PDT 2015
$ lsb_release -r
Release: 15.04
Trying to analyze the jobs hangs for a minute and then prints an error:
$ date; condor_q -pool 1.00 -analyze; date
Wed Aug 26 09:58:01 PDT 2015
Error: Could not fetch startd ads
Wed Aug 26 09:59:01 PDT 2015
My StartLog, captured from a stop through a fresh start:
$ sudo service condor stop
$ sudo rm /var/log/condor/StartLog
$ date; sudo service condor start
Wed Aug 26 10:01:02 PDT 2015
$ sleep 1m; date; condor_status
Wed Aug 26 10:02:19 PDT 2015
Name OpSys Arch State Activity LoadAv Mem ActvtyTime
slot1@aidan-laptop LINUX X86_64 Unclaimed Idle 0.160 1976 0+00:00:04
slot2@aidan-laptop LINUX X86_64 Unclaimed Idle 0.000 1976 0+00:00:31
slot3@aidan-laptop LINUX X86_64 Unclaimed Idle 0.000 1976 0+00:00:32
slot4@aidan-laptop LINUX X86_64 Unclaimed Idle 0.000 1976 0+00:00:33
Total Owner Claimed Unclaimed Matched Preempting Backfill
X86_64/LINUX 4 0 0 4 0 0 0
Total 4 0 0 4 0 0 0
$ date; cat /var/log/condor/StartLog
Wed Aug 26 10:02:35 PDT 2015
08/26/15 10:01:03 ******************************************************
08/26/15 10:01:03 ** condor_startd (CONDOR_STARTD) STARTING UP
08/26/15 10:01:03 ** /usr/sbin/condor_startd
08/26/15 10:01:03 ** SubsystemInfo: name=STARTD type=STARTD(7) class=DAEMON(1)
08/26/15 10:01:03 ** Configuration: subsystem:STARTD local:<NONE> class:DAEMON
08/26/15 10:01:03 ** $CondorVersion: 8.2.9 Aug 12 2015 BuildID: 335399 $
08/26/15 10:01:03 ** $CondorPlatform: x86_64_Ubuntu14 $
08/26/15 10:01:03 ** PID = 2487
08/26/15 10:01:03 ** Log last touched time unavailable (No such file or directory)
08/26/15 10:01:03 ******************************************************
08/26/15 10:01:03 Using config source: /etc/condor/condor_config
08/26/15 10:01:03 Using local config sources:
08/26/15 10:01:03 /etc/condor/condor_config.local
08/26/15 10:01:03 config Macros = 60, Sorted = 60, StringBytes = 1596, TablesBytes = 2208
08/26/15 10:01:03 CLASSAD_CACHING is ENABLED
08/26/15 10:01:03 Daemon Log is logging: D_ALWAYS D_ERROR
08/26/15 10:01:03 DaemonCore: command socket at <192.168.1.151:47358>
08/26/15 10:01:03 DaemonCore: private command socket at <192.168.1.151:47358>
08/26/15 10:01:09 VM-gahp server reported an internal error
08/26/15 10:01:09 VM universe will be tested to check if it is available
08/26/15 10:01:09 History file rotation is enabled.
08/26/15 10:01:09 Maximum history file size is: 20971520 bytes
08/26/15 10:01:09 Number of rotated history files is: 2
08/26/15 10:01:09 Allocating auto shares for slot type 0: Cpus: auto, Memory: auto, Swap: auto, Disk: auto
slot type 0: Cpus: 1.000000, Memory: 1976, Swap: 25.00%, Disk: 25.00%
slot type 0: Cpus: 1.000000, Memory: 1976, Swap: 25.00%, Disk: 25.00%
slot type 0: Cpus: 1.000000, Memory: 1976, Swap: 25.00%, Disk: 25.00%
slot type 0: Cpus: 1.000000, Memory: 1976, Swap: 25.00%, Disk: 25.00%
08/26/15 10:01:09 slot1: New machine resource allocated
08/26/15 10:01:09 Setting up slot pairings
08/26/15 10:01:09 slot2: New machine resource allocated
08/26/15 10:01:09 Setting up slot pairings
08/26/15 10:01:09 slot3: New machine resource allocated
08/26/15 10:01:09 Setting up slot pairings
08/26/15 10:01:09 slot4: New machine resource allocated
08/26/15 10:01:09 Setting up slot pairings
08/26/15 10:01:09 CronJobList: Adding job 'mips'
08/26/15 10:01:09 CronJobList: Adding job 'kflops'
08/26/15 10:01:09 CronJob: Initializing job 'mips' (/usr/lib/condor/libexec/condor_mips)
08/26/15 10:01:09 CronJob: Initializing job 'kflops' (/usr/lib/condor/libexec/condor_kflops)
08/26/15 10:01:09 slot1: State change: IS_OWNER is false
08/26/15 10:01:09 slot1: Changing state: Owner -> Unclaimed
08/26/15 10:01:09 State change: RunBenchmarks is TRUE
08/26/15 10:01:09 slot1: Changing activity: Idle -> Benchmarking
08/26/15 10:01:09 BenchMgr:StartBenchmarks()
08/26/15 10:01:09 slot2: State change: IS_OWNER is false
08/26/15 10:01:09 slot2: Changing state: Owner -> Unclaimed
08/26/15 10:01:09 State change: RunBenchmarks is TRUE
08/26/15 10:01:09 slot2: Changing activity: Idle -> Benchmarking
08/26/15 10:01:09 slot2: Changing activity: Benchmarking -> Idle
08/26/15 10:01:09 slot3: State change: IS_OWNER is false
08/26/15 10:01:09 slot3: Changing state: Owner -> Unclaimed
08/26/15 10:01:09 State change: RunBenchmarks is TRUE
08/26/15 10:01:09 slot3: Changing activity: Idle -> Benchmarking
08/26/15 10:01:09 slot3: Changing activity: Benchmarking -> Idle
08/26/15 10:01:09 slot4: State change: IS_OWNER is false
08/26/15 10:01:09 slot4: Changing state: Owner -> Unclaimed
08/26/15 10:01:09 State change: RunBenchmarks is TRUE
08/26/15 10:01:09 slot4: Changing activity: Idle -> Benchmarking
08/26/15 10:01:09 slot4: Changing activity: Benchmarking -> Idle
08/26/15 10:01:35 State change: benchmarks completed
08/26/15 10:01:35 slot1: Changing activity: Benchmarking -> Idle
Please let me know if more information is needed.
Update:
I found this in the NegotiatorLog. I can't figure out what it means.
08/26/15 11:20:15 ---------- Started Negotiation Cycle ----------
08/26/15 11:20:15 Phase 1: Obtaining ads from collector ...
08/26/15 11:20:15 Getting startd private ads ...
08/26/15 11:20:15 condor_read() failed: recv(fd=8) returned -1, errno = 104 Connection reset by peer, reading 5 bytes from collector at <127.0.1.1:9618>.
08/26/15 11:20:15 IO: Failed to read packet header
08/26/15 11:20:15 Couldn't fetch ads: communication error
08/26/15 11:20:15 Aborting negotiation cycle
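To make sure I am reading that failing line correctly, here is a small sketch that pulls the collector address out of it and checks whether it is a loopback address. The names `collector_endpoint` and `ENDPOINT_RE` are my own and are not part of HTCondor, and the pattern assumes the IPv4 `<ip:port>` form seen in these logs.

```python
import ipaddress
import re

# The failing line from the NegotiatorLog above, copied verbatim.
LOG_LINE = (
    "08/26/15 11:20:15 condor_read() failed: recv(fd=8) returned -1, "
    "errno = 104 Connection reset by peer, reading 5 bytes from collector "
    "at <127.0.1.1:9618>."
)

# Matches the "<ip:port>" form used in the logs (IPv4 only, by assumption).
ENDPOINT_RE = re.compile(r"<(\d{1,3}(?:\.\d{1,3}){3}):(\d+)>")


def collector_endpoint(line):
    """Return (ip, port) from the first <ip:port> in `line`, or None."""
    match = ENDPOINT_RE.search(line)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


endpoint = collector_endpoint(LOG_LINE)
if endpoint is not None:
    ip, port = endpoint
    # Print the address, the port, and whether the address is loopback.
    print(ip, port, ipaddress.ip_address(ip).is_loopback)
```

So the negotiator is contacting the collector at the loopback address 127.0.1.1, while condor_q and the StartLog above show the daemons on 192.168.1.151. I don't know whether that difference matters.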