I am seeing some strange behavior while executing my program. Let me explain.
I have written a multiprocessing class, like this:
from multiprocessing import Process

class ProcessManager:
    def __init__(self, spark, logger):
        self.spark = spark
        self.logger = logger

    def applyMultiProcessExecution(self, func_arguments, targetFunction, iterableList):
        self.logger.info("Function Arguments : {}".format(func_arguments))
        jobs = []
        for x in iterableList:
            try:
                p = Process(target=targetFunction, args=(x,), kwargs=func_arguments)
                jobs.append(p)
                p.start()
            except:
                raise RuntimeError("Unable to create process for GL : {}".format(x))
        for job in jobs:
            job.join()
Now, I have a method named detect:
def detect(self, gl, inputFolder, modelFolder, outputFolder, readWriteUtils, region):
    # This reads data from inputFolder and modelFolder using readWriteUtils, based on gl and region
    # Does computation over the data
    # Writes data to outputFolder
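Just for illustration, a simplified stand-in for detect with the same shape would look roughly like this (the file naming and the plain file I/O are assumptions I made for the sketch; the real method goes through readWriteUtils and does the actual model computation):

import os

def detect(self, gl, inputFolder, modelFolder, outputFolder, readWriteUtils, region):
    # Stand-in for the readWriteUtils-based reads keyed on gl and region.
    with open(os.path.join(inputFolder, "{}_{}.txt".format(region, gl))) as f:
        rows = f.read().splitlines()

    # Dummy computation standing in for the real model scoring.
    result = len(rows)

    # Write one output file per GL, as the real method does.
    with open(os.path.join(outputFolder, "{}_{}.out".format(region, gl)), "w") as f:
        f.write(str(result))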
Now, I call this method like this:
pm = ProcessManager(spark=spark, logger=logger)
pm.applyMultiProcessExecution(func_arguments=arguments,
                              targetFunction=detect,
                              iterableList=GL_LIST)
This runs on an EMR cluster as a spark-submit step.
Now, here is the strange part. Sometimes this executes perfectly within a minute. Sometimes it gets stuck processing indefinitely, and when I cancel it with CTRL+C I can see that the data has been computed, but the process never shuts down on its own.
On the Spark side, my controller log looks like this:
2019-01-01T08:22:18.145Z INFO Ensure step 23 jar file command-runner.jar
2019-01-01T08:22:18.145Z INFO StepRunner: Created Runner for step 23
INFO startExec 'hadoop jar /var/lib/aws/emr/step-runner/hadoop-jars/command-runner.jar spark-submit --py-files /mnt/road-artifacts/ROAD.zip /mnt/road-artifacts/com/amazon/road/model-executor/PCAModelTestExecution.py --inputFolder=/tmp/split_data --modelFolder=/tmp/model --outputFolder=/tmp/output --region=NA'
INFO Environment:
PATH=/sbin:/usr/sbin:/bin:/usr/bin:/usr/local/sbin:/opt/aws/bin
LESS_TERMCAP_md=[01;38;5;208m
LESS_TERMCAP_me=[0m
HISTCONTROL=ignoredups
LESS_TERMCAP_mb=[01;31m
AWS_AUTO_SCALING_HOME=/opt/aws/apitools/as
UPSTART_JOB=rc
LESS_TERMCAP_se=[0m
HISTSIZE=1000
HADOOP_ROOT_LOGGER=INFO,DRFA
JAVA_HOME=/etc/alternatives/jre
AWS_DEFAULT_REGION=us-east-1
AWS_ELB_HOME=/opt/aws/apitools/elb
LESS_TERMCAP_us=[04;38;5;111m
EC2_HOME=/opt/aws/apitools/ec2
TERM=linux
runlevel=3
LANG=en_US.UTF-8
AWS_CLOUDWATCH_HOME=/opt/aws/apitools/mon
MAIL=/var/spool/mail/hadoop
LESS_TERMCAP_ue=[0m
LOGNAME=hadoop
PWD=/
LANGSH_SOURCED=1
HADOOP_CLIENT_OPTS=-Djava.io.tmpdir=/mnt/var/lib/hadoop/steps/s-3INAXV6LAS9A4/tmp
_=/etc/alternatives/jre/bin/java
CONSOLETYPE=serial
RUNLEVEL=3
LESSOPEN=||/usr/bin/lesspipe.sh %s
previous=N
UPSTART_EVENTS=runlevel
AWS_PATH=/opt/aws
USER=hadoop
UPSTART_INSTANCE=
PREVLEVEL=N
HADOOP_LOGFILE=syslog
HOSTNAME=ip-172-32-0-233
HADOOP_LOG_DIR=/mnt/var/log/hadoop/steps/s-3INAXV6LAS9A4
EC2_AMITOOL_HOME=/opt/aws/amitools/ec2
SHLVL=5
HOME=/home/hadoop
HADOOP_IDENT_STRING=hadoop
INFO redirectOutput to /mnt/var/log/hadoop/steps/s-3INAXV6LAS9A4/stdout
INFO redirectError to /mnt/var/log/hadoop/steps/s-3INAXV6LAS9A4/stderr
INFO Working dir /mnt/var/lib/hadoop/steps/s-3INAXV6LAS9A4
INFO ProcessRunner started child process 36375 :
hadoop 36375 6380 0 08:22 ? 00:00:00 /etc/alternatives/jre/bin/java -Xmx1000m -server -XX:OnOutOfMemoryError=kill -9 %p -Dhadoop.log.dir=/mnt/var/log/hadoop/steps/s-3INAXV6LAS9A4 -Dhadoop.log.file=syslog -Dhadoop.home.dir=/usr/lib/hadoop -Dhadoop.id.str=hadoop -Dhadoop.root.logger=INFO,DRFA -Djava.library.path=:/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -Djava.io.tmpdir=/mnt/var/lib/hadoop/steps/s-3INAXV6LAS9A4/tmp -Dhadoop.security.logger=INFO,NullAppender -Dsun.net.inetaddr.ttl=30 org.apache.hadoop.util.RunJar /var/lib/aws/emr/step-runner/hadoop-jars/command-runner.jar spark-submit --py-files /mnt/road-artifacts/ROAD.zip /mnt/road-artifacts/com/amazon/road/model-executor/PCAModelTestExecution.py --inputFolder=/tmp/split_data --modelFolder=/tmp/model --outputFolder=/tmp/output --region=NA
2019-01-01T08:22:22.152Z INFO HadoopJarStepRunner.Runner: startRun() called for s-3INAXV6LAS9A4 Child Pid: 36375
INFO Synchronously wait child process to complete : hadoop jar /var/lib/aws/emr/step-runner/hadoop-...
INFO Process still running
INFO Process still running
INFO Process still running
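When it hangs like this, the only extra visibility I can think of adding is a timed join with exit-code logging, roughly like the sketch below. This is not what runs today; it reuses the Process import from the class above, and the 10-minute budget is an arbitrary number I picked for the sketch:

import time

def applyMultiProcessExecution(self, func_arguments, targetFunction, iterableList):
    self.logger.info("Function Arguments : {}".format(func_arguments))
    jobs = []
    for x in iterableList:
        p = Process(target=targetFunction, args=(x,), kwargs=func_arguments)
        jobs.append((x, p))
        p.start()

    deadline = time.time() + 600  # arbitrary 10-minute budget, for diagnostics only
    for gl, job in jobs:
        job.join(timeout=max(0, deadline - time.time()))
        if job.is_alive():
            # The child is still running after the budget; log it and force-kill it
            # so the step itself can finish.
            self.logger.error("Process for GL {} still alive after timeout".format(gl))
            job.terminate()
            job.join()
        else:
            self.logger.info("Process for GL {} exited with code {}".format(gl, job.exitcode))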
I have read about deadlocks related to multiprocessing queues, but since I am not putting anything onto a queue that needs to be drained, that should not apply here. This seems very strange to me because I cannot pin down the cause. Can anyone suggest what might be going on?
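For reference, the Queue deadlock pattern I read about looks roughly like this (a minimal illustration, not code from my job; worker and q are just names for the example). The child cannot exit until its buffered queue data is drained, while the parent blocks on join() before ever calling get():

from multiprocessing import Process, Queue

def worker(q):
    # The child puts a large result on the queue; it cannot terminate until
    # the parent drains the underlying pipe.
    q.put("x" * 10000000)

if __name__ == "__main__":
    q = Queue()
    p = Process(target=worker, args=(q,))
    p.start()
    p.join()        # deadlocks: joining before q.get() while the pipe buffer is full
    print(q.get())  # never reached

Since my target function only writes files and nothing goes through a Queue or Pipe, I assumed this is not my problem, but I may be missing some other way the same kind of hang can happen.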