I am trying to run a MapReduce job with the Python package mrjob on a cluster in Google Cloud Platform, like this:
python mr_script.py -r dataproc --cluster-id [CLUSTER-ID] [gs://DATAFILE_FOLDER]
I can successfully run the same script against the same data on local Hadoop (using the -r hadoop option) and get the correct results. However, the same job on Google Cloud Platform fails after about an hour with the following error message:
Waiting for job completion - sleeping 10.0 second(s)
link_stats-vagrant-20170430-021027-125936---step-00001-of-00001 => RUNNING
Waiting for job completion - sleeping 10.0 second(s)
link_stats-vagrant-20170430-021027-125936---step-00001-of-00001 => RUNNING
Waiting for job completion - sleeping 10.0 second(s)
link_stats-vagrant-20170430-021027-125936---step-00001-of-00001 => ERROR
Step 1 of 1 failed
After checking the log files on the worker nodes in Google Cloud Platform, I found the following messages in /var/log/hadoop-yarn/yarn-yarn-nodemanager-mrjob-w-13.log:
2017-04-30 02:58:48,213 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 3497 for container-id container_1493517964776_0001_01_004115: 447.5 MB of 10 GB physical memory used; 9.9 GB of 21 GB virtual memory used
2017-04-30 02:58:48,217 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 3097 for container-id container_1493517964776_0001_01_001385: 351.7 MB of 10 GB physical memory used; 9.9 GB of 21 GB virtual memory used
2017-04-30 02:58:51,222 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 3773 for container-id container_1493517964776_0001_01_006384: 349.3 MB of 10 GB physical memory used; 9.9 GB of 21 GB virtual memory used
2017-04-30 02:58:51,226 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 3660 for container-id container_1493517964776_0001_01_005935: 344.8 MB of 10 GB physical memory used; 9.9 GB of 21 GB virtual memory used
2017-04-30 02:58:51,230 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 3497 for container-id container_1493517964776_0001_01_004115: 447.5 MB of 10 GB physical memory used; 9.9 GB of 21 GB virtual memory used
2017-04-30 02:58:51,234 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 3097 for container-id container_1493517964776_0001_01_001385: 351.7 MB of 10 GB physical memory used; 9.9 GB of 21 GB virtual memory used
2017-04-30 02:58:52,803 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for appattempt_1493517964776_0001_000001 (auth:SIMPLE)
2017-04-30 02:58:52,809 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Stopping container with container Id: container_1493517964776_0001_01_001385
2017-04-30 02:58:52,809 INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=root IP=10.142.0.20 OPERATION=Stop Container Request TARGET=ContainerManageImpl RESULT=SUCCESS APPID=application_1493517964776_0001 CONTAINERID=container_1493517964776_0001_01_001385
2017-04-30 02:58:52,809 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1493517964776_0001_01_001385 transitioned from RUNNING to KILLING
2017-04-30 02:58:52,809 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Cleaning up container container_1493517964776_0001_01_001385
2017-04-30 02:58:52,810 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for appattempt_1493517964776_0001_000001 (auth:SIMPLE)
2017-04-30 02:58:52,812 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Stopping container with container Id: container_1493517964776_0001_01_004115
2017-04-30 02:58:52,812 INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=root IP=10.142.0.20 OPERATION=Stop Container Request TARGET=ContainerManageImpl RESULT=SUCCESS APPID=application_1493517964776_0001 CONTAINERID=container_1493517964776_0001_01_004115
2017-04-30 02:58:52,815 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code from container container_1493517964776_0001_01_001385 is : 143
2017-04-30 02:58:52,821 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for appattempt_1493517964776_0001_000001 (auth:SIMPLE)
2017-04-30 02:58:52,823 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Stopping container with container Id: container_1493517964776_0001_01_006384
2017-04-30 02:58:52,823 INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=root IP=10.142.0.20 OPERATION=Stop Container Request TARGET=ContainerManageImpl RESULT=SUCCESS APPID=application_1493517964776_0001 CONTAINERID=container_1493517964776_0001_01_006384
2017-04-30 02:58:52,826 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1493517964776_0001_01_004115 transitioned from RUNNING to KILLING
2017-04-30 02:58:52,826 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1493517964776_0001_01_001385 transitioned from KILLING to CONTAINER_CLEANEDUP_AFTER_KILL
2017-04-30 02:58:52,826 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1493517964776_0001_01_006384 transitioned from RUNNING to KILLING
2017-04-30 02:58:52,826 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Cleaning up container container_1493517964776_0001_01_004115
It looks like my job was killed by the container manager, but it does not seem to have been killed for exceeding the physical/virtual memory limits (please correct me if I'm wrong). I do see an exit code of 143.
Could you tell me why my job failed, and how to fix it or which settings in mrjob I should change so the job runs successfully (if it really is a memory issue)? Or where else should I look for clues to debug this? Thanks!
Edit: Here is the report of the failed job (under Cloud Dataproc -> Jobs, Status: Failed, Elapsed time: 48min 41sec):
Configuration:
Cluster: mrjob
Job type: Hadoop
Jar files:
Main class or jar: file:///usr/lib/hadoop-mapreduce/hadoop-streaming.jar
Arguments:
-files
gs://mrjob-us-east1-bb4f10dbae4d77dc/tmp/link_stats.vagrant.20170504.185601.135018/files/link_stats.py#link_stats.py
-mapper
python link_stats.py --step-num=0 --mapper
-reducer
python link_stats.py --step-num=0 --reducer
-input
gs://vc1/data/wikipedia/english
-output
gs://mrjob-us-east1-bb4f10dbae4d77dc/tmp/link_stats.vagrant.20170504.185601.135018/output/
Output:
17/05/04 19:40:52 INFO mapreduce.Job: map 74% reduce 6%
17/05/04 19:41:42 INFO mapreduce.Job: map 75% reduce 6%
17/05/04 19:41:42 INFO mapreduce.Job: Task Id : attempt_1493924193762_0001_m_000481_0, Status : FAILED
AttemptID:attempt_1493924193762_0001_m_000481_0 Timed out after 600 secs
17/05/04 19:41:42 INFO mapreduce.Job: Task Id : attempt_1493924193762_0001_m_000337_2, Status : FAILED
AttemptID:attempt_1493924193762_0001_m_000337_2 Timed out after 600 secs
17/05/04 19:41:43 INFO mapreduce.Job: map 74% reduce 6%
17/05/04 19:41:45 INFO mapreduce.Job: map 75% reduce 6%
17/05/04 19:42:12 INFO mapreduce.Job: Task Id : attempt_1493924193762_0001_m_000173_2, Status : FAILED
AttemptID:attempt_1493924193762_0001_m_000173_2 Timed out after 600 secs
17/05/04 19:42:40 INFO mapreduce.Job: map 76% reduce 6%
17/05/04 19:43:26 INFO mapreduce.Job: map 77% reduce 6%
17/05/04 19:44:16 INFO mapreduce.Job: map 78% reduce 6%
17/05/04 19:44:42 INFO mapreduce.Job: map 100% reduce 100%
17/05/04 19:44:47 INFO mapreduce.Job: Job job_1493924193762_0001 failed with state FAILED due to: Task failed task_1493924193762_0001_m_000161
Job failed as tasks failed. failedMaps:1 failedReduces:0
17/05/04 19:44:47 INFO mapreduce.Job: Counters: 45
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=110101249
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
GS: Number of bytes read=8815472899
GS: Number of bytes written=0
GS: Number of read operations=0
GS: Number of large read operations=0
GS: Number of write operations=0
HDFS: Number of bytes read=57120
HDFS: Number of bytes written=0
HDFS: Number of read operations=560
HDFS: Number of large read operations=0
HDFS: Number of write operations=0
Job Counters
Failed map tasks=39
Killed map tasks=192
Killed reduce tasks=196
Launched map tasks=685
Launched reduce tasks=48
Other local map tasks=38
Rack-local map tasks=647
Total time spent by all maps in occupied slots (ms)=1015831401
Total time spent by all reduces in occupied slots (ms)=559653642
Total time spent by all map tasks (ms)=338610467
Total time spent by all reduce tasks (ms)=93275607
Total vcore-milliseconds taken by all map tasks=338610467
Total vcore-milliseconds taken by all reduce tasks=186551214
Total megabyte-milliseconds taken by all map tasks=1040211354624
Total megabyte-milliseconds taken by all reduce tasks=573085329408
Map-Reduce Framework
Map input records=594556238
Map output records=560
Map output bytes=35862346
Map output materialized bytes=36523515
Input split bytes=57120
Combine input records=0
Spilled Records=560
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=115614
CPU time spent (ms)=272576210
Physical memory (bytes) snapshot=268956098560
Virtual memory (bytes) snapshot=2461942099968
Total committed heap usage (bytes)=244810514432
File Input Format Counters
Bytes Read=8815472899
17/05/04 19:44:47 ERROR streaming.StreamJob: Job not successful!
Streaming Command Failed!
Job output is complete
Answer 0 (score: 0)
I'm only posting this as an answer because I don't have enough reputation to comment, but you should try monitoring the status of each map task in the ResourceManager GUI while the job is running.
It may be that a single mapper is failing (for example, a corrupted line causing an unhandled exception) while mapreduce.map.failures.maxpercent = 0, which terminates the whole job.
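For reference, a minimal sketch of how an mrjob mapper could skip corrupt lines instead of dying on an unhandled exception (the class name and parsing logic below are illustrative, not taken from the original link_stats.py):

from mrjob.job import MRJob

class MRLinkStats(MRJob):  # hypothetical stand-in for the real job class

    def mapper(self, _, line):
        try:
            # the real parsing / emit logic from link_stats.py would go here
            fields = line.split('\t')
            yield fields[0], 1
        except Exception:
            # count and skip bad input lines instead of letting the task fail
            self.increment_counter('warnings', 'bad_input_line', 1)

if __name__ == '__main__':
    MRLinkStats.run()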
Answer 1 (score: 0)
Error code 143 usually indicates an OOME (out-of-memory error). You should set the memory size for the mappers and reducers with these options, once you have determined how much memory your application actually uses:
-Dmapreduce.map.memory.mb=1024 -Dmapreduce.reduce.memory.mb=1024
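Since the job is launched through mrjob rather than with hadoop jar directly, one way to pass these Hadoop properties is mrjob's jobconf mechanism; a minimal sketch (the 3072 values are placeholders, not measured for this job):

from mrjob.job import MRJob

class MRLinkStats(MRJob):  # hypothetical stand-in for the real job class

    # standard Hadoop properties; size the values to what your tasks actually need
    JOBCONF = {
        'mapreduce.map.memory.mb': '3072',
        'mapreduce.reduce.memory.mb': '3072',
    }

The same properties should also be settable on the command line, e.g. python mr_script.py -r dataproc --cluster-id [CLUSTER-ID] --jobconf mapreduce.map.memory.mb=3072 gs://DATAFILE_FOLDER.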
Another thing to consider is how your data is split. Sometimes the data is skewed, and one mapper ends up with, say, 3x as many records as the rest. You can check for this by looking at your data folder and making sure the files are all roughly the same size.
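A rough sketch of such a check (assuming the google-cloud-storage client library; the bucket and prefix below come from the -input argument above and may need adjusting):

from google.cloud import storage

client = storage.Client()
bucket = client.bucket('vc1')
# list the input files and print them largest-first to spot outliers
blobs = bucket.list_blobs(prefix='data/wikipedia/english')
for blob in sorted(blobs, key=lambda b: b.size or 0, reverse=True):
    print('{:>15,d}  {}'.format(blob.size or 0, blob.name))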