I am trying to run a map reduce job on Amazon EMR using Python mrjob, and I am running into problems installing the dependencies.
My mrjob code:
from mrjob.job import MRJob
import re
from normalize import *
from compute_features import *
#Some code
The normalize and compute_features files have many dependencies, including numpy, scipy, sklearn, fiona, ...
My mrjob.conf file:
runners:
  emr:
    aws_access_key_id: xxxx
    aws_secret_access_key: xxxx
    aws_region: eu-west-1
    ec2_key_pair: EMR
    ec2_key_pair_file: /Users/antoinerigoureau/Documents/emr.pem
    ssh_tunnel: true
    ec2_instance_type: m3.xlarge
    ec2_master_instance_type: m3.xlarge
    num_ec2_instances: 1
    cmdenv:
      TZ: Europe/Paris
    bootstrap_python: false
    bootstrap:
      - curl -s https://s3-eu-west-1.amazonaws.com/data-essence/utils/bootstrap.sh | sudo bash -s
      - source /usr/local/ripple/venv/bin/activate
      - sudo pip install -r req.txt#
    upload_archives:
      - /Users/antoinerigoureau/Documents/Essence/data/geoData/urba_france.zip#data
    upload_files:
      - /Users/antoinerigoureau/Documents/Essence/Source/venv_parallel/normalize.py
      - /Users/antoinerigoureau/Documents/Essence/Source/venv_parallel/compute_features.py
    python_bin: /usr/local/ripple/venv/bin/python3
    enable_emr_debugging: True
    setup:
      - source /usr/local/ripple/venv/bin/activate
  local:
    upload_archives:
      - /Users/antoinerigoureau/Documents/Essence/data/geoData/urba_france.zip#data
My bootstrap.sh file is:
#!/bin/bash
set -e
set -x
yum update -y
# install yum packages
yum install -y gcc \
    geos-devel \
    gcc-c++ \
    atlas-sse3-devel \
    lapack-devel \
    libpng-devel \
    freetype-devel \
    zlib-devel \
    ncurses-devel \
    readline-devel \
    patch \
    make \
    libtool \
    curl \
    openssl-devel \
    screen
pushd $HOME
# install python
rm -rf Python-3.5.1.tgz
wget https://www.python.org/ftp/python/3.5.1/Python-3.5.1.tgz &&\
tar -xzvf Python-3.5.1.tgz
pushd Python-3.5.1
./configure
make -j 4
make install
popd
export PATH=/usr/local/bin:$PATH
echo export PATH=/usr/local/bin:\$PATH > /etc/profile.d/usr_local_path.sh
chmod +x /etc/profile.d/usr_local_path.sh
pip3.5 install --upgrade pip virtualenv
mkdir -p /usr/local/ripple/venv
virtualenv /usr/local/ripple/venv
source /usr/local/ripple/venv/bin/activate
# install gdal
rm -rf gdal191.zip
wget http://download.osgeo.org/gdal/gdal191.zip &&\
unzip gdal191.zip
#
# Here is the trick I had to add to get around the following -fPIC error
# /usr/bin/ld: /root/gdal-1.9.1/frmts/o/.libs/aaigriddataset.o: relocation R_X86_64_32S against `vtable for AAIGRasterBand' can not be used when making a shared object; recompile with -fPIC
#
pushd gdal-1.9.1
./configure
CC="gcc -fPIC" CXX="g++ -fPIC" make -j4
make install
popd
export LD_LIBRARY_PATH=/usr/local/lib:${LD_LIBRARY_PATH}
echo export LD_LIBRARY_PATH=/usr/local/lib:\$LD_LIBRARY_PATH > /etc/profile.d/gdal_library_path.sh
chmod +x /etc/profile.d/gdal_library_path.sh
But my job fails, with the following output:
Created new cluster j-T8UUFEZILJYQ
Waiting for step 1 of 1 (s-3SOCF1ZPWJ575) to complete...
PENDING (cluster is STARTING)
PENDING (cluster is STARTING)
PENDING (cluster is STARTING)
PENDING (cluster is STARTING)
PENDING (cluster is BOOTSTRAPPING: Running bootstrap actions)
PENDING (cluster is BOOTSTRAPPING: Running bootstrap actions)
PENDING (cluster is BOOTSTRAPPING: Running bootstrap actions)
PENDING (cluster is BOOTSTRAPPING: Running bootstrap actions)
PENDING (cluster is BOOTSTRAPPING: Running bootstrap actions)
PENDING (cluster is BOOTSTRAPPING: Running bootstrap actions)
PENDING (cluster is BOOTSTRAPPING: Running bootstrap actions)
PENDING (cluster is BOOTSTRAPPING: Running bootstrap actions)
PENDING (cluster is BOOTSTRAPPING: Running bootstrap actions)
PENDING (cluster is BOOTSTRAPPING: Running bootstrap actions)
PENDING (cluster is BOOTSTRAPPING: Running bootstrap actions)
PENDING (cluster is BOOTSTRAPPING: Running bootstrap actions)
PENDING (cluster is BOOTSTRAPPING: Running bootstrap actions)
Opening ssh tunnel to resource manager...
Connect to resource manager at: http://localhost:40199/cluster
RUNNING for 16.2s
Unable to connect to resource manager
RUNNING for 48.8s
FAILED
Cluster j-T8UUFEZILJYQ is TERMINATING: Shut down as step failed
Attempting to fetch counters from logs...
Looking for step log in /mnt/var/log/hadoop/steps/s-3SOCF1ZPWJ575 on ec2-54-194-248-128.eu-west-1.compute.amazonaws.com...
Parsing step log: ssh://ec2-54-194-248-128.eu-west-1.compute.amazonaws.com/mnt/var/log/hadoop/steps/s-3SOCF1ZPWJ575/syslog
Counters: 9
Job Counters
Data-local map tasks=1
Failed map tasks=4
Launched map tasks=4
Other local map tasks=3
Total megabyte-seconds taken by all map tasks=33988320
Total time spent by all map tasks (ms)=23603
Total time spent by all maps in occupied slots (ms)=1062135
Total time spent by all reduces in occupied slots (ms)=0
Total vcore-seconds taken by all map tasks=23603
Scanning logs for probable cause of failure...
Looking for task logs in /mnt/var/log/hadoop/userlogs/application_1463748945334_0001 on ec2-54-194-248-128.eu-west-1.compute.amazonaws.com and task/core nodes...
Parsing task syslog: ssh://ec2-54-194-248-128.eu-west-1.compute.amazonaws.com/mnt/var/log/hadoop/userlogs/application_1463748945334_0001/container_1463748945334_0001_01_000006/syslog
Parsing task stderr: ssh://ec2-54-194-248-128.eu-west-1.compute.amazonaws.com/mnt/var/log/hadoop/userlogs/application_1463748945334_0001/container_1463748945334_0001_01_000006/stderr
Probable cause of failure:
R/W/S=1749/0/0 in:NA [rec/s] out:NA [rec/s]
minRecWrittenToEnableSkip_=9223372036854775807 HOST=null
USER=hadoop
HADOOP_USER=null
last tool output: |null|
java.io.IOException: Broken pipe
at java.io.FileOutputStream.writeBytes(Native Method)
at java.io.FileOutputStream.write(FileOutputStream.java:345)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
at java.io.DataOutputStream.write(DataOutputStream.java:107)
at org.apache.hadoop.streaming.io.TextInputWriter.writeUTF8(TextInputWriter.java:72)
at org.apache.hadoop.streaming.io.TextInputWriter.writeValue(TextInputWriter.java:51)
at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:106)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:65)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:432)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170)
(from lines 48-72 of ssh://ec2-54-194-248-128.eu-west-1.compute.amazonaws.com/mnt/var/log/hadoop/userlogs/application_1463748945334_0001/container_1463748945334_0001_01_000006/syslog)
caused by:
+ /usr/local/ripple/venv/bin/python3 test_mrjob.py --step-num=0 --mapper
Traceback (most recent call last):
File "test_mrjob.py", line 2, in <module>
import numpy as np
ImportError: No module named 'numpy'
(from lines 31-35 of ssh://ec2-54-194-248-128.eu-west-1.compute.amazonaws.com/mnt/var/log/hadoop/userlogs/application_1463748945334_0001/container_1463748945334_0001_01_000006/stderr)
while reading input from s3://data-essence/databerries-01/extract_essence_000000000001.gz
Step 1 of 1 failed
Killing our SSH tunnel (pid 1288)
Terminating cluster: j-T8UUFEZILJYQ
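For what it's worth, the "Broken pipe" in the Java stack trace above is a symptom, not the cause: Hadoop Streaming writes input records to the mapper's stdin, and once the mapper process dies (here, on the failed import numpy), the writer gets EPIPE. A toy stand-in with no Hadoop involved reproduces the same pattern:

```python
# Toy reproduction (no Hadoop involved) of the "Broken pipe" pattern:
# the writer keeps feeding stdin after the reader process has died,
# just as Hadoop Streaming did once the mapper crashed on import.
import subprocess
import sys

# Stand-in for the failing mapper: it exits immediately with an error,
# the way the `import numpy` ImportError killed test_mrjob.py.
proc = subprocess.Popen(
    [sys.executable, "-c", "raise SystemExit(1)"],
    stdin=subprocess.PIPE,
)
proc.wait()  # the "mapper" is already gone

try:
    # The parent plays Hadoop's role and keeps writing input records;
    # enough data to overflow the pipe buffer guarantees the error.
    for _ in range(20000):
        proc.stdin.write(b"input record\n")
    proc.stdin.flush()
    result = "no error"
except BrokenPipeError:
    result = "broken pipe"

# Close without letting a second EPIPE escape at interpreter shutdown.
try:
    proc.stdin.close()
except BrokenPipeError:
    pass

print(result)
```

So the stack trace points at the Java side, but the real failure is whatever made the Python mapper exit, which is what the "caused by" stderr excerpt shows.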
I previously tested all of my bootstrap actions on a VM, where they seemed to work fine. Any clue as to what is going on?
Update: I tried running a basic mrjob example with only an extra numpy import and the same setup process, and got the same error: the job fails because it cannot import numpy.
Answer 0 (score: 0):
I finally solved my problem: I had to change this line in my mrjob.conf file:
- sudo pip install -r req.txt#
to:
- sudo /usr/local/ripple/venv/bin/pip3 install -r req.txt#
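For anyone hitting the same thing: plain sudo pip resolves to the system Python's pip, so the packages land outside the virtualenv that python_bin points at, and the interpreter that actually runs the job never sees them. A small stdlib-only script (a sketch; the venv path in the comment is the one from this question) makes that kind of mismatch visible:

```python
# Stdlib-only sanity check: show which interpreter is running and
# whether it can see a given package, without importing the package.
import importlib.util
import sys

def can_import(module_name):
    """True if module_name is importable by *this* interpreter."""
    return importlib.util.find_spec(module_name) is not None

if __name__ == "__main__":
    # On the cluster, run this with /usr/local/ripple/venv/bin/python3
    # (the python_bin from mrjob.conf) to check that it matches the
    # interpreter whose pip installed the packages.
    print(sys.executable)
    print("numpy importable:", can_import("numpy"))
```

Running it once with the system python3 and once with the venv's python3 shows immediately which interpreter actually has numpy installed.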