I am trying to run a map reduce job on Amazon EMR using Python mrjob, and I am running into problems installing the dependencies.
My mrjob code:
from mrjob.job import MRJob
import re
from normalize import *
from compute_features import *
#Some code
The normalize and compute_features files have many dependencies, including numpy, scipy, sklearn, fiona, ...
My mrjob.conf file:
runners:
  emr:
    aws_access_key_id: xxxx
    aws_secret_access_key: xxxx
    aws_region: eu-west-1
    ec2_key_pair: EMR
    ec2_key_pair_file: /Users/antoinerigoureau/Documents/emr.pem
    ssh_tunnel: true
    ec2_instance_type: m3.xlarge
    ec2_master_instance_type: m3.xlarge
    num_ec2_instances: 1
    cmdenv:
      TZ: Europe/Paris
    bootstrap_python: false
    bootstrap:
      - curl -s https://s3-eu-west-1.amazonaws.com/data-essence/utils/bootstrap.sh | sudo bash -s
      - source /usr/local/ripple/venv/bin/activate
      - sudo pip install -r req.txt#
    upload_archives:
      - /Users/antoinerigoureau/Documents/Essence/data/geoData/urba_france.zip#data
    upload_files:
      - /Users/antoinerigoureau/Documents/Essence/Source/venv_parallel/normalize.py
      - /Users/antoinerigoureau/Documents/Essence/Source/venv_parallel/compute_features.py
    python_bin: /usr/local/ripple/venv/bin/python3
    enable_emr_debugging: True
    setup:
      - source /usr/local/ripple/venv/bin/activate
  local:
    upload_archives:
      - /Users/antoinerigoureau/Documents/Essence/data/geoData/urba_france.zip#data
My bootstrap.sh file is:
#!/bin/bash
set -e
set -x
yum update -y
# install yum packages
yum install -y gcc \
    geos-devel \
    gcc-c++ \
    atlas-sse3-devel \
    lapack-devel \
    libpng-devel \
    freetype-devel \
    zlib-devel \
    ncurses-devel \
    readline-devel \
    patch \
    make \
    libtool \
    curl \
    openssl-devel \
    screen
pushd $HOME
# install python
rm -rf Python-3.5.1.tgz
wget https://www.python.org/ftp/python/3.5.1/Python-3.5.1.tgz &&\
tar -xzvf Python-3.5.1.tgz
pushd Python-3.5.1
./configure
make -j 4
make install
popd
export PATH=/usr/local/bin:$PATH
echo export PATH=/usr/local/bin:\$PATH > /etc/profile.d/usr_local_path.sh
chmod +x /etc/profile.d/usr_local_path.sh
pip3.5 install --upgrade pip virtualenv
mkdir -p /usr/local/ripple/venv
virtualenv /usr/local/ripple/venv
source /usr/local/ripple/venv/bin/activate
# install gdal
rm -rf gdal191.zip
wget http://download.osgeo.org/gdal/gdal191.zip &&\
unzip gdal191.zip
#
# Here is the trick I had to add to get around the following -fPIC error
# /usr/bin/ld: /root/gdal-1.9.1/frmts/o/.libs/aaigriddataset.o: relocation R_X86_64_32S against `vtable for AAIGRasterBand' can not be used when making a shared object; recompile with -fPIC
#
pushd gdal-1.9.1
./configure
CC="gcc -fPIC" CXX="g++ -fPIC" make -j4
make install
popd
export LD_LIBRARY_PATH=/usr/local/lib:${LD_LIBRARY_PATH}
echo export LD_LIBRARY_PATH=/usr/local/lib:\$LD_LIBRARY_PATH > /etc/profile.d/gdal_library_path.sh
chmod +x /etc/profile.d/gdal_library_path.sh
But my job fails, with the following output:
Created new cluster j-T8UUFEZILJYQ
Waiting for step 1 of 1 (s-3SOCF1ZPWJ575) to complete...
PENDING (cluster is STARTING)
PENDING (cluster is STARTING)
PENDING (cluster is STARTING)
PENDING (cluster is STARTING)
PENDING (cluster is BOOTSTRAPPING: Running bootstrap actions)
PENDING (cluster is BOOTSTRAPPING: Running bootstrap actions)
PENDING (cluster is BOOTSTRAPPING: Running bootstrap actions)
PENDING (cluster is BOOTSTRAPPING: Running bootstrap actions)
PENDING (cluster is BOOTSTRAPPING: Running bootstrap actions)
PENDING (cluster is BOOTSTRAPPING: Running bootstrap actions)
PENDING (cluster is BOOTSTRAPPING: Running bootstrap actions)
PENDING (cluster is BOOTSTRAPPING: Running bootstrap actions)
PENDING (cluster is BOOTSTRAPPING: Running bootstrap actions)
PENDING (cluster is BOOTSTRAPPING: Running bootstrap actions)
PENDING (cluster is BOOTSTRAPPING: Running bootstrap actions)
PENDING (cluster is BOOTSTRAPPING: Running bootstrap actions)
PENDING (cluster is BOOTSTRAPPING: Running bootstrap actions)
Opening ssh tunnel to resource manager...
Connect to resource manager at: http://localhost:40199/cluster
RUNNING for 16.2s
Unable to connect to resource manager
RUNNING for 48.8s
FAILED
Cluster j-T8UUFEZILJYQ is TERMINATING: Shut down as step failed
Attempting to fetch counters from logs...
Looking for step log in /mnt/var/log/hadoop/steps/s-3SOCF1ZPWJ575 on ec2-54-194-248-128.eu-west-1.compute.amazonaws.com...
Parsing step log: ssh://ec2-54-194-248-128.eu-west-1.compute.amazonaws.com/mnt/var/log/hadoop/steps/s-3SOCF1ZPWJ575/syslog
Counters: 9
Job Counters
Data-local map tasks=1
Failed map tasks=4
Launched map tasks=4
Other local map tasks=3
Total megabyte-seconds taken by all map tasks=33988320
Total time spent by all map tasks (ms)=23603
Total time spent by all maps in occupied slots (ms)=1062135
Total time spent by all reduces in occupied slots (ms)=0
Total vcore-seconds taken by all map tasks=23603
Scanning logs for probable cause of failure...
Looking for task logs in /mnt/var/log/hadoop/userlogs/application_1463748945334_0001 on ec2-54-194-248-128.eu-west-1.compute.amazonaws.com and task/core nodes...
Parsing task syslog: ssh://ec2-54-194-248-128.eu-west-1.compute.amazonaws.com/mnt/var/log/hadoop/userlogs/application_1463748945334_0001/container_1463748945334_0001_01_000006/syslog
Parsing task stderr: ssh://ec2-54-194-248-128.eu-west-1.compute.amazonaws.com/mnt/var/log/hadoop/userlogs/application_1463748945334_0001/container_1463748945334_0001_01_000006/stderr
Probable cause of failure:
R/W/S=1749/0/0 in:NA [rec/s] out:NA [rec/s]
minRecWrittenToEnableSkip_=9223372036854775807 HOST=null
USER=hadoop
HADOOP_USER=null
last tool output: |null|
java.io.IOException: Broken pipe
at java.io.FileOutputStream.writeBytes(Native Method)
at java.io.FileOutputStream.write(FileOutputStream.java:345)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
at java.io.DataOutputStream.write(DataOutputStream.java:107)
at org.apache.hadoop.streaming.io.TextInputWriter.writeUTF8(TextInputWriter.java:72)
at org.apache.hadoop.streaming.io.TextInputWriter.writeValue(TextInputWriter.java:51)
at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:106)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:65)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:432)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170)
(from lines 48-72 of ssh://ec2-54-194-248-128.eu-west-1.compute.amazonaws.com/mnt/var/log/hadoop/userlogs/application_1463748945334_0001/container_1463748945334_0001_01_000006/syslog)
caused by:
+ /usr/local/ripple/venv/bin/python3 test_mrjob.py --step-num=0 --mapper
Traceback (most recent call last):
File "test_mrjob.py", line 2, in <module>
import numpy as np
ImportError: No module named 'numpy'
(from lines 31-35 of ssh://ec2-54-194-248-128.eu-west-1.compute.amazonaws.com/mnt/var/log/hadoop/userlogs/application_1463748945334_0001/container_1463748945334_0001_01_000006/stderr)
while reading input from s3://data-essence/databerries-01/extract_essence_000000000001.gz
Step 1 of 1 failed
Killing our SSH tunnel (pid 1288)
Terminating cluster: j-T8UUFEZILJYQ
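For what it's worth, the "Broken pipe" in the Java stack trace above is a symptom, not the cause: Hadoop Streaming writes input records to the mapper's stdin, and once the mapper process dies (here, on the failed import numpy), the writer gets EPIPE. A toy stand-in with no Hadoop involved reproduces the same pattern:

```python
# Toy reproduction (no Hadoop involved) of the "Broken pipe" pattern:
# the writer keeps feeding stdin after the reader process has died,
# just as Hadoop Streaming did once the mapper crashed on import.
import subprocess
import sys

# Stand-in for the failing mapper: it exits immediately with an error,
# the way the `import numpy` ImportError killed test_mrjob.py.
proc = subprocess.Popen(
    [sys.executable, "-c", "raise SystemExit(1)"],
    stdin=subprocess.PIPE,
)
proc.wait()  # the "mapper" is already gone

try:
    # The parent plays Hadoop's role and keeps writing input records;
    # enough data to overflow the pipe buffer guarantees the error.
    for _ in range(20000):
        proc.stdin.write(b"input record\n")
    proc.stdin.flush()
    result = "no error"
except BrokenPipeError:
    result = "broken pipe"

# Close without letting a second EPIPE escape at interpreter shutdown.
try:
    proc.stdin.close()
except BrokenPipeError:
    pass

print(result)
```

So the stack trace points at the Java side, but the real failure is whatever made the Python mapper exit, which is what the "caused by" stderr excerpt shows.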
I previously tested all of my bootstrap actions on a VM, where they seemed to work fine. Any clue as to what is going on?
Update: I tried running a basic mrjob example with only an extra numpy import and the same setup process, and got the same error: the job fails because it cannot import numpy.
Answer 0 (score: 0):
I finally solved my problem: I had to change this line in my mrjob.conf file:
- sudo pip install -r req.txt#
to:
- sudo /usr/local/ripple/venv/bin/pip3 install -r req.txt#
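For anyone hitting the same thing: plain sudo pip resolves to the system Python's pip, so the packages land outside the virtualenv that python_bin points at, and the interpreter that actually runs the job never sees them. A small stdlib-only script (a sketch; the venv path in the comment is the one from this question) makes that kind of mismatch visible:

```python
# Stdlib-only sanity check: show which interpreter is running and
# whether it can see a given package, without importing the package.
import importlib.util
import sys

def can_import(module_name):
    """True if module_name is importable by *this* interpreter."""
    return importlib.util.find_spec(module_name) is not None

if __name__ == "__main__":
    # On the cluster, run this with /usr/local/ripple/venv/bin/python3
    # (the python_bin from mrjob.conf) to check that it matches the
    # interpreter whose pip installed the packages.
    print(sys.executable)
    print("numpy importable:", can_import("numpy"))
```

Running it once with the system python3 and once with the venv's python3 shows immediately which interpreter actually has numpy installed.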