I am launching an EMR cluster from a separate EC2 instance using boto3, with the bootstrap script shown below (a sketch of the run_job_flow call that attaches it follows the script):
#!/bin/bash
############################################################################
#For all nodes including master #########
############################################################################
wget https://repo.anaconda.com/archive/Anaconda3-2019.10-Linux-x86_64.sh
bash Anaconda3-2019.10-Linux-x86_64.sh -b -p /mnt1/anaconda3
export PATH=/mnt1/anaconda3/bin:$PATH
echo "export PATH="/mnt1/anaconda3/bin:$PATH"" >> ~/.bash_profile
sudo sed -i -e '$a\export PYSPARK_PYTHON=/mnt1/anaconda3/bin/python' /etc/spark/conf/spark-env.sh
echo "export PYSPARK_PYTHON="/mnt1/anaconda3/bin/python3"" >> ~/.bash_profile
conda install -c conda-forge -y shap
conda install -c conda-forge -y lightgbm
conda install -c anaconda -y numpy
conda install -c anaconda -y pandas
conda install -c conda-forge -y pyarrow
conda install -c anaconda -y boto3
############################################################################
#For master #########
############################################################################
if [ `grep 'isMaster' /mnt/var/lib/info/instance.json | awk -F ':' '{print $2}' | awk -F ',' '{print $1}'` = 'true' ]; then
sudo sed -i -e '$a\export PYSPARK_PYTHON=/mnt1/anaconda3/bin/python' /etc/spark/conf/spark-env.sh
echo "export PYSPARK_PYTHON="/mnt1/anaconda3/bin/python3"" >> ~/.bash_profile
sudo yum -y install git-core
conda install -c conda-forge -y jupyterlab
conda install -y jupyter
conda install -c conda-forge -y s3fs
conda install -c conda-forge -y nodejs
pip install spark-df-profiling
jupyter labextension install jupyterlab_filetree
jupyter labextension install @jupyterlab/toc
fi
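For context, the cluster itself is created with boto3's run_job_flow, roughly along these lines (a trimmed sketch; the bucket name, subnet, and instance settings here are placeholders rather than my real values):
import boto3

conn = boto3.client("emr", region_name="us-east-1")

response = conn.run_job_flow(
    Name="my-cluster",
    ReleaseLabel="emr-5.28.0",
    Applications=[{"Name": "Hadoop"}, {"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"Name": "Master", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "Ec2SubnetId": "SUBNET_ID",
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    # The bootstrap script above, uploaded to S3 beforehand
    BootstrapActions=[{
        "Name": "install-anaconda",
        "ScriptBootstrapAction": {"Path": "s3://BUCKET_NAME/bootstrap.sh"},
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
    VisibleToAllUsers=True,
)
curr_cluster_id = response["JobFlowId"]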
I then use add_job_flow_steps to programmatically add a step to the running cluster:
action = conn.add_job_flow_steps(JobFlowId=curr_cluster_id, Steps=layer_function_steps)
The step is a correctly formed spark-submit.
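For reference, the step list has the usual command-runner.jar shape; a minimal sketch of layer_function_steps (the step name and script path are placeholders, not my actual values):
layer_function_steps = [{
    "Name": "layer_function",
    "ActionOnFailure": "CONTINUE",
    "HadoopJarStep": {
        "Jar": "command-runner.jar",
        "Args": [
            "spark-submit",
            "--deploy-mode", "cluster",
            "s3://BUCKET_NAME/jobs/layer_function.py",
        ],
    },
}]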
In one of the imported Python files I import boto3, and the error I get is
ImportError: No module named boto3
Clearly I am installing this library. If I SSH into the master node and run
python
import boto3
it works fine. Since I installed the libraries with conda, is there some issue with spark-submit picking them up?
Answer (score: 0)
AWS has a project (AWS Data Wrangler) that helps with spinning up EMR.
This snippet should launch a cluster with Python 3 enabled:
import awswrangler as wr

cluster_id = wr.emr.create_cluster(
    cluster_name="wrangler_cluster",
    logging_s3_path=f"s3://BUCKET_NAME/emr-logs/",
    emr_release="emr-5.28.0",
    subnet_id="SUBNET_ID",
    emr_ec2_role="EMR_EC2_DefaultRole",
    emr_role="EMR_DefaultRole",
    instance_type_master="m5.xlarge",
    instance_type_core="m5.xlarge",
    instance_type_task="m5.xlarge",
    instance_ebs_size_master=50,
    instance_ebs_size_core=50,
    instance_ebs_size_task=50,
    instance_num_on_demand_master=1,
    instance_num_on_demand_core=1,
    instance_num_on_demand_task=1,
    instance_num_spot_master=0,
    instance_num_spot_core=1,
    instance_num_spot_task=1,
    spot_bid_percentage_of_on_demand_master=100,
    spot_bid_percentage_of_on_demand_core=100,
    spot_bid_percentage_of_on_demand_task=100,
    spot_provisioning_timeout_master=5,
    spot_provisioning_timeout_core=5,
    spot_provisioning_timeout_task=5,
    spot_timeout_to_on_demand_master=True,
    spot_timeout_to_on_demand_core=True,
    spot_timeout_to_on_demand_task=True,
    python3=True,  # Relevant argument
    spark_glue_catalog=True,
    hive_glue_catalog=True,
    presto_glue_catalog=True,
    bootstraps_paths=["s3://BUCKET_NAME/bootstrap.sh"],  # Relevant argument
    debugging=True,
    applications=["Hadoop", "Spark", "Ganglia", "Hive"],
    visible_to_all_users=True,
    key_pair_name=None,
    spark_jars_path=[f"s3://...jar"],
    maximize_resource_allocation=True,
    keep_cluster_alive_when_no_steps=True,
    termination_protected=False,
    spark_pyarrow=True,  # Relevant argument
    tags={
        "foo": "boo"
    },
)
bootstrap.sh content:
#!/usr/bin/env bash
set -e
echo "Installing Python libraries..."
sudo pip-3.6 install -U awswrangler
sudo pip-3.6 install -U LIBRARY1
sudo pip-3.6 install -U LIBRARY2
...
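With python3=True, PySpark on the cluster should already point at the EMR-provided Python 3, which is the same interpreter pip-3.6 installs into, so the ImportError from the question should not appear. The spark-submit step can then be added with boto3 exactly as in the question; a minimal sketch reusing the cluster_id returned by create_cluster (the script path is a placeholder):
import boto3

emr = boto3.client("emr")
emr.add_job_flow_steps(
    JobFlowId=cluster_id,
    Steps=[{
        "Name": "layer_function",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--deploy-mode", "cluster",
                "s3://BUCKET_NAME/jobs/layer_function.py",
            ],
        },
    }],
)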