Question

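I am having a hard time getting additional libraries to work with my EMR notebook. The AWS console for EMR lets me create Jupyter notebooks and attach them to a running cluster, and I want to use additional libraries in them. SSHing into the machines and installing manually as ec2-user or root does not make the libraries available to the notebook, because it apparently runs as the livy user. Bootstrap actions install things for the hadoop user. I cannot install from the notebook itself, because its user apparently has no sudo, git, and so on, and the install probably would not reach the worker nodes anyway.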

What is the canonical way to install additional libraries for notebooks created through the EMR console?

Answer 1

"What is the canonical way to install additional libraries for notebooks created through the EMR console?"

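EMR Notebooks recently introduced "notebook-scoped libraries", which let you install additional Python libraries on the cluster from a public or private PyPI repository and use them within the notebook session.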

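Notebook-scoped libraries have the following benefits:

You can use libraries in an EMR notebook without recreating the cluster or re-attaching the notebook to a cluster.
You can isolate the library dependencies of an EMR notebook to the individual notebook session. Libraries installed from within a notebook cannot interfere with other libraries on the cluster or with libraries installed in other notebook sessions.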

More details: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-managed-notebooks-scoped-libraries.html

Technical blog post: https://aws.amazon.com/blogs/big-data/install-python-libraries-on-a-running-cluster-with-emr-notebooks/
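As a rough sketch of what this looks like in a notebook cell: in the PySpark kernel of EMR Notebooks, the SparkContext `sc` exposes `install_pypi_package`, `list_packages`, and `uninstall_package`. The helper name `install_scoped` and the pinned version in the usage note are illustrative assumptions, not part of that API.

```python
def install_scoped(sc, packages, repo_url=None):
    """Install packages for the current notebook session only.

    `sc` is the SparkContext provided by the EMR Notebooks PySpark kernel.
    `packages` maps a package name to a version string, or to None to let
    pip pick the version.
    `repo_url` optionally points at a private PyPI repository.
    Returns the list of requirement specs that were requested.
    """
    specs = []
    for name, version in packages.items():
        # Pin the version when one is given, otherwise pass the bare name
        spec = "{}=={}".format(name, version) if version else name
        if repo_url:
            sc.install_pypi_package(spec, repo_url)
        else:
            sc.install_pypi_package(spec)
        specs.append(spec)
    return specs
```

In a notebook this would be called as `install_scoped(sc, {"pandas": "0.25.1"})`, after which `sc.list_packages()` should show what the session can see.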

Answer 2

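As an example, suppose you need the librosa Python module on a running EMR cluster. We will use Python 2.7 because the procedure is simpler: Python 2.7 is guaranteed to be on the cluster, and it is EMR's default runtime.

Create a script that installs the package: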

#!/bin/bash
sudo easy_install-2.7 pip
sudo /usr/local/bin/pip2 install librosa

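Save it to your home directory, for example /home/hadoop/install_librosa.sh. Note the name; we will use it later.

In the next step, you run that script through another script, emr_install.py, which is inspired by the Amazon EMR docs. It uses AWS Systems Manager to execute the script on the nodes.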

import sys
import time

from boto3 import client

# The cluster ID is the only command-line argument
if len(sys.argv) != 2:
    print("Syntax: emr_install.py [ClusterId]")
    sys.exit(1)
cluster_id = sys.argv[1]

emrclient = client('emr')

# Get list of core nodes
instances = emrclient.list_instances(ClusterId=cluster_id, InstanceGroupTypes=['CORE'])['Instances']
instance_list = [x['Ec2InstanceId'] for x in instances]

# Attach tag to core nodes
ec2client = client('ec2')
ec2client.create_tags(Resources=instance_list, Tags=[{"Key": "environment", "Value": "coreNodeLibs"}])

ssmclient = client('ssm')

# Run shell script to install libraries
command = ssmclient.send_command(Targets=[{"Key": "tag:environment", "Values": ["coreNodeLibs"]}],
                                 DocumentName='AWS-RunShellScript',
                                 Parameters={"commands": ["bash /home/hadoop/install_librosa.sh"]},
                                 TimeoutSeconds=3600)['Command']['CommandId']

# Give the command some time to run before reading its status
time.sleep(30)

# Look the command up by ID without a status filter, so the lookup
# returns it whatever state it is in (it may still be in progress)
command_status = ssmclient.list_commands(CommandId=command)['Commands'][0]['Status']

print("Command:" + command + ": " + command_status)

To run it:

python emr_install.py [cluster_id]
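A fixed 30-second sleep may be too short for a large install. One possible refinement, sketched here under the assumption that the same boto3 SSM client is used, is to poll `list_commands` until the command reaches a terminal status. The function name `wait_for_command` and its default intervals are hypothetical.

```python
import time

# Statuses after which an SSM command no longer changes
TERMINAL_STATUSES = {"Success", "Cancelled", "Failed", "TimedOut"}


def wait_for_command(ssmclient, command_id, poll_seconds=10, max_polls=60):
    """Poll list_commands until the command reaches a terminal status.

    Returns the last status seen, which may still be non-terminal if
    max_polls is exhausted first.
    """
    status = "Pending"
    for _ in range(max_polls):
        commands = ssmclient.list_commands(CommandId=command_id)["Commands"]
        if commands:
            status = commands[0]["Status"]
            if status in TERMINAL_STATUSES:
                break
        time.sleep(poll_seconds)
    return status
```

In emr_install.py this could stand in for the sleep and the status lookup, for example `command_status = wait_for_command(ssmclient, command)`.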

Answer 3

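What I usually do in this case is delete the cluster and create a new one with a bootstrap action. Bootstrap actions let you install additional libraries on the cluster: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-bootstrap.html. For example, writing the following script and saving it in S3 lets you use datadog from a notebook running on top of the cluster (it works with EMR 5.19 at least):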

#!/bin/bash -xe
#install datadog module for using in pyspark
sudo pip-3.4 install -U datadog

Here is the command line to launch the cluster:

aws emr create-cluster --release-label emr-5.19.0 \
--name 'EMR 5.19 test' \
--applications Name=Hadoop Name=Spark Name=Hive Name=Livy \
--use-default-roles \
--instance-groups \
InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.large \
InstanceGroupType=CORE,InstanceCount=2,InstanceType=m4.large \
--region eu-west-1 \
--log-uri s3://<path-to-logs> \
--configurations file://config-emr.json \
--bootstrap-actions Path=s3://<path-to-bootstrap-in-aws>,Name=InstallPythonModules

And here is config-emr.json, stored locally on your computer:

[
    {
        "Classification": "spark",
        "Properties": {
            "maximizeResourceAllocation": "true"
        }
    },
    {
        "Classification": "spark-env",
        "Configurations": [
            {
                "Classification": "export",
                "Properties": {
                    "PYSPARK_PYTHON": "/usr/bin/python3"
                }
            }
        ]
    }
]

I assume you can do exactly the same thing through the advanced options when creating a cluster in the EMR console.
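If you would rather launch the cluster from Python than from the CLI, the same settings map onto boto3's EMR `run_job_flow` call. The following is only a sketch: the helper name `build_cluster_request` is hypothetical, the two role names are the ones I understand `--use-default-roles` to select, and the S3 paths are left for you to supply because the command above elides them.

```python
def build_cluster_request(bootstrap_path, log_uri, configurations):
    """Build keyword arguments for boto3's EMR run_job_flow call.

    bootstrap_path: S3 path of the bootstrap script (caller-supplied).
    log_uri: S3 path for cluster logs (caller-supplied).
    configurations: the parsed contents of config-emr.json.
    """
    return {
        "Name": "EMR 5.19 test",
        "ReleaseLabel": "emr-5.19.0",
        "Applications": [{"Name": n} for n in ("Hadoop", "Spark", "Hive", "Livy")],
        "Instances": {
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": "m4.large", "InstanceCount": 1},
                {"InstanceRole": "CORE", "InstanceType": "m4.large", "InstanceCount": 2},
            ],
            # Keep the cluster running so a notebook can attach to it
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "BootstrapActions": [
            {
                "Name": "InstallPythonModules",
                "ScriptBootstrapAction": {"Path": bootstrap_path},
            }
        ],
        "Configurations": configurations,
        "LogUri": log_uri,
        # Assumed equivalents of --use-default-roles
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }
```

A call would then look like `boto3.client("emr", region_name="eu-west-1").run_job_flow(**build_cluster_request(path, logs, json.load(open("config-emr.json"))))`.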

Answer 4

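I spent far too long on this, and neither the AWS documentation nor support helped at all, but I did get it working so that you can install Python libraries directly in the notebook.

If you can satisfy the items below, you can install libraries by running a pip install command in a single-line Jupyter cell, with the Python runtime, like this: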

!pip install pandas

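One thing that confused me: I could SSH into the cluster and reach the internet, and both ping and pip worked there, yet the notebook could not reach out and no libraries were actually available. What matters is that the notebook itself can reach out. A good test is simply to see whether you can ping out from it. Same structure as above, a single line starting with !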

!pip install pandas

If that takes too long and times out, you still need to sort out your VPN/subnet rules.
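If you want the reachability check as Python in a notebook cell rather than as a shell command, a small TCP probe is one option. This is a sketch: the name `can_reach` and the choice of port 443 are my assumptions. It tests whether an HTTPS connection can be opened, which is what pip needs, rather than whether ICMP ping gets through.

```python
import socket


def can_reach(host, port=443, timeout=5.0):
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS lookup failures
        return False
```

Run from the notebook, `can_reach("pypi.org")` returning False would point to the same networking problem as a pip install that times out.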

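Notes on cluster creation:

(Step 1) This does not work on every EMR release. I got it working on 5.30.0, but the last time I checked, 5.30.1 did not work.
(Step 2 -> Networking) You need to make sure you are in a private subnet and that your VPN can reach the public internet. Again, do not let SSHing into the server fool you: the notebook runs either in a docker image on that server or somewhere else entirely. The only relevant test is one you run directly from the notebook.

Once you have this working and have installed a package, it is available to any notebook on that cluster. I keep a notebook called install with one line per package, and I run it whenever I spin up a new cluster.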
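The "install" notebook could also be a single Python cell that loops over a package list. The sketch below assumes a plain Python kernel and uses hypothetical names (`pip_command`, `install_all`). It calls pip through the kernel's own interpreter, which may or may not be the same environment that `!pip` resolves to.

```python
import subprocess
import sys


def pip_command(package):
    """Build a pip invocation bound to the interpreter running this kernel."""
    return [sys.executable, "-m", "pip", "install", package]


def install_all(packages, run=subprocess.run):
    """Install each package in turn and return the names that failed.

    `run` is injectable so the loop can be exercised without touching pip.
    """
    failed = []
    for package in packages:
        result = run(pip_command(package))
        if result.returncode != 0:
            failed.append(package)
    return failed
```

For example, `install_all(["pandas", "datadog"])` returns an empty list when every install succeeds.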
