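Option to enable the Glue catalog for Presto / Spark on EMR using Terraform

Time: 2019-02-28 05:05:35

Tags: terraform amazon-emr terraform-provider-aws

I would like to know whether Terraform supports enabling the AWS Glue catalog for Presto / Spark when running on EMR. I could not find anything in the documentation.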

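2 answers:

Answer 0: (score: 1)

The following AWS documentation covers using Apache Spark and Hive on Amazon EMR with the AWS Glue Data Catalog, and using the AWS Glue Data Catalog as the default Hive metastore for Presto (Amazon EMR release 5.10.0 and later). Is this what you are looking for?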

https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-presto-glue.html

https://aws.amazon.com/about-aws/whats-new/2017/08/use-apache-spark-and-hive-on-amazon-emr-with-the-aws-glue-data-catalog/

Also check this SO question for more on using the Glue catalog on EMR:

Issue with AWS Glue Data Catalog as Metastore for Spark SQL on EMR

Answer 1: (score: 0)

Based on the links in the answer above, I was able to model the Terraform code as follows.

Create configuration.json.tpl with the following content:

[
  {
    "Classification": "spark-hive-site",
    "Properties": {
      "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
    }
  }
]

Load the template above in your Terraform code with a template_file data source:

data "template_file" "cluster_1_configuration" {
  template = "${file("${path.module}/templates/configuration.json.tpl")}"
}
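
As a side note, on newer Terraform versions the same file could probably be read without the template provider. The sketch below assumes Terraform 0.12 or later, where the built-in templatefile() function is available; the local name cluster_1_configuration is only an illustration, and the empty map reflects the fact that the template above has no variables.

```hcl
# Sketch only: assumes Terraform 0.12+ (templatefile() is built in there).
# The local name "cluster_1_configuration" is hypothetical.
locals {
  # The template has no placeholders, so the variables map is empty.
  cluster_1_configuration = templatefile(
    "${path.module}/templates/configuration.json.tpl",
    {}
  )
}
```

With this variant, the rendered JSON would be referenced as local.cluster_1_configuration instead of data.template_file.cluster_1_configuration.rendered.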

Then configure the cluster as:

resource "aws_emr_cluster" "cluster_1" {
  name           = "${var.cluster_name}-1"
  release_label  = "emr-5.21.0"
  applications   = ["Spark", "Zeppelin", "Hadoop", "Sqoop"]
  log_uri        = "s3n://${var.cluster_name}/logs/"
  configurations = "${data.template_file.cluster_1_configuration.rendered}"
  ...
}

Glue should now work from Spark. You can verify this by calling spark.catalog.listDatabases().show() from spark-shell.
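
The configuration above only covers Spark, while the question also asks about Presto. Going by the AWS documentation linked in the first answer, Hive uses the same factory class under the hive-site classification, and Presto is switched to Glue through the presto-connector-hive classification (EMR 5.10.0 and later). A minimal sketch of how the combined JSON might be built directly in HCL follows. It assumes Terraform 0.12 or later; the local and output names are hypothetical, and the cluster would also need "Hive" and "Presto" in its applications list.

```hcl
# Sketch only: assumes Terraform 0.12+. All local/output names are hypothetical.
locals {
  glue_factory_class = "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"

  emr_glue_configurations = [
    {
      # Spark SQL: same setting as in configuration.json.tpl above.
      Classification = "spark-hive-site"
      Properties = {
        "hive.metastore.client.factory.class" = local.glue_factory_class
      }
    },
    {
      # Hive: same factory class, under the hive-site classification.
      Classification = "hive-site"
      Properties = {
        "hive.metastore.client.factory.class" = local.glue_factory_class
      }
    },
    {
      # Presto: Glue as the default Hive metastore (EMR 5.10.0 and later).
      Classification = "presto-connector-hive"
      Properties = {
        "hive.metastore.glue.datacatalog.enabled" = "true"
      }
    },
  ]
}

# Exposes the rendered JSON so it can be inspected with `terraform output`
# before it is wired into the cluster resource.
output "emr_glue_configurations_json" {
  value = jsonencode(local.emr_glue_configurations)
}
```

The encoded string could then be passed to the cluster in place of the rendered template, for example through the configurations argument used above, or through configurations_json if your version of the AWS provider offers that argument.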