Question

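I am trying to configure Stocator on an Amazon EMR cluster to access data on Amazon S3. I have found resources indicating that this should be possible, but very little detail on how to make it work.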

When I launch the EMR cluster, I use the following configuration:

{
    "classification": "core-site",
    "properties": {
        "fs.stocator.scheme.list": "cos",
        "fs.cos.impl": "com.ibm.stocator.fs.ObjectStoreFileSystem",
        "fs.stocator.cos.impl": "com.ibm.stocator.fs.cos.COSAPIClient",
        "fs.stocator.cos.scheme":"cos"
    }
}

I then try to access a file using cos://mybucket.service/myfile

This fails with an error about missing credentials.

I then added the credentials to the Hadoop configuration from spark-shell using the following:

val credentials = new com.amazonaws.auth.DefaultAWSCredentialsProviderChain().getCredentials
sc.hadoopConfiguration.set("fs.cos.service.access.key",credentials.getAWSAccessKeyId)
sc.hadoopConfiguration.set("fs.cos.service.secret.key",credentials.getAWSSecretKey)

Now, when I try to access cos://mybucket.service/myfile, I get the error: org.apache.spark.sql.AnalysisException: Path does not exist:.
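
For reference, the Stocator README documents a per-service endpoint property (fs.cos.<service>.endpoint) next to the access and secret keys, and neither snippet above sets one. Below is a sketch of the same core-site classification with those per-service properties filled in. It assumes the service name "service" taken from the URI, and the endpoint URL is an assumption (a regional S3 endpoint; us-east-1 is only a placeholder region). The key values are placeholders, and this has not been verified on EMR.

```json
{
    "classification": "core-site",
    "properties": {
        "fs.stocator.scheme.list": "cos",
        "fs.cos.impl": "com.ibm.stocator.fs.ObjectStoreFileSystem",
        "fs.stocator.cos.impl": "com.ibm.stocator.fs.cos.COSAPIClient",
        "fs.stocator.cos.scheme": "cos",
        "fs.cos.service.endpoint": "https://s3.us-east-1.amazonaws.com",
        "fs.cos.service.access.key": "<ACCESS_KEY_PLACEHOLDER>",
        "fs.cos.service.secret.key": "<SECRET_KEY_PLACEHOLDER>"
    }
}
```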

Accessing the file with s3://mybucket/myfile works, since that path does not go through Stocator. The file is also accessible through the AWS CLI.

Are there any online resources that describe in detail how to use Stocator with AWS?

Has anyone gotten this working themselves, and if so, could you share your configuration?

Answer 1

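You might just want to get in touch with Gil Vernik and ask for advice. Make sure it works with EMR's S3 consistency semantics; I believe it should.
Hadoop 3.1 has its own high-performance committers, which may be faster than Stocator. (But I would say that, wouldn't I?)
Part of that code has its origins in the Netflix S3A committer.

I would use the Netflix one, as I'm confident that it works well there.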
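
The committer suggestion above could look roughly like the following EMR configuration. This is a sketch, not a tested setup. The property names come from the Hadoop S3A committer documentation and the Spark cloud-integration documentation, and "directory" selects the staging committer that derives from the Netflix code. It assumes Hadoop 3.1 or later, the spark-hadoop-cloud module on the classpath, and data addressed through s3a:// URLs rather than EMR's own s3:// connector.

```json
[
    {
        "classification": "spark-defaults",
        "properties": {
            "spark.hadoop.fs.s3a.committer.name": "directory",
            "spark.sql.sources.commitProtocolClass": "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol",
            "spark.sql.parquet.output.committer.class": "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter"
        }
    }
]
```

Whether S3A and these committers are available or supported on a given EMR release would need to be checked against that release's documentation.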
