Creating a carbondata table on S3 with Spark

Asked: 2018-05-17 07:24:03

Tags: scala apache-spark amazon-s3 carbon-data

Below is the code snippet I am trying to use to create a carbondata table in S3. However, despite setting the AWS credentials in the Hadoop configuration, it still complains that the secret key and access key are not set. What is the issue?

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.CarbonSession._

val carbon = SparkSession.builder().config(sc.getConf).getOrCreateCarbonSession("s3n://url")
carbon.sparkContext.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "<accesskey>")
carbon.sparkContext.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "<secretaccesskey>")
carbon.sql("CREATE TABLE IF NOT EXISTS test_table(id string, name string, city string, age Int) STORED BY 'carbondata'")

The last command produces the following error:

java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively)

Spark Version: 2.2.1
Command used to start spark-shell:
$SPARK_PATH/bin/spark-shell --jars /localpath/jar/apache-carbondata-1.3.1-bin-spark2.2.1-hadoop2.7.2/apache-carbondata-1.3.1-bin-spark2.2.1-hadoop2.7.2.jar,/localpath/jar/spark-avro_2.11-4.0.0.jar --packages com.amazonaws:aws-java-sdk-pom:1.9.22,org.apache.hadoop:hadoop-aws:2.7.2,org.slf4j:slf4j-simple:1.7.21,asm:asm:3.2,org.xerial.snappy:snappy-java:1.1.7.1,com.databricks:spark-avro_2.11:4.0.0
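
One launch-time alternative, assuming spark-shell forwards spark.hadoop.*-prefixed conf entries into the Hadoop configuration (placeholder keys, jars as above):

$SPARK_PATH/bin/spark-shell --jars <jars-as-above> \
  --conf spark.hadoop.fs.s3n.awsAccessKeyId=<accesskey> \
  --conf spark.hadoop.fs.s3n.awsSecretAccessKey=<secretaccesskey>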

Update

It turns out that S3 support is only available from 1.4.0 RC1, so I built RC1 and tested the code below. But I still seem to be running into problems. Any help appreciated. Code:

import org.apache.spark.sql.CarbonSession._
import org.apache.hadoop.fs.s3a.Constants.{ACCESS_KEY, ENDPOINT, SECRET_KEY}
import org.apache.spark.sql.SparkSession
import org.apache.carbondata.core.constants.CarbonCommonConstants

object sample4 {

  def main(args: Array[String]) {
    val (accessKey, secretKey, endpoint) = getKeyOnPrefix("s3n://")
    //val rootPath = new File(this.getClass.getResource("/").getPath
    //                            + "../../../..").getCanonicalPath
    val path = "/localpath/sample/data1.csv"
    val spark = SparkSession
      .builder()
      .master("local")
      .appName("S3UsingSDKExample")
      .config("spark.driver.host", "localhost")
      .config(accessKey, "<accesskey>")
      .config(secretKey, "<secretkey>")
      //.config(endpoint, "s3-us-east-1.amazonaws.com")
      .getOrCreateCarbonSession()

    spark.sql("DROP TABLE IF EXISTS carbon_table")

    spark.sql(
      s"""
         | CREATE TABLE IF NOT EXISTS carbon_table(
         | shortField SHORT,
         | intField INT,
         | bigintField LONG,
         | doubleField DOUBLE,
         | stringField STRING,
         | timestampField TIMESTAMP,
         | decimalField DECIMAL(18,2),
         | dateField DATE,
         | charField CHAR(5),
         | floatField FLOAT
         | )
         | STORED BY 'carbondata'
         | LOCATION 's3n://bucketName/table/carbon_table'
         | TBLPROPERTIES('SORT_COLUMNS'='', 'DICTIONARY_INCLUDE'='dateField, charField')
       """.stripMargin)
  }

  def getKeyOnPrefix(path: String): (String, String, String) = {
    val endPoint = "spark.hadoop." + ENDPOINT
    if (path.startsWith(CarbonCommonConstants.S3A_PREFIX)) {
      ("spark.hadoop." + ACCESS_KEY, "spark.hadoop." + SECRET_KEY, endPoint)
    } else if (path.startsWith(CarbonCommonConstants.S3N_PREFIX)) {
      ("spark.hadoop." + CarbonCommonConstants.S3N_ACCESS_KEY,
        "spark.hadoop." + CarbonCommonConstants.S3N_SECRET_KEY, endPoint)
    } else if (path.startsWith(CarbonCommonConstants.S3_PREFIX)) {
      ("spark.hadoop." + CarbonCommonConstants.S3_ACCESS_KEY,
        "spark.hadoop." + CarbonCommonConstants.S3_SECRET_KEY, endPoint)
    } else {
      throw new Exception("Incorrect Store Path")
    }
  }

  def getSparkMaster(args: Array[String]): String = {
    if (args.length == 6) args(5)
    else if (args(3).contains("spark:") || args(3).contains("mesos:")) args(3)
    else "local"
  }
}

Error:

18/05/17 12:23:22 ERROR SegmentStatusManager: main Failed to read metadata of load
org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.ServiceException: Request Error: Empty key
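
One detail worth double-checking in the snippet above, noted here as an assumption rather than a confirmed cause: the session is created with getOrCreateCarbonSession() and no store path, whereas the first snippet passed one. A minimal sketch wiring the keys and the S3 store location through the same builder (placeholder values):

// Sketch: same builder as above, but passing the S3 store location to the
// session factory, as the first snippet did (placeholder credentials).
val carbon = SparkSession
  .builder()
  .master("local")
  .appName("S3UsingSDKExample")
  .config(accessKey, "<accesskey>")
  .config(secretKey, "<secretkey>")
  .getOrCreateCarbonSession("s3n://bucketName/table/carbon_table")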

I also tried running the example code below (with the s3, s3n and s3a protocols):

https://github.com/apache/carbondata/blob/master/examples/spark2/src/main/scala/org/apache/carbondata/examples/S3Example.scala

Ran as:

S3Example.main(Array("ACCESSKEY", "SECRETKEY", "s3://bucketName/path/carbon_table", "<endpoint-url>", "local"))

Error stack trace:

org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.S3ServiceException: Request Error: Empty key
    at org.apache.hadoop.fs.s3.Jets3tFileSystemStore.get(Jets3tFileSystemStore.java:175)
    at org.apache.hadoop.fs.s3.Jets3tFileSystemStore.retrieveINode(Jets3tFileSystemStore.java:221)
    at sun.reflect.GeneratedMethodAccessor42.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
    at com.sun.proxy.$Proxy21.retrieveINode(Unknown Source)
    at org.apache.hadoop.fs.s3.S3FileSystem.getFileStatus(S3FileSystem.java:340)
    at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1426)
    at org.apache.carbondata.core.datastore.filesystem.AbstractDFSCarbonFile.isFileExist(AbstractDFSCarbonFile.java:426)
    at org.apache.carbondata.core.datastore.impl.FileFactory.isFileExist(FileFactory.java:201)
    at org.apache.carbondata.core.statusmanager.SegmentStatusManager.readTableStatusFile(SegmentStatusManager.java:246)
    at org.apache.carbondata.core.statusmanager.SegmentStatusManager.readLoadMetadata(SegmentStatusManager.java:197)
    at org.apache.carbondata.core.cache.dictionary.ManageDictionaryAndBTree.clearBTreeAndDictionaryLRUCache(ManageDictionaryAndBTree.java:101)
    at org.apache.spark.sql.hive.CarbonFileMetastore.dropTable(CarbonFileMetastore.scala:460)
    at org.apache.spark.sql.execution.command.table.CarbonCreateTableCommand.processMetadata(CarbonCreateTableCommand.scala:148)
    at org.apache.spark.sql.execution.command.MetadataCommand.run(package.scala:68)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67)
    at org.apache.spark.sql.Dataset.<init>(Dataset.scala:183)
    at org.apache.spark.sql.CarbonSession$$anonfun$sql$1.apply(CarbonSession.scala:107)
    at org.apache.spark.sql.CarbonSession$$anonfun$sql$1.apply(CarbonSession.scala:96)
    at org.apache.spark.sql.CarbonSession.withProfiler(CarbonSession.scala:144)
    at org.apache.spark.sql.CarbonSession.sql(CarbonSession.scala:94)
    at $line19.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$S3Example$.main(<console>:68)
    at $line26.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:31)
    at $line26.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:36)
    at $line26.$read$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:38)
    at $line26.$read$$iw$$iw$$iw$$iw$$iw.<init>(<console>:40)
    at $line26.$read$$iw$$iw$$iw$$iw.<init>(<console>:42)
    at $line26.$read$$iw$$iw$$iw.<init>(<console>:44)
    at $line26.$read$$iw$$iw.<init>(<console>:46)
    at $line26.$read$$iw.<init>(<console>:48)
    at $line26.$read.<init>(<console>:50)
    at $line26.$read$.<init>(<console>:54)
    at $line26.$read$.<clinit>(<console>)
    at $line26.$eval$.$print$lzycompute(<console>:7)
    at $line26.$eval$.$print(<console>:6)
    at $line26.$eval.$print(<console>)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:786)
    at scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1047)
    at scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:638)
    at scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:637)
    at scala.reflect.internal.util.ScalaClassLoader$class.asContext(ScalaClassLoader.scala:31)
    at scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:19)
    at scala.tools.nsc.interpreter.IMain$WrappedRequest.loadAndRunReq(IMain.scala:637)
    at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:569)
    at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:565)
    at scala.tools.nsc.interpreter.ILoop.interpretStartingWith(ILoop.scala:807)
    at scala.tools.nsc.interpreter.ILoop.command(ILoop.scala:681)
    at scala.tools.nsc.interpreter.ILoop.processLine(ILoop.scala:395)
    at scala.tools.nsc.interpreter.ILoop.loop(ILoop.scala:415)
    at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply$mcZ$sp(ILoop.scala:923)
    at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:909)
    at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:909)
    at scala.reflect.internal.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:97)
    at scala.tools.nsc.interpreter.ILoop.process(ILoop.scala:909)
    at org.apache.spark.repl.Main$.doMain(Main.scala:74)
    at org.apache.spark.repl.Main$.main(Main.scala:54)
    at org.apache.spark.repl.Main.main(Main.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:775)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: org.jets3t.service.S3ServiceException: Request Error: Empty key
    at org.jets3t.service.S3Service.getObject(S3Service.java:1470)
    at org.apache.hadoop.fs.s3.Jets3tFileSystemStore.get(Jets3tFileSystemStore.java:163)

Is any of the arguments I am passing wrong? I can access the s3 path using the aws cli:

aws s3 ls s3://bucketName/path

The path exists in S3.
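
As a further sanity check, one can probe the same path through the Hadoop FileSystem API with the keys set explicitly. This is a hedged sketch (placeholder keys) that isolates whether the failure is in Hadoop's S3 wiring or in CarbonData:

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Probe s3n directly through Hadoop, bypassing CarbonData entirely.
val conf = new Configuration()
conf.set("fs.s3n.awsAccessKeyId", "<accesskey>")
conf.set("fs.s3n.awsSecretAccessKey", "<secretaccesskey>")
val fs = FileSystem.get(new URI("s3n://bucketName/"), conf)
println(fs.exists(new Path("s3n://bucketName/path")))  // true if Hadoop can reach the path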

2 Answers:

Answer 0: (score: 3)

You can try with this example: https://github.com/apache/carbondata/blob/master/examples/spark2/src/main/scala/org/apache/carbondata/examples/S3Example.scala

You have to provide the AWS credential properties first, before creating the carbonSession.

If you have already created the sparkContext without providing the AWS properties, it will not pick them up even if you provide them to the carbonContext afterwards.
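
In other words, something along these lines — a sketch with placeholder keys, using the s3n property names from the question's own snippets:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.CarbonSession._

// Credentials go into the builder *before* the CarbonSession is created;
// the spark.hadoop. prefix forwards them into the Hadoop configuration.
val carbon = SparkSession
  .builder()
  .config("spark.hadoop.fs.s3n.awsAccessKeyId", "<accesskey>")
  .config("spark.hadoop.fs.s3n.awsSecretAccessKey", "<secretaccesskey>")
  .getOrCreateCarbonSession("s3n://bucketName/table/carbon_table")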

Answer 1: (score: 0)

Hi vikas, looking at your exception: "Empty key" simply means that your access key and secret key are not bound in the carbon session, because when we wrote the S3 implementation the logic was that if any key is not provided by the user, its value should default to empty.

So to keep things simple, first build the carbondata jar with this command:

mvn -Pspark-2.1 clean package

Then execute spark-submit with this command:

./spark-submit --jars file:///home/anubhav/Downloads/softwares/spark-2.2.1-bin-hadoop2.7/carbonlib/apache-carbondata-1.4.0-SNAPSHOT-bin-spark2.2.1-hadoop2.7.2.jar --class org.apache.carbondata.examples.S3Example /home/anubhav/Documents/carbondata/carbondata/carbondata/examples/spark2/target/carbondata-examples-spark2-1.4.0-SNAPSHOT.jar local

Replace my jar path with yours and see if it works; it worked for me.
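
For reference, the S3Example invocation earlier in the question suggests the argument order is access key, secret key, table path, endpoint, then spark master, so a spark-submit along these lines (hypothetical paths, placeholder credentials) should bind the keys:

./spark-submit \
  --jars /path/to/apache-carbondata-1.4.0-SNAPSHOT-bin-spark2.2.1-hadoop2.7.2.jar \
  --class org.apache.carbondata.examples.S3Example \
  /path/to/carbondata-examples-spark2-1.4.0-SNAPSHOT.jar \
  <accesskey> <secretkey> s3://bucketName/path/carbon_table <endpoint-url> local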