How to save a parquet file to S3 from AWS SageMaker?

Asked: 2018-03-30 18:28:26

Tags: amazon-web-services apache-spark hadoop amazon-s3 amazon-sagemaker

I want to save a Spark DataFrame to S3 from AWS SageMaker. In the notebook, I ran

myDF.write.mode('overwrite').parquet("s3a://my-bucket/dir/dir2/")

and I got:

Py4JJavaError: An error occurred while calling o326.parquet.
: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3native.NativeS3FileSystem not found
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2195)
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
    at org.apache.spark.sql.execution.datasources.DataSource.writeInFileFormat(DataSource.scala:394)
    at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:471)
    at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:50)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
    at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:609)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:217)
    at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:508)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:280)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3native.NativeS3FileSystem not found
    at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2101)
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)

How should I do this correctly from the notebook? Many thanks!

1 answer:

Answer 0 (score: 0)

A SageMaker notebook instance does not run Spark itself, so it lacks Hadoop and the other Java classes your code is trying to call.

You typically use the Jupyter notebook on SageMaker with Python libraries such as Pandas, and Pandas can write parquet files directly (see e.g. https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_parquet.html).

Another option is to connect from the Jupyter notebook to an existing (or new) Spark cluster and execute the commands there remotely. See here for documentation on how to set up that connection: https://aws.amazon.com/blogs/machine-learning/build-amazon-sagemaker-notebooks-backed-by-spark-in-amazon-emr/
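With that setup in place (a sparkmagic/Livy-backed PySpark kernel pointing at an EMR cluster, as the linked post describes), notebook cells execute on the cluster rather than on the notebook instance, and EMR's Hadoop ships with the S3 filesystem connectors. A hedged sketch of what the original write would then look like, assuming the kernel is connected and `myDF` exists on the cluster:

```python
# Runs on the EMR cluster via Livy, not on the SageMaker notebook instance.
# EMR provides the S3 connectors, so the write that failed locally succeeds
# here; EMRFS "s3://" URIs are the usual choice on EMR.
myDF.write.mode('overwrite').parquet("s3://my-bucket/dir/dir2/")
```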