What is the best way to write numpy arrays directly to S3 as separate files from Spark?

Asked: 2018-11-05 06:24:43

Tags: numpy amazon-s3 pyspark

I have a list of numpy arrays in memory as part of an RDD in a Spark application. What I want to do is save each value of the RDD (i.e., each array) as its own file on S3, so that S3 ends up with one .npy file per value in the RDD. I do not want to create any intermediate local files, because that would slow the application down.
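For concreteness, here is a minimal sketch of what I am after; the bucket name, key prefix, and index-based key scheme are placeholders I made up for illustration. Each array is serialized into an in-memory buffer with numpy.save and uploaded with boto3 from inside the partitions, so nothing touches local disk:

    import io

    import boto3
    import numpy as np
    from pyspark import SparkContext

    BUCKET = "my-bucket"   # placeholder bucket name
    PREFIX = "arrays"      # placeholder key prefix

    def save_partition(indexed_arrays):
        # Build the client inside the function so it is created on the
        # executor rather than pickled from the driver.
        s3 = boto3.client("s3")
        for idx, arr in indexed_arrays:
            buf = io.BytesIO()
            np.save(buf, arr)  # the .npy bytes go into memory, not onto disk
            buf.seek(0)
            s3.put_object(Bucket=BUCKET,
                          Key="%s/%d.npy" % (PREFIX, idx),
                          Body=buf)

    sc = SparkContext.getOrCreate()
    rdd = sc.parallelize([np.arange(10), np.ones((3, 3))])  # stand-in data
    rdd.zipWithIndex() \
       .map(lambda pair: (pair[1], pair[0])) \
       .foreachPartition(save_partition)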

I have already looked at the post how to write .npy file to s3 directly?. However, when I try to run that approach from a Spark application on AWS EMR, I get the following error:

OSError: [Errno [Errno 13] Permission denied: '/home/.config'] <function subimport at 0x7f87e0167320>: ('cottoncandy',)

    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
    at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:108)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
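The error itself seems to point at the cottoncandy import trying to create a config directory at '/home/.config' on the executors, which the YARN containers on EMR apparently cannot write to. One workaround I am considering (untested; /tmp as a writable location is an assumption) is to point HOME at a writable directory via Spark's documented spark.executorEnv.* setting:

    from pyspark import SparkConf, SparkContext

    # Redirect HOME on every executor to a directory the YARN containers
    # can write to, so cottoncandy can create its config there. /tmp is an
    # assumption; any executor-writable path should do.
    conf = SparkConf().set("spark.executorEnv.HOME", "/tmp")
    sc = SparkContext(conf=conf)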

How can I fix this?

0 Answers:

No answers yet.