Question

Here is a simple test program. It writes only a tiny amount of test data.

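from pyspark.sql import SparkSession
from pyspark.sql.types import Row
from pyspark.sql.types import *
import pyspark.sql.functions as spark_functions

# Reuse the session the shell or notebook provides, or create one in a standalone script.
spark = SparkSession.builder.getOrCreate()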

schema = StructType([
    StructField("cola", StringType()),
    StructField("colb", IntegerType()),
])

rows = [
    Row("alpha", 1),
    Row("beta", 2),
    Row("gamma", 3),
    Row("delta", 4)
]

data_frame = spark.createDataFrame(rows, schema)

print("count={}".format(data_frame.count()))

data_frame.write.save("s3a://my-bucket/test_data.parquet", mode="overwrite")

print("done")

This fails with:

Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: No space available in any of the local directories.
    at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:366)
    at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.createTmpFileForWrite(LocalDirAllocator.java:416)

This is running on Amazon EMR with S3 storage. There is plenty of disk space available. Can anyone explain this?

Answer 1

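I hit the same error using Spark 2.2 on EMR. Setting fs.s3a.fast.upload=true and fs.s3a.buffer.dir="/home/hadoop,/tmp" (or any other folder, for that matter) did not help me. It looks like my problem was related to shuffle space.

I had to add --conf spark.shuffle.service.enabled=true to spark-submit / spark-shell to resolve this error.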
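
As a rough sketch of how the settings mentioned above could be passed on the command line, the snippet below assembles a spark-submit argument list. The script name test_job.py and the helper build_submit_command are hypothetical, and the buffer directories are just the ones quoted above. The spark.hadoop. prefix is the usual way to forward Hadoop (S3A) properties through Spark configuration. Only the shuffle-service setting is reported to have fixed the error; the S3A settings are included for completeness.

```python
import shlex

# Settings discussed in this answer. Only the first one is reported to have
# resolved the error; the two S3A settings did not help the answerer.
SETTINGS = {
    "spark.shuffle.service.enabled": "true",
    "spark.hadoop.fs.s3a.fast.upload": "true",
    "spark.hadoop.fs.s3a.buffer.dir": "/home/hadoop,/tmp",  # example directories
}


def build_submit_command(script, settings):
    """Return a spark-submit argument list with one --conf flag per setting."""
    command = ["spark-submit"]
    for key, value in settings.items():
        command += ["--conf", "{}={}".format(key, value)]
    command.append(script)
    return command


# "test_job.py" is a placeholder for the script in the question.
command = build_submit_command("test_job.py", SETTINGS)

# Print a shell-safe version that can be pasted into a terminal.
print(" ".join(shlex.quote(part) for part in command))
```

The same --conf flags can be given to spark-shell or pyspark instead of spark-submit.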
