Spark - Unable to save dataframe to disk

Date: 2018-05-11 21:27:47

Tags: apache-spark pyspark apache-spark-sql spark-dataframe parquet

I'm running Spark in standalone mode with a Hive catalog. I'm trying to load data from an external text file and then save it back to disk in Parquet format.

import gensim
from pyspark.sql.types import StructType, StructField, StringType, ArrayType

# sc (SparkContext), sql_context (SQLContext) and NUM_SLICES are assumed
# to be defined elsewhere in the session.
rdd = sc \
    .textFile('/data/source.txt', NUM_SLICES) \
    .map(lambda x: (x[:5], x[6:12], gensim.utils.simple_preprocess(x[13:])))

schema = StructType([
    StructField('c1', StringType(), False),
    StructField('c2', StringType(), False),
    StructField('c3', ArrayType(StringType(), True), False),
])

data = sql_context.createDataFrame(rdd, schema)

data.write.mode('overwrite').parquet('/data/some_file')

When I try to read the data back, it fails with:

AnalysisException: 'Unable to infer schema for Parquet. It must be specified manually.;'
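The read step isn't shown above; presumably it is just the mirror image of the write, along the lines of this sketch (the sql_context handle and the path are assumed from the snippet above):

# Presumed read-back call. With only the _SUCCESS marker visible on the
# driver's file system, Parquet finds no footers to infer a schema from.
df = sql_context.read.parquet('/data/some_file')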

It looks like it simply cannot resolve the location or the file properties.

Now, if I look at that location on all 3 worker nodes, it looks like this:

clush -ab 'locate some_file'
---------------
master
---------------
/data/some_file
/data/some_file/._SUCCESS.crc
/data/some_file/_SUCCESS
---------------
worker1
---------------
/data/some_file
/data/some_file/_temporary
/data/some_file/_temporary/0
/data/some_file/_temporary/0/_temporary
/data/some_file/_temporary/0/task_20180511211832_0010_m_000000
/data/some_file/_temporary/0/task_20180511211832_0010_m_000039
/data/some_file/_temporary/0/task_20180511211832_0010_m_000000/.part-00000-1b2764a6-28a3-4ba2-9493-766074eef4d5-c000.snappy.parquet.crc
/data/some_file/_temporary/0/task_20180511211832_0010_m_000000/part-00000-1b2764a6-28a3-4ba2-9493-766074eef4d5-c000.snappy.parquet
/data/some_file/_temporary/0/task_20180511211832_0010_m_000039/.part-00039-1b2764a6-28a3-4ba2-9493-766074eef4d5-c000.snappy.parquet.crc
/data/some_file/_temporary/0/task_20180511211832_0010_m_000039/part-00039-1b2764a6-28a3-4ba2-9493-766074eef4d5-c000.snappy.parquet
---------------
worker2
---------------
/data/some_file
/data/some_file/_temporary
/data/some_file/_temporary/0
/data/some_file/_temporary/0/_temporary
/data/some_file/_temporary/0/task_20180511211832_0010_m_000011
/data/some_file/_temporary/0/task_20180511211832_0010_m_000017
/data/some_file/_temporary/0/task_20180511211832_0010_m_000029
/data/some_file/_temporary/0/task_20180511211832_0010_m_000038
/data/some_file/_temporary/0/task_20180511211832_0010_m_000011/.part-00011-1b2764a6-28a3-4ba2-9493-766074eef4d5-c000.snappy.parquet.crc
/data/some_file/_temporary/0/task_20180511211832_0010_m_000011/part-00011-1b2764a6-28a3-4ba2-9493-766074eef4d5-c000.snappy.parquet
/data/some_file/_temporary/0/task_20180511211832_0010_m_000017/.part-00017-1b2764a6-28a3-4ba2-9493-766074eef4d5-c000.snappy.parquet.crc
/data/some_file/_temporary/0/task_20180511211832_0010_m_000017/part-00017-1b2764a6-28a3-4ba2-9493-766074eef4d5-c000.snappy.parquet
/data/some_file/_temporary/0/task_20180511211832_0010_m_000029/.part-00029-1b2764a6-28a3-4ba2-9493-766074eef4d5-c000.snappy.parquet.crc
/data/some_file/_temporary/0/task_20180511211832_0010_m_000029/part-00029-1b2764a6-28a3-4ba2-9493-766074eef4d5-c000.snappy.parquet
/data/some_file/_temporary/0/task_20180511211832_0010_m_000038/.part-00038-1b2764a6-28a3-4ba2-9493-766074eef4d5-c000.snappy.parquet.crc
/data/some_file/_temporary/0/task_20180511211832_0010_m_000038/part-00038-1b2764a6-28a3-4ba2-9493-766074eef4d5-c000.snappy.parquet
---------------
worker3
---------------
/data/some_file
/data/some_file/_temporary
/data/some_file/_temporary/0
/data/some_file/_temporary/0/_temporary
/data/some_file/_temporary/0/task_20180511211832_0010_m_000040
/data/some_file/_temporary/0/task_20180511211832_0010_m_000043
/data/some_file/_temporary/0/task_20180511211832_0010_m_000046
/data/some_file/_temporary/0/task_20180511211832_0010_m_000040/.part-00040-1b2764a6-28a3-4ba2-9493-766074eef4d5-c000.snappy.parquet.crc
/data/some_file/_temporary/0/task_20180511211832_0010_m_000040/part-00040-1b2764a6-28a3-4ba2-9493-766074eef4d5-c000.snappy.parquet
/data/some_file/_temporary/0/task_20180511211832_0010_m_000043/.part-00043-1b2764a6-28a3-4ba2-9493-766074eef4d5-c000.snappy.parquet.crc
/data/some_file/_temporary/0/task_20180511211832_0010_m_000043/part-00043-1b2764a6-28a3-4ba2-9493-766074eef4d5-c000.snappy.parquet
/data/some_file/_temporary/0/task_20180511211832_0010_m_000046/.part-00046-1b2764a6-28a3-4ba2-9493-766074eef4d5-c000.snappy.parquet.crc
/data/some_file/_temporary/0/task_20180511211832_0010_m_000046/part-00046-1b2764a6-28a3-4ba2-9493-766074eef4d5-c000.snappy.parquet

I can't understand why it saves the data into '_temporary' instead of the permanent folder.

Let me know if you need any additional background information.

Thanks

1 Answer:

Answer 0 (score: 1)

TL;DR To save and load data in distributed mode you need a distributed file system. Local storage is not enough.


"I can't understand why it saves the data into '_temporary' instead of the permanent folder."

That's because you have no distributed file system. In a setup like this every executor can write out its own part of the result, but Spark has no way to finalize the job correctly: job commit runs on the driver and is supposed to promote the task outputs under _temporary/0/task_* into the destination directory, but on the master's local disk those task directories don't exist, so the only thing committed there is the _SUCCESS marker.

Moreover, since each executor can access only its own part of the result, there is no way to load the data back with Spark afterwards.
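For completeness, a minimal sketch of the fix, assuming an HDFS cluster reachable at a hypothetical namenode address (hdfs://namenode:8020 is an illustration, not something from the question): point both the write and the read at storage that the driver and every executor share.

# Hypothetical fix: write to (and read from) a distributed file system
# such as HDFS instead of each node's local disk.
data.write.mode('overwrite').parquet('hdfs://namenode:8020/data/some_file')
df = sql_context.read.parquet('hdfs://namenode:8020/data/some_file')

A shared POSIX mount (NFS, for example) exposed at the same path on every node would satisfy the same requirement.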