Hi everyone, I'm having a problem saving a DataFrame. I found a similar, still-unanswered question: Saving Spark dataFrames as parquet files - no errors, but data is not being saved. My problem is that when I test the following code:
scala> import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.linalg.Vectors
scala> val dataset = spark.createDataFrame(
| Seq((0, 18, 1.0, Vectors.dense(0.0, 10.0, 0.5), 1.0))
| ).toDF("id", "hour", "mobile", "userFeatures", "clicked")
dataset: org.apache.spark.sql.DataFrame = [id: int, hour: int ... 3 more fields]
scala> dataset.show
+---+----+------+--------------+-------+
| id|hour|mobile| userFeatures|clicked|
+---+----+------+--------------+-------+
| 0| 18| 1.0|[0.0,10.0,0.5]| 1.0|
+---+----+------+--------------+-------+
scala> dataset.write.parquet("/home/vitrion/out")
No error is shown, and the DataFrame appears to have been saved as a Parquet file. Surprisingly, though, no files are created in the output directory.
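To double-check, one thing I can try is reading the path back from the same shell. A minimal sanity check (my own sketch, not part of the failing code): if the driver's filesystem really has no Parquet files at that path, Spark 2.x throws an AnalysisException ("Unable to infer schema for Parquet") rather than returning an empty DataFrame:

// Sanity check: fails with AnalysisException if the driver sees no Parquet
// files at this path, which would confirm nothing is visible locally.
val check = spark.read.parquet("/home/vitrion/out")
check.show()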
This is my cluster configuration:
The log file says:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
18/03/01 12:56:53 INFO CoarseGrainedExecutorBackend: Started daemon with process name: 51016@t630-0
18/03/01 12:56:53 INFO SignalUtils: Registered signal handler for TERM
18/03/01 12:56:53 INFO SignalUtils: Registered signal handler for HUP
18/03/01 12:56:53 INFO SignalUtils: Registered signal handler for INT
18/03/01 12:56:53 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/03/01 12:56:54 WARN Utils: Your hostname, t630-0 resolves to a loopback address: 127.0.1.1; using 192.168.239.218 instead (on interface eno1)
18/03/01 12:56:54 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
18/03/01 12:56:54 INFO SecurityManager: Changing view acls to: vitrion
18/03/01 12:56:54 INFO SecurityManager: Changing modify acls to: vitrion
18/03/01 12:56:54 INFO SecurityManager: Changing view acls groups to:
18/03/01 12:56:54 INFO SecurityManager: Changing modify acls groups to:
18/03/01 12:56:54 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(vitrion); groups with view permissions: Set(); users with modify permissions: Set(vitrion); groups with modify permissions: Set()
18/03/01 12:56:54 INFO TransportClientFactory: Successfully created connection to /192.168.239.54:42629 after 80 ms (0 ms spent in bootstraps)
18/03/01 12:56:54 INFO SecurityManager: Changing view acls to: vitrion
18/03/01 12:56:54 INFO SecurityManager: Changing modify acls to: vitrion
18/03/01 12:56:54 INFO SecurityManager: Changing view acls groups to:
18/03/01 12:56:54 INFO SecurityManager: Changing modify acls groups to:
18/03/01 12:56:54 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(vitrion); groups with view permissions: Set(); users with modify permissions: Set(vitrion); groups with modify permissions: Set()
18/03/01 12:56:54 INFO TransportClientFactory: Successfully created connection to /192.168.239.54:42629 after 2 ms (0 ms spent in bootstraps)
18/03/01 12:56:54 INFO DiskBlockManager: Created local directory at /tmp/spark-d749d72b-6db2-4f02-8dae-481c0ea1f68f/executor-f379929a-3a6a-4366-8983-b38e19fb9cfc/blockmgr-c6d89ef4-b22a-4344-8816-23306722d40c
18/03/01 12:56:54 INFO MemoryStore: MemoryStore started with capacity 8.4 GB
18/03/01 12:56:54 INFO CoarseGrainedExecutorBackend: Connecting to driver: spark://CoarseGrainedScheduler@192.168.239.54:42629
18/03/01 12:56:54 INFO WorkerWatcher: Connecting to worker spark://Worker@192.168.239.218:45532
18/03/01 12:56:54 INFO TransportClientFactory: Successfully created connection to /192.168.239.218:45532 after 1 ms (0 ms spent in bootstraps)
18/03/01 12:56:54 INFO WorkerWatcher: Successfully connected to spark://Worker@192.168.239.218:45532
18/03/01 12:56:54 INFO CoarseGrainedExecutorBackend: Successfully registered with driver
18/03/01 12:56:54 INFO Executor: Starting executor ID 2 on host 192.168.239.218
18/03/01 12:56:54 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 37178.
18/03/01 12:56:54 INFO NettyBlockTransferService: Server created on 192.168.239.218:37178
18/03/01 12:56:54 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
18/03/01 12:56:54 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(2, 192.168.239.218, 37178, None)
18/03/01 12:56:54 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(2, 192.168.239.218, 37178, None)
18/03/01 12:56:54 INFO BlockManager: Initialized BlockManager: BlockManagerId(2, 192.168.239.218, 37178, None)
18/03/01 12:56:54 INFO Executor: Using REPL class URI: spark://192.168.239.54:42629/classes
18/03/01 12:57:54 INFO CoarseGrainedExecutorBackend: Got assigned task 0
18/03/01 12:57:54 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
18/03/01 12:57:54 INFO TorrentBroadcast: Started reading broadcast variable 0
18/03/01 12:57:55 INFO TransportClientFactory: Successfully created connection to /192.168.239.54:35081 after 1 ms (0 ms spent in bootstraps)
18/03/01 12:57:55 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 28.1 KB, free 8.4 GB)
18/03/01 12:57:55 INFO TorrentBroadcast: Reading broadcast variable 0 took 103 ms
18/03/01 12:57:55 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 76.6 KB, free 8.4 GB)
18/03/01 12:57:55 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
18/03/01 12:57:55 INFO SQLHadoopMapReduceCommitProtocol: Using user defined output committer class org.apache.parquet.hadoop.ParquetOutputCommitter
18/03/01 12:57:55 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
18/03/01 12:57:55 INFO SQLHadoopMapReduceCommitProtocol: Using output committer class org.apache.parquet.hadoop.ParquetOutputCommitter
18/03/01 12:57:55 INFO CodecConfig: Compression: SNAPPY
18/03/01 12:57:55 INFO CodecConfig: Compression: SNAPPY
18/03/01 12:57:55 INFO ParquetOutputFormat: Parquet block size to 134217728
18/03/01 12:57:55 INFO ParquetOutputFormat: Parquet page size to 1048576
18/03/01 12:57:55 INFO ParquetOutputFormat: Parquet dictionary page size to 1048576
18/03/01 12:57:55 INFO ParquetOutputFormat: Dictionary is on
18/03/01 12:57:55 INFO ParquetOutputFormat: Validation is off
18/03/01 12:57:55 INFO ParquetOutputFormat: Writer version is: PARQUET_1_0
18/03/01 12:57:55 INFO ParquetOutputFormat: Maximum row group padding size is 0 bytes
18/03/01 12:57:55 INFO ParquetOutputFormat: Page size checking is: estimated
18/03/01 12:57:55 INFO ParquetOutputFormat: Min row count for page size check is: 100
18/03/01 12:57:55 INFO ParquetOutputFormat: Max row count for page size check is: 10000
18/03/01 12:57:55 INFO ParquetWriteSupport: Initialized Parquet WriteSupport with Catalyst schema:
{
  "type" : "struct",
  "fields" : [ {
    "name" : "id",
    "type" : "integer",
    "nullable" : false,
    "metadata" : { }
  }, {
    "name" : "hour",
    "type" : "integer",
    "nullable" : false,
    "metadata" : { }
  }, {
    "name" : "mobile",
    "type" : "double",
    "nullable" : false,
    "metadata" : { }
  }, {
    "name" : "userFeatures",
    "type" : {
      "type" : "udt",
      "class" : "org.apache.spark.ml.linalg.VectorUDT",
      "pyClass" : "pyspark.ml.linalg.VectorUDT",
      "sqlType" : {
        "type" : "struct",
        "fields" : [ {
          "name" : "type",
          "type" : "byte",
          "nullable" : false,
          "metadata" : { }
        }, {
          "name" : "size",
          "type" : "integer",
          "nullable" : true,
          "metadata" : { }
        }, {
          "name" : "indices",
          "type" : {
            "type" : "array",
            "elementType" : "integer",
            "containsNull" : false
          },
          "nullable" : true,
          "metadata" : { }
        }, {
          "name" : "values",
          "type" : {
            "type" : "array",
            "elementType" : "double",
            "containsNull" : false
          },
          "nullable" : true,
          "metadata" : { }
        } ]
      }
    },
    "nullable" : true,
    "metadata" : { }
  }, {
    "name" : "clicked",
    "type" : "double",
    "nullable" : false,
    "metadata" : { }
  } ]
}
and corresponding Parquet message type:
message spark_schema {
  required int32 id;
  required int32 hour;
  required double mobile;
  optional group userFeatures {
    required int32 type (INT_8);
    optional int32 size;
    optional group indices (LIST) {
      repeated group list {
        required int32 element;
      }
    }
    optional group values (LIST) {
      repeated group list {
        required double element;
      }
    }
  }
  required double clicked;
}
18/03/01 12:57:55 INFO CodecPool: Got brand-new compressor [.snappy]
18/03/01 12:57:55 INFO InternalParquetRecordWriter: Flushing mem columnStore to file. allocated memory: 84
18/03/01 12:57:55 INFO FileOutputCommitter: Saved output of task 'attempt_20180301125755_0000_m_000000_0' to file:/home/vitrion/out/_temporary/0/task_20180301125755_0000_m_000000
18/03/01 12:57:55 INFO SparkHadoopMapRedUtil: attempt_20180301125755_0000_m_000000_0: Committed
18/03/01 12:57:55 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 1967 bytes result sent to driver
Can you help me solve this problem?
Thanks
Answer 0 (score: 1)
Have you tried it without the Vector? I have seen in the past that complex data structures cause problems when writing.
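For example, here is a minimal sketch of writing the same data with the Vector column converted to a plain Array[Double] via Vector.toArray before the write (the out_plain output path is just a placeholder):

import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.{col, udf}

// Replace the VectorUDT column with a plain array of doubles, so the
// Parquet schema contains only primitive and array types, no UDT.
val vecToArray = udf((v: Vector) => v.toArray)
val flat = dataset.withColumn("userFeatures", vecToArray(col("userFeatures")))
flat.write.parquet("/home/vitrion/out_plain")  // placeholder output path

If this version produces output while the original does not, that would point at the Vector column rather than the cluster setup.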