Spark - repartitioning a gzip-compressed file

Posted: 2017-03-28 20:58:52

Tags: apache-spark pyspark

I receive a very large gzip file and process it with Spark/Python. In some cases, however, Spark does not seem to decompress the file. What is especially confusing is that I can .count() the same file, but not .repartition() it.

Some simplified example code:

from __future__ import print_function

import sys

import pyspark


def main():
    sc = pyspark.SparkContext()
    # Read the (possibly gzip-compressed) input file as lines of text.
    asText = sc.textFile(sys.argv[1])
    # Counting the lines works fine on the gzipped file...
    print("COUNT: ", asText.count())
    # ...but collecting after a repartition() fails.
    partitioned = asText.repartition(2)
    print("PARTITIONED: ", partitioned.collect())


if __name__ == '__main__':
    main()

The results, using a trivial input file, with extra log lines elided:

$ yes | head -n 10 | gzip >y.gz
$ spark-submit test.py y.gz
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/03/28 16:43:32 INFO SparkContext: Running Spark version 2.1.0
17/03/28 16:43:32 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[...]
17/03/28 16:43:35 INFO CodecPool: Got brand-new decompressor [.gz]
17/03/28 16:43:36 INFO PythonRunner: Times: total = 486, boot = 471, init = 15, finish = 0
17/03/28 16:43:36 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 1652 bytes result sent to driver
17/03/28 16:43:36 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 879 ms on localhost (executor driver) (1/1)
17/03/28 16:43:36 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
17/03/28 16:43:36 INFO DAGScheduler: ResultStage 0 (count at /Users/changed/test.py:11) finished in 0.927 s
17/03/28 16:43:36 INFO DAGScheduler: Job 0 finished: count at /Users/changed/test.py:11, took 1.082310 s
COUNT:  10
17/03/28 16:43:36 INFO BlockManagerInfo: Removed broadcast_1_piece0 on 192.168.3.98:58752 in memory (size: 3.8 KB, free: 366.3 MB)
17/03/28 16:43:36 INFO SparkContext: Starting job: collect at /Users/changed/test.py:13
17/03/28 16:43:36 INFO DAGScheduler: Registering RDD 4 (coalesce at NativeMethodAccessorImpl.java:0)
17/03/28 16:43:36 INFO DAGScheduler: Got job 1 (collect at /Users/changed/test.py:13) with 2 output partitions
17/03/28 16:43:36 INFO DAGScheduler: Final stage: ResultStage 2 (collect at /Users/changed/test.py:13)
17/03/28 16:43:36 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 1)
17/03/28 16:43:36 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 1)
17/03/28 16:43:36 INFO DAGScheduler: Submitting ShuffleMapStage 1 (MapPartitionsRDD[4] at coalesce at NativeMethodAccessorImpl.java:0), which has no missing parents
17/03/28 16:43:36 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 6.8 KB, free 366.0 MB)
17/03/28 16:43:36 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 4.1 KB, free 366.0 MB)
17/03/28 16:43:36 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on 192.168.3.98:58752 (size: 4.1 KB, free: 366.3 MB)
17/03/28 16:43:36 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:996
17/03/28 16:43:36 INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 1 (MapPartitionsRDD[4] at coalesce at NativeMethodAccessorImpl.java:0)
17/03/28 16:43:36 INFO TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
17/03/28 16:43:36 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, localhost, executor driver, partition 0, PROCESS_LOCAL, 6054 bytes)
17/03/28 16:43:36 INFO Executor: Running task 0.0 in stage 1.0 (TID 1)
17/03/28 16:43:36 INFO HadoopRDD: Input split: file:/Users/changed/y.gz:0+24
17/03/28 16:43:36 INFO CodecPool: Got brand-new decompressor [.gz]
17/03/28 16:43:37 INFO PythonRunner: Times: total = 4, boot = -288, init = 292, finish = 0
17/03/28 16:43:37 INFO Executor: Finished task 0.0 in stage 1.0 (TID 1). 2051 bytes result sent to driver
17/03/28 16:43:37 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 377 ms on localhost (executor driver) (1/1)
17/03/28 16:43:37 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
17/03/28 16:43:37 INFO DAGScheduler: ShuffleMapStage 1 (coalesce at NativeMethodAccessorImpl.java:0) finished in 0.379 s
17/03/28 16:43:37 INFO DAGScheduler: looking for newly runnable stages
17/03/28 16:43:37 INFO DAGScheduler: running: Set()
17/03/28 16:43:37 INFO DAGScheduler: waiting: Set(ResultStage 2)
17/03/28 16:43:37 INFO DAGScheduler: failed: Set()
17/03/28 16:43:37 INFO DAGScheduler: Submitting ResultStage 2 (MapPartitionsRDD[7] at coalesce at NativeMethodAccessorImpl.java:0), which has no missing parents
17/03/28 16:43:37 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 3.7 KB, free 366.0 MB)
17/03/28 16:43:37 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 2.2 KB, free 366.0 MB)
17/03/28 16:43:37 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on 192.168.3.98:58752 (size: 2.2 KB, free: 366.3 MB)
17/03/28 16:43:37 INFO SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:996
17/03/28 16:43:37 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 2 (MapPartitionsRDD[7] at coalesce at NativeMethodAccessorImpl.java:0)
17/03/28 16:43:37 INFO TaskSchedulerImpl: Adding task set 2.0 with 2 tasks
17/03/28 16:43:37 INFO TaskSetManager: Starting task 1.0 in stage 2.0 (TID 2, localhost, executor driver, partition 1, PROCESS_LOCAL, 6119 bytes)
17/03/28 16:43:37 INFO TaskSetManager: Starting task 0.0 in stage 2.0 (TID 3, localhost, executor driver, partition 0, ANY, 6119 bytes)
17/03/28 16:43:37 INFO Executor: Running task 1.0 in stage 2.0 (TID 2)
17/03/28 16:43:37 INFO Executor: Running task 0.0 in stage 2.0 (TID 3)
17/03/28 16:43:37 INFO ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
17/03/28 16:43:37 INFO ShuffleBlockFetcherIterator: Getting 0 non-empty blocks out of 1 blocks
17/03/28 16:43:37 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 4 ms
17/03/28 16:43:37 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 4 ms
17/03/28 16:43:37 INFO Executor: Finished task 1.0 in stage 2.0 (TID 2). 1461 bytes result sent to driver
17/03/28 16:43:37 INFO TaskSetManager: Finished task 1.0 in stage 2.0 (TID 2) in 55 ms on localhost (executor driver) (1/2)
17/03/28 16:43:37 INFO Executor: Finished task 0.0 in stage 2.0 (TID 3). 1552 bytes result sent to driver
17/03/28 16:43:37 INFO TaskSetManager: Finished task 0.0 in stage 2.0 (TID 3) in 48 ms on localhost (executor driver) (2/2)
17/03/28 16:43:37 INFO TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool
17/03/28 16:43:37 INFO DAGScheduler: ResultStage 2 (collect at /Users/changed/test.py:13) finished in 0.061 s
17/03/28 16:43:37 INFO DAGScheduler: Job 1 finished: collect at /Users/changed/test.py:13, took 0.507028 s
Traceback (most recent call last):
  File "/Users/changed/test.py", line 17, in <module>
    main()
  File "/Users/changed/test.py", line 13, in main
    print("PARTITIONED: ", partitioned.collect())
  File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/lib/pyspark.zip/pyspark/rdd.py", line 810, in collect
  File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/lib/pyspark.zip/pyspark/rdd.py", line 140, in _load_from_socket
  File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/lib/pyspark.zip/pyspark/serializers.py", line 529, in load_stream
  File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/lib/pyspark.zip/pyspark/serializers.py", line 524, in loads
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
17/03/28 16:43:37 INFO SparkContext: Invoking stop() from shutdown hook
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte
[...]

The test file does contain a 0x80 byte, which is why I wonder whether the file is simply not being decompressed the second time around:

$ xxd y.gz
00000000: 1f8b 0800 eeca da58 0003 abe4 aac4 8000  .......X........
00000010: 2a98 c9ed 1400 0000                      *.......
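As a quick sanity check, the archive itself should decompress cleanly outside of Spark; here is a minimal sketch using Python's standard gzip module, assuming the y.gz generated above. For "yes | head -n 10 | gzip" the payload is simply b'y\n' repeated ten times, so the 0x80 byte exists only in the compressed representation:

import gzip

# Decompress y.gz directly to confirm the archive is intact.
with gzip.open('y.gz', 'rb') as f:
    data = f.read()
print(repr(data))           # expected: b'y\ny\ny\n...' (ten lines)
assert data == b'y\n' * 10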

The same Python job runs fine on an uncompressed file.

Am I missing some step that repartitioning requires?

Update: I noticed that the error goes away if I add a map before the repartition, e.g. partitioned = asText.map(lambda x: x).repartition(2). That may serve as a workaround, but it is somewhat confusing and unsatisfying.
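For reference, a sketch of main() with that workaround applied; the only change from the code above is the identity map:

def main():
    sc = pyspark.SparkContext()
    asText = sc.textFile(sys.argv[1])
    print("COUNT: ", asText.count())
    # Identity map inserted before repartition(); with this in place,
    # collect() no longer raises the UnicodeDecodeError.
    partitioned = asText.map(lambda x: x).repartition(2)
    print("PARTITIONED: ", partitioned.collect())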

0 Answers:

No answers yet.