所以我在以下结构中有一个Pyspark项目:
main.py:做真实的东西(从utils.py导入pyspark udf,从common.py导入东西)
utils.py:一些实用功能(从common.py导入)
common.py:some params
在Pyspark shell中,我可以按顺序运行common.py,utils.py,main.py中的代码,并且可以得到我想要的结果;但是,如果我通过spark-submit提交它,则不会报告任何错误,但作业仍然在群集负载< 1%,我怀疑这表明什么都没有被计算出来。
以下是spark-submit代码:
spark-submit --master yarn --deploy-mode cluster --driver-cores 4 --driver-memory 20G --num-executors 10 --executor-cores 4 --executor-memory 20G --py-files project.zip project/main.py
这是main.py中的内容(其他2 .py文件有点冗长):
from pyspark.sql import SparkSession
from utils import foo1, foo2
from common import bar1, bar2
if __name__ == '__main__':
input_path = bar1
output_path = bar2
# build spark session
spark = SparkSession.builder\
.appName("app")\
.getOrCreate()
rd = spark.read.json(input_path).repartition(100)
# add a new column Col2
rd_app = rd.withColumn('Col2', foo1(rd.Col1))
rd_app_not_null = rd_app.filter('Col2 is not null')
# cleanup queries
rd_no_query = rd_app_not_null\
.withColumn('URLClean', foo2(rd_app_not_null.URL))\
.drop('URL')\
.withColumnRenamed('URLClean', 'URL')
# save to S3
rd_no_query.write.json(output_path, compression='gzip')
在EMR上运行;在Yarn客户端模式下使用Pyspark shell,在Yarn集群模式下进行spark-submit。
感谢任何帮助!
编辑:最后几行日志(数小时保持这样):
17/03/08 22:06:31 INFO CodeGenerator: Code generated in 285.578911 ms
17/03/08 22:06:31 INFO CodeGenerator: Code generated in 41.334709 ms
17/03/08 22:06:31 INFO CodeGenerator: Code generated in 9.883221 ms
17/03/08 22:06:31 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 313.5 KB, free 11.8 GB)
17/03/08 22:06:31 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 28.2 KB, free 11.8 GB)
17/03/08 22:06:31 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on 172.31.42.228:35995 (size: 28.2 KB, free: 11.8 GB)
17/03/08 22:06:31 INFO SparkContext: Created broadcast 2 from json at NativeMethodAccessorImpl.java:0
17/03/08 22:06:31 INFO FileSourceScanExec: Planning scan with bin packing, max size: 134217728 bytes, open cost is considered as scanning 4194304 bytes.
17/03/08 22:06:31 INFO SparkContext: Starting job: json at NativeMethodAccessorImpl.java:0
17/03/08 22:06:31 INFO DAGScheduler: Registering RDD 8 (json at NativeMethodAccessorImpl.java:0)
17/03/08 22:06:31 INFO DAGScheduler: Got job 1 (json at NativeMethodAccessorImpl.java:0) with 100 output partitions
17/03/08 22:06:31 INFO DAGScheduler: Final stage: ResultStage 2 (json at NativeMethodAccessorImpl.java:0)
17/03/08 22:06:31 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 1)
17/03/08 22:06:31 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 1)
17/03/08 22:06:31 INFO DAGScheduler: Submitting ShuffleMapStage 1 (MapPartitionsRDD[8] at json at NativeMethodAccessorImpl.java:0), which has no missing parents
17/03/08 22:06:31 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 25.3 KB, free 11.8 GB)
17/03/08 22:06:31 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 9.8 KB, free 11.8 GB)
17/03/08 22:06:31 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on 172.31.42.228:35995 (size: 9.8 KB, free: 11.8 GB)
17/03/08 22:06:31 INFO SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:996
17/03/08 22:06:31 INFO DAGScheduler: Submitting 24 missing tasks from ShuffleMapStage 1 (MapPartitionsRDD[8] at json at NativeMethodAccessorImpl.java:0)
17/03/08 22:06:31 INFO YarnClusterScheduler: Adding task set 1.0 with 24 tasks
17/03/08 22:06:31 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 24, ip-172-31-46-251.ec2.internal, executor 2, partition 0, RACK_LOCAL, 6588 bytes)
17/03/08 22:06:31 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 25, ip-172-31-43-215.ec2.internal, executor 4, partition 1, RACK_LOCAL, 6588 bytes)
17/03/08 22:06:31 INFO TaskSetManager: Starting task 2.0 in stage 1.0 (TID 26, ip-172-31-41-81.ec2.internal, executor 3, partition 2, RACK_LOCAL, 6588 bytes)
17/03/08 22:06:31 INFO TaskSetManager: Starting task 3.0 in stage 1.0 (TID 27, ip-172-31-34-182.ec2.internal, executor 5, partition 3, RACK_LOCAL, 6588 bytes)
17/03/08 22:06:31 INFO TaskSetManager: Starting task 4.0 in stage 1.0 (TID 28, ip-172-31-46-251.ec2.internal, executor 2, partition 4, RACK_LOCAL, 6588 bytes)
17/03/08 22:06:31 INFO TaskSetManager: Starting task 5.0 in stage 1.0 (TID 29, ip-172-31-43-215.ec2.internal, executor 4, partition 5, RACK_LOCAL, 6588 bytes)
17/03/08 22:06:31 INFO TaskSetManager: Starting task 6.0 in stage 1.0 (TID 30, ip-172-31-41-81.ec2.internal, executor 3, partition 6, RACK_LOCAL, 6588 bytes)
17/03/08 22:06:31 INFO TaskSetManager: Starting task 7.0 in stage 1.0 (TID 31, ip-172-31-34-182.ec2.internal, executor 5, partition 7, RACK_LOCAL, 6588 bytes)
17/03/08 22:06:31 INFO TaskSetManager: Starting task 8.0 in stage 1.0 (TID 32, ip-172-31-46-251.ec2.internal, executor 2, partition 8, RACK_LOCAL, 6588 bytes)
17/03/08 22:06:31 INFO TaskSetManager: Starting task 9.0 in stage 1.0 (TID 33, ip-172-31-43-215.ec2.internal, executor 4, partition 9, RACK_LOCAL, 6588 bytes)
17/03/08 22:06:31 INFO TaskSetManager: Starting task 10.0 in stage 1.0 (TID 34, ip-172-31-41-81.ec2.internal, executor 3, partition 10, RACK_LOCAL, 6587 bytes)
17/03/08 22:06:31 INFO TaskSetManager: Starting task 11.0 in stage 1.0 (TID 35, ip-172-31-34-182.ec2.internal, executor 5, partition 11, RACK_LOCAL, 6588 bytes)
17/03/08 22:06:31 INFO TaskSetManager: Starting task 12.0 in stage 1.0 (TID 36, ip-172-31-46-251.ec2.internal, executor 2, partition 12, RACK_LOCAL, 6588 bytes)
17/03/08 22:06:31 INFO TaskSetManager: Starting task 13.0 in stage 1.0 (TID 37, ip-172-31-43-215.ec2.internal, executor 4, partition 13, RACK_LOCAL, 6587 bytes)
17/03/08 22:06:31 INFO TaskSetManager: Starting task 14.0 in stage 1.0 (TID 38, ip-172-31-41-81.ec2.internal, executor 3, partition 14, RACK_LOCAL, 6588 bytes)
17/03/08 22:06:31 INFO TaskSetManager: Starting task 15.0 in stage 1.0 (TID 39, ip-172-31-34-182.ec2.internal, executor 5, partition 15, RACK_LOCAL, 6587 bytes)
17/03/08 22:06:31 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on ip-172-31-46-251.ec2.internal:43957 (size: 9.8 KB, free: 11.8 GB)
17/03/08 22:06:31 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on ip-172-31-41-81.ec2.internal:40460 (size: 9.8 KB, free: 11.8 GB)
17/03/08 22:06:31 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on ip-172-31-34-182.ec2.internal:40881 (size: 9.8 KB, free: 11.8 GB)
17/03/08 22:06:31 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on ip-172-31-43-215.ec2.internal:34026 (size: 9.8 KB, free: 11.8 GB)
17/03/08 22:06:32 INFO YarnAllocator: Driver requested a total number of 5 executor(s).
17/03/08 22:06:32 INFO YarnAllocator: Will request 1 executor container(s), each with 4 core(s) and 22528 MB memory (including 2048 MB of overhead)
17/03/08 22:06:32 INFO ExecutorAllocationManager: Requesting 5 new executors because tasks are backlogged (new desired total will be 5)
17/03/08 22:06:32 INFO YarnAllocator: Submitted container request for host *.
17/03/08 22:06:32 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on ip-172-31-46-251.ec2.internal:43957 (size: 28.2 KB, free: 11.8 GB)
17/03/08 22:06:32 INFO AMRMClientImpl: Received new token for : ip-172-31-45-4.ec2.internal:8041
17/03/08 22:06:32 INFO YarnAllocator: Launching container container_1489010282810_0001_01_000009 on host ip-172-31-45-4.ec2.internal
17/03/08 22:06:32 INFO YarnAllocator: Received 1 containers from YARN, launching executors on 1 of them.
17/03/08 22:06:32 INFO ContainerManagementProtocolProxy: yarn.client.max-cached-nodemanagers-proxies : 0
17/03/08 22:06:32 INFO ContainerManagementProtocolProxy: Opening proxy : ip-172-31-45-4.ec2.internal:8041
17/03/08 22:06:33 INFO YarnAllocator: Driver requested a total number of 6 executor(s).
17/03/08 22:06:33 INFO YarnAllocator: Will request 1 executor container(s), each with 4 core(s) and 22528 MB memory (including 2048 MB of overhead)
17/03/08 22:06:33 INFO ExecutorAllocationManager: Requesting 1 new executor because tasks are backlogged (new desired total will be 6)
17/03/08 22:06:33 INFO YarnAllocator: Submitted container request for host *.
17/03/08 22:06:33 INFO YarnAllocator: Launching container container_1489010282810_0001_01_000012 on host ip-172-31-40-142.ec2.internal
17/03/08 22:06:33 INFO YarnAllocator: Received 1 containers from YARN, launching executors on 1 of them.
17/03/08 22:06:33 INFO ContainerManagementProtocolProxy: yarn.client.max-cached-nodemanagers-proxies : 0
17/03/08 22:06:33 INFO ContainerManagementProtocolProxy: Opening proxy : ip-172-31-40-142.ec2.internal:8041
17/03/08 22:06:34 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on ip-172-31-41-81.ec2.internal:40460 (size: 28.2 KB, free: 11.8 GB)
17/03/08 22:06:34 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on ip-172-31-34-182.ec2.internal:40881 (size: 28.2 KB, free: 11.8 GB)
17/03/08 22:06:36 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(null) (172.31.40.142:36208) with ID 8
17/03/08 22:06:36 INFO TaskSetManager: Starting task 16.0 in stage 1.0 (TID 40, ip-172-31-40-142.ec2.internal, executor 8, partition 16, RACK_LOCAL, 6588 bytes)
17/03/08 22:06:36 INFO ExecutorAllocationManager: New executor 8 has registered (new total is 5)
17/03/08 22:06:36 INFO TaskSetManager: Starting task 17.0 in stage 1.0 (TID 41, ip-172-31-40-142.ec2.internal, executor 8, partition 17, RACK_LOCAL, 6587 bytes)
17/03/08 22:06:36 INFO TaskSetManager: Starting task 18.0 in stage 1.0 (TID 42, ip-172-31-40-142.ec2.internal, executor 8, partition 18, RACK_LOCAL, 6587 bytes)
17/03/08 22:06:36 INFO TaskSetManager: Starting task 19.0 in stage 1.0 (TID 43, ip-172-31-40-142.ec2.internal, executor 8, partition 19, RACK_LOCAL, 6587 bytes)
17/03/08 22:06:36 INFO BlockManagerMasterEndpoint: Registering block manager ip-172-31-40-142.ec2.internal:42371 with 11.8 GB RAM, BlockManagerId(8, ip-172-31-40-142.ec2.internal, 42371, None)
17/03/08 22:06:36 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on ip-172-31-43-215.ec2.internal:34026 (size: 28.2 KB, free: 11.8 GB)
17/03/08 22:06:36 INFO AMRMClientImpl: Received new token for : ip-172-31-33-112.ec2.internal:8041
17/03/08 22:06:36 INFO YarnAllocator: Received 2 containers from YARN, launching executors on 0 of them.
17/03/08 22:06:36 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on ip-172-31-40-142.ec2.internal:42371 (size: 9.8 KB, free: 11.8 GB)
17/03/08 22:06:37 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(null) (172.31.45.4:33330) with ID 7
17/03/08 22:06:37 INFO TaskSetManager: Starting task 20.0 in stage 1.0 (TID 44, ip-172-31-45-4.ec2.internal, executor 7, partition 20, RACK_LOCAL, 6587 bytes)
17/03/08 22:06:37 INFO ExecutorAllocationManager: New executor 7 has registered (new total is 6)
17/03/08 22:06:37 INFO TaskSetManager: Starting task 21.0 in stage 1.0 (TID 45, ip-172-31-45-4.ec2.internal, executor 7, partition 21, RACK_LOCAL, 6587 bytes)
17/03/08 22:06:37 INFO TaskSetManager: Starting task 22.0 in stage 1.0 (TID 46, ip-172-31-45-4.ec2.internal, executor 7, partition 22, RACK_LOCAL, 6587 bytes)
17/03/08 22:06:37 INFO TaskSetManager: Starting task 23.0 in stage 1.0 (TID 47, ip-172-31-45-4.ec2.internal, executor 7, partition 23, RACK_LOCAL, 6587 bytes)
17/03/08 22:06:37 INFO BlockManagerMasterEndpoint: Registering block manager ip-172-31-45-4.ec2.internal:45646 with 11.8 GB RAM, BlockManagerId(7, ip-172-31-45-4.ec2.internal, 45646, None)
17/03/08 22:06:38 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on ip-172-31-40-142.ec2.internal:42371 (size: 28.2 KB, free: 11.8 GB)
17/03/08 22:06:38 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on ip-172-31-45-4.ec2.internal:45646 (size: 9.8 KB, free: 11.8 GB)
17/03/08 22:06:42 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on ip-172-31-45-4.ec2.internal:45646 (size: 28.2 KB, free: 11.8 GB)