
Time: 2019-11-09 18:44:27

Tags: apache-spark pyspark

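The more online resources I read about Spark architecture and scheduling, the more confused I get. One resource says: "The number of tasks in a stage is the same as the number of partitions in the last RDD in the stage". Another says: "Spark maps the number tasks on a particular Executor to the number of cores allocated to it". So according to the first resource, if I have 1000 partitions, I will have 1000 tasks no matter what my machine looks like. In the second case, what happens if I have a 4-core machine and 1000 partitions? Will I have only 4 tasks? How does the data get processed then?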

Answer 0 (score: 2)

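Think of a task as an independent chunk of work that has to be processed. Tasks can certainly run in parallel.

So if you have 1000 partitions and 5 executors with 4 cores each, the stage still has 1000 tasks in total, but typically only 20 of them run in parallel at any one time. The rest wait until a core frees up. The two statements do not contradict each other: the first is about how many tasks exist, the second is about how many can run at once.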
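To make the arithmetic concrete, here is a rough sketch in plain Python (no Spark required). The helper `task_schedule` and its parameter names are made up for illustration and are not part of any Spark API. It assumes each task needs one core, which is the default for `spark.task.cpus`, and that all tasks take about the same time, so the number of "waves" is only an idealized estimate.

```python
import math


def task_schedule(num_partitions, num_executors, cores_per_executor, cpus_per_task=1):
    """Hypothetical helper: estimate task count, parallel slots, and waves.

    Assumes one task per partition and cpus_per_task cores per task
    (1 by default, matching Spark's default for spark.task.cpus).
    """
    # How many tasks a single executor can run at the same time.
    slots_per_executor = cores_per_executor // cpus_per_task
    # Total tasks that can run concurrently across the cluster.
    slots = num_executors * slots_per_executor
    if slots <= 0:
        raise ValueError("no task slots available with this configuration")
    # The stage has one task per partition, regardless of hardware.
    total_tasks = num_partitions
    # Idealized number of rounds needed if every task takes the same time.
    waves = math.ceil(total_tasks / slots)
    return total_tasks, slots, waves


# The example from the answer: 5 executors with 4 cores each.
print(task_schedule(1000, 5, 4))  # (1000, 20, 50)

# The case from the question: a single 4-core machine.
print(task_schedule(1000, 1, 4))  # (1000, 4, 250)
```

So on the 4-core machine from the question there are still 1000 tasks. Only 4 run at a time, and the whole stage takes roughly 250 rounds to finish.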