Simple Spark Streaming application allocates all memory in the cluster - GCP Dataproc

Date: 2019-06-08 00:58:33

Tags: apache-spark spark-streaming yarn google-cloud-dataproc

After the application's state changes to RUNNING, this simple Spark Streaming app, which does no heavy in-memory computation, consumes 17 GB of memory.

Cluster setup:

  • 1 master node (2 vCPUs, 13.0 GB memory)
  • 2 worker nodes (2 vCPUs, 13.0 GB memory each)

The YARN ResourceManager shows: Mem Total - 18 GB, vCores Total - 4 (YARN is handed only part of each worker's 13 GB, roughly 9 GB per node).

The Spark Streaming application source code can be found here, and as you can see it doesn't do much:
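The linked source is not preserved in this copy. As a rough illustration only, a minimal DStream application of the kind described might look like the sketch below (assumptions: a basic socket word count; the object name, host, and port are placeholders, not the author's actual code):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical minimal streaming app: reads lines from a socket,
// counts words per 10-second batch, and prints the counts.
object SimpleStreamingApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SimpleStreamingApp")
    val ssc = new StreamingContext(conf, Seconds(10))

    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```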

Spark submit command (run over SSH rather than the gcloud SDK):
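The actual command is likewise missing from this copy. A typical invocation over SSH might look like the following sketch (the class name and jar path are placeholders carried over from the sketch above):

```sh
spark-submit \
  --class SimpleStreamingApp \
  --master yarn \
  --deploy-mode cluster \
  simple-streaming-app.jar
```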

(YARN UI: the application is in the RUNNING state.)

Why does such a simple application allocate that much memory?

I am using the default GCP Dataproc configuration. Should I change any YARN settings?

1 answer:

Answer 0 (score: 1)

How many tasks does your application have? Note that Dataproc enables dynamic allocation by default, so it will request more executors from YARN whenever there is a backlog of pending tasks, which on a small cluster can quickly look like the whole cluster's memory being allocated.
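If the goal is simply to cap the footprint, one option is to disable dynamic allocation and pin the executor count and size explicitly at submit time. A sketch, reusing the placeholder names from above (the specific values are illustrative, not tuned recommendations):

```sh
spark-submit \
  --class SimpleStreamingApp \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.dynamicAllocation.enabled=false \
  --conf spark.executor.instances=2 \
  --conf spark.executor.memory=2g \
  --conf spark.executor.cores=1 \
  simple-streaming-app.jar
```

With `spark.dynamicAllocation.enabled=false`, Spark honors `spark.executor.instances` as a fixed executor count instead of growing toward whatever YARN will grant.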