I am trying to integrate metrics from a Spark 2.1 job into Ganglia.
I added the Ganglia sink properties to my spark-default.conf, and when I submit the job I see these warnings:

Warning: Ignoring non-spark config property: *.sink.ganglia.host=host
Warning: Ignoring non-spark config property: *.sink.ganglia.name=Name
Warning: Ignoring non-spark config property: *.sink.ganglia.mode=unicast
Warning: Ignoring non-spark config property: *.sink.ganglia.class=org.apache.spark.metrics.sink.GangliaSink
Warning: Ignoring non-spark config property: *.sink.ganglia.period=10
Warning: Ignoring non-spark config property: *.sink.ganglia.port=8649
Warning: Ignoring non-spark config property: *.sink.ganglia.unit=seconds

My environment details:

Hadoop : Amazon 2.7.3 - emr-5.7.0
Spark : Spark 2.1.1
Ganglia: 3.7.2

If you have any inputs, or know of any alternatives to Ganglia, please reply.
Answer 0 (score: 1)
The metrics system is configured via a configuration file that Spark expects to find at $SPARK_HOME/conf/metrics.properties. A custom file location can be specified through the spark.metrics.conf configuration property.

So instead of adding these entries to spark-default.conf, move them to $SPARK_HOME/conf/metrics.properties.
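As an illustration, here is a minimal sketch of the second option, pointing spark-submit at a custom metrics file via spark.metrics.conf (the file path and application jar below are hypothetical placeholders):

# Tell Spark where to load the metrics configuration from.
# /home/hadoop/metrics.properties and my-app.jar are placeholder names.
spark-submit \
  --conf spark.metrics.conf=/home/hadoop/metrics.properties \
  my-app.jar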
Answer 1 (score: 1)
For EMR, you need to put these settings in /etc/spark/conf/metrics.properties on the master node.

Spark on EMR does include the Ganglia library:
$ ls -l /usr/lib/spark/external/lib/spark-ganglia-lgpl_*
-rw-r--r-- 1 root root 28376 Mar 22 00:43 /usr/lib/spark/external/lib/spark-ganglia-lgpl_2.11-2.3.0.jar
Additionally, your example is missing the equals sign (=) between the configuration names and values - not sure whether that is the problem. Below is a sample configuration that runs successfully for me; note the placeholder for the master IP, which is substituted before use (see the sketch after the config).
*.sink.ganglia.class=org.apache.spark.metrics.sink.GangliaSink
*.sink.ganglia.name=AMZN-EMR
*.sink.ganglia.host=$MASTERIP
*.sink.ganglia.port=8649
*.sink.ganglia.mode=unicast
*.sink.ganglia.period=10
*.sink.ganglia.unit=seconds
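$MASTERIP above is a placeholder, not something Spark expands for you. One hedged sketch for filling it in on the master node itself (metrics.properties.template is a hypothetical local copy of the config above):

# On the EMR master node, substitute the placeholder with the node's
# private IP before installing the file.
MASTERIP=$(hostname -i | awk '{print $1}')
sed "s/\$MASTERIP/$MASTERIP/" metrics.properties.template \
  | sudo tee /etc/spark/conf/metrics.properties > /dev/null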
Answer 2 (score: 0)
From this page: https://spark.apache.org/docs/latest/monitoring.html
Spark also supports a Ganglia sink which is not included in the default build due to licensing restrictions:
GangliaSink: Sends metrics to a Ganglia node or multicast group.
To install the GangliaSink you'll need to perform a custom build of Spark. Note that by embedding this library you will include LGPL-licensed code in your Spark package. For sbt users, set the SPARK_GANGLIA_LGPL environment variable before building. For Maven users, enable the -Pspark-ganglia-lgpl profile. In addition to modifying the cluster's Spark build, user applications will need to link to the spark-ganglia-lgpl artifact.
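For reference, the Maven route from that quote would look roughly like this when run from a Spark source checkout (a sketch, assuming the standard build/mvn wrapper that ships in the Spark repository):

# Build a Spark distribution that bundles the Ganglia sink.
# Run from the root of a Spark source checkout.
./build/mvn -Pspark-ganglia-lgpl -DskipTests clean package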
Answer 3 (score: 0)
Not sure whether anyone still needs this, but you have to provide the full Ganglia configuration:
# Ganglia conf
*.sink.ganglia.class=org.apache.spark.metrics.sink.GangliaSink
*.sink.ganglia.name=AMZN-EMR
*.sink.ganglia.host=$MASTERIP
*.sink.ganglia.port=8649
*.sink.ganglia.mode=unicast
*.sink.ganglia.period=10
*.sink.ganglia.unit=seconds
# Enable JvmSource for instance master, worker, driver and executor
master.source.jvm.class=org.apache.spark.metrics.source.JvmSource
worker.source.jvm.class=org.apache.spark.metrics.source.JvmSource
driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource
executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource
Even with the full configuration, I ran into this problem on AWS EMR 5.33.0:
21/05/26 14:18:20 ERROR org.apache.spark.metrics.MetricsSystem: Source class org.apache.spark.metrics.source.JvmSource cannot be instantiated
java.lang.ClassNotFoundException: org.apache.spark.metrics.source.JvmSource
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.util.Utils$.classForName(Utils.scala:239)
at org.apache.spark.metrics.MetricsSystem$$anonfun$registerSources$1.apply(MetricsSystem.scala:184)
at org.apache.spark.metrics.MetricsSystem$$anonfun$registerSources$1.apply(MetricsSystem.scala:181)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:130)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:130)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:236)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:130)
at org.apache.spark.metrics.MetricsSystem.registerSources(MetricsSystem.scala:181)
at org.apache.spark.metrics.MetricsSystem.start(MetricsSystem.scala:102)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:528)
at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:238)
at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
21/05/26 14:18:20 ERROR org.apache.spark.metrics.MetricsSystem: Sink class org.apache.spark.metrics.sink.GangliaSink cannot be instantiated
21/05/26 14:18:20 ERROR org.apache.spark.SparkContext: Error initializing SparkContext.
java.lang.ClassNotFoundException: org.apache.spark.metrics.sink.GangliaSink
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.util.Utils$.classForName(Utils.scala:239)
at org.apache.spark.metrics.MetricsSystem$$anonfun$registerSinks$1.apply(MetricsSystem.scala:200)
at org.apache.spark.metrics.MetricsSystem$$anonfun$registerSinks$1.apply(MetricsSystem.scala:196)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:130)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:130)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:236)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:130)
at org.apache.spark.metrics.MetricsSystem.registerSinks(MetricsSystem.scala:196)
at org.apache.spark.metrics.MetricsSystem.start(MetricsSystem.scala:104)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:528)
at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:238)
at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
This is strange, because AWS EMR should provide this dependency (org.apache.spark:spark-core_2.11:2.4.7), and I would expect the Spark distribution that ships with AWS EMR to be compiled with the Ganglia option. Forcing this jar via the --packages or --jars spark options does not help either.
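One way to narrow this down (a diagnostic sketch; the paths are assumed from the listing in Answer 1 and may differ per EMR release) is to check whether the classes are actually present in the jars on the cluster:

# Is the Ganglia sink class present in EMR's Ganglia jar?
jar tf /usr/lib/spark/external/lib/spark-ganglia-lgpl_*.jar | grep GangliaSink

# JvmSource ships in spark-core itself; check the core jar too.
jar tf /usr/lib/spark/jars/spark-core_*.jar | grep JvmSource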
If anyone has managed to get Ganglia working with Spark on AWS EMR, including driver/executor JVM monitoring, please let me know how.