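I have Flink installed and jobs running on EMR, and I am now trying to add monitoring by sending metrics to Prometheus.
I have run into a problem running Flink on EMR. I set up EMR with Terraform (and run Ansible afterwards to download and run the job). Out of the box, the EMR distribution of Flink does not appear to include the optional jars (flink-metrics-prometheus, flink-cep, etc.).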
The Flink documentation says:
"In order to use this reporter you must copy /opt/flink-metrics-prometheus-1.6.1.jar into the /lib folder of your Flink distribution."
https://ci.apache.org/projects/flink/flink-docs-release-1.6/monitoring/metrics.html#prometheuspushgateway-orgapacheflinkmetricsprometheusprometheuspushgatewayreporter
However, when I log in to the EMR master node, neither /etc/flink nor /usr/lib/flink contains a directory named opt, and I cannot find flink-metrics-prometheus-1.6.1.jar anywhere.
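A quick way to confirm whether the reporter jar exists anywhere under a set of directories is a filename search. The following is a minimal sketch; the function name find_prometheus_jar is hypothetical, and the directories passed to it are only the ones mentioned in this question.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every Flink Prometheus reporter jar found
# under the directories given as arguments (any version suffix).
find_prometheus_jar() {
  # Permission errors are silenced so that only matching paths are printed.
  find "$@" -type f -name 'flink-metrics-prometheus-*.jar' 2>/dev/null
}

# Example (run on the EMR node):
#   find_prometheus_jar /usr/lib/flink /etc/flink
```

If this prints nothing, the jar is not present under those directories and would have to be supplied some other way.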
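I know Flink has other optional libraries, such as flink-cep, that you normally have to copy if you want to use them, but I am not sure how to do that on EMR.
This is the exception I get. I believe it occurs because the metrics jar cannot be found on the classpath.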
java.lang.ClassNotFoundException: org.apache.flink.metrics.prometheus.PrometheusPushGatewayReporter
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:264)
at org.apache.flink.runtime.metrics.MetricRegistryImpl.<init>(MetricRegistryImpl.java:144)
at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.createMetricRegistry(ClusterEntrypoint.java:419)
at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.initializeServices(ClusterEntrypoint.java:276)
at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runCluster(ClusterEntrypoint.java:227)
at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.lambda$startCluster$0(ClusterEntrypoint.java:191)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:190)
at org.apache.flink.yarn.entrypoint.YarnSessionClusterEntrypoint.main(YarnSessionClusterEntrypoint.java:137)
EMR resource (Terraform):
  name          = "ce-emr-flink-arn"
  release_label = "emr-5.20.0" # 5.21.0 is not found, could be a region thing
  applications  = ["Flink"]
  ec2_attributes {
    key_name                          = "ce_test"
    subnet_id                         = "${aws_subnet.ce_test_subnet_public.id}"
    instance_profile                  = "${aws_iam_instance_profile.emr_profile.arn}"
    emr_managed_master_security_group = "${aws_security_group.allow_all_vpc.id}"
    emr_managed_slave_security_group  = "${aws_security_group.allow_all_vpc.id}"
    additional_master_security_groups = "${aws_security_group.external_connectivity.id}"
    additional_slave_security_groups  = "${aws_security_group.external_connectivity.id}"
  }
  ebs_root_volume_size = 100
  master_instance_type = "m4.xlarge"
  core_instance_type   = "m4.xlarge"
  core_instance_count  = 2
  service_role         = "${aws_iam_role.iam_emr_service_role.arn}"
  configurations_json = <<EOF
[
  {
    "Classification": "flink-conf",
    "Properties": {
      "parallelism.default": "8",
      "state.backend": "RocksDB",
      "state.backend.async": "true",
      "state.backend.incremental": "true",
      "state.savepoints.dir": "file:///savepoints",
      "state.checkpoints.dir": "file:///checkpoints",
      "web.submit.enable": "true",
      "metrics.reporter.promgateway.class": "org.apache.flink.metrics.prometheus.PrometheusPushGatewayReporter",
      "metrics.reporter.promgateway.host": "${aws_instance.monitoring.private_ip}",
      "metrics.reporter.promgateway.port": "9091",
      "metrics.reporter.promgateway.jobName": "ce-test",
      "metrics.reporter.promgateway.randomJobNameSuffix": "true",
      "metrics.reporter.promgateway.deleteOnShutdown": "false"
    }
  }
]
EOF
}
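I suspect I may need to download the jar during the bootstrap phase, but I wanted to check first whether there are any examples of doing this.
Answer 0 (score: 1)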
I haven't used Terraform, but note that you typically need to provision (set up the jars) on both the master and the slave nodes in EMR. One way to figure out where EMR thinks the jars should go is to log in to a slave node while a job is running, run ps auxwww | grep java, find the process, look at which jars are added to the classpath at startup, and then locate them on the server. At least that has worked for me in the past.
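The classpath inspection described above can be sketched as a small filter. This is an illustration only: the function name list_classpath_entries is hypothetical, and it assumes the JVM was started with an explicit -cp or -classpath argument. If the classpath is supplied through the CLASSPATH environment variable instead, it will not appear in the ps output and this filter prints nothing.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read java command lines on stdin and print each
# entry of the -cp / -classpath argument on its own line.
list_classpath_entries() {
  # Put every whitespace-separated token on its own line, then split the
  # token that follows -cp or -classpath on ':'.
  tr -s ' ' '\n' | awk '
    prev == "-cp" || prev == "-classpath" {
      n = split($0, parts, ":")
      for (i = 1; i <= n; i++) if (parts[i] != "") print parts[i]
    }
    { prev = $0 }
  '
}

# Example (run on a node while the job is running):
#   ps auxwww | grep '[j]ava' | list_classpath_entries | grep -i flink
```

The '[j]ava' pattern keeps the grep process itself out of the results. Paths containing spaces are not handled by this sketch.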
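Answer 1 (score: 0)
I went with EMR release emr-5.24.0 and used the InfluxDB reporter .jar for monitoring.
I copied the .jar file into the /usr/lib/flink/lib folder and restarted the Flink cluster with sudo using /usr/lib/flink/bin/stop-cluster.sh && /usr/lib/flink/bin/start-cluster.sh.
I think the same steps should solve the Prometheus problem; the reporter jars are under /usr/lib/flink/opt: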
[ec2-user@ip-10-0-11-17 ~]$ cd /usr/lib/flink/opt/flink-metrics-
flink-metrics-datadog-1.8.0.jar flink-metrics-influxdb-1.8.0.jar flink-metrics-slf4j-1.8.0.jar
flink-metrics-graphite-1.8.0.jar flink-metrics-prometheus-1.8.0.jar flink-metrics-statsd-1.8.0.jar
[ec2-user@ip-10-0-11-17 ~]$ ll /usr/lib/flink/opt/flink-metrics-prometheus-1.8.0.jar
-rw-r--r-- 1 root root 101984 may 14 19:21 /usr/lib/flink/opt/flink-metrics-prometheus-1.8.0.jar
[ec2-user@ip-10-0-11-17 ~]$ uname -a
Linux ip-10-0-11-17 4.14.114-83.126.amzn1.x86_64 #1 SMP Tue May 7 02:26:58 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
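The copy step for the Prometheus reporter can be sketched as follows. This is a minimal sketch, not a tested EMR recipe: the function name install_prometheus_reporter is hypothetical, the default path /usr/lib/flink is taken from the listing above, and it assumes the jar is already present under opt/ on the node. Whether it can run as a bootstrap action depends on Flink already being installed at that point, which is not confirmed here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: copy the Prometheus reporter jar from Flink's opt/
# directory into lib/. The first argument is the Flink home directory and
# defaults to /usr/lib/flink (the path shown in the listing above).
install_prometheus_reporter() {
  local flink_home="${1:-/usr/lib/flink}"
  local jar
  local status=1   # stays 1 (failure) until at least one jar is copied
  for jar in "$flink_home"/opt/flink-metrics-prometheus-*.jar; do
    # An unmatched glob is left as a literal string, so check the file exists.
    [ -f "$jar" ] || continue
    cp "$jar" "$flink_home/lib/" || return 1
    echo "copied $(basename "$jar")"
    status=0
  done
  return "$status"
}

# Example (run with root privileges on every node, then restart Flink):
#   install_prometheus_reporter /usr/lib/flink
```

The function returns a non-zero status when no matching jar is found, so a provisioning script can fail early instead of starting Flink with a reporter class that is not on the classpath.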