Why does SparkContext create a large number of connections to the Hive Metastore and scan all databases?

Time: 2019-09-11 21:44:49

Tags: apache-spark hive hive-metastore

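The team that manages and monitors our Hadoop cluster has told us that our Spark jobs (several hundred of them) are having a negative impact on the cluster. Apparently, every job that creates a SparkContext also automatically opens a large number of connections to the Hive Metastore and "scans" all databases and tables (see the log below).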

After some testing, we found that this behavior appears as soon as a SparkContext is created with Spark 1.6 (/usr/hdp/current/spark-client). After switching to Spark 2 (/usr/hdp/current/spark2-client), we no longer see any connections to the Hive Metastore.

The code is as follows:

import os

from pyspark import SparkContext

# Point PySpark at the Spark 1.6 client shipped with HDP.
os.environ['SPARK_HOME'] = "/usr/hdp/current/spark-client"
# Note: spark-submit documents --total-executor-cores for standalone and Mesos only.
PYSPARK_SUBMIT_ARGS = " --conf spark.logConf=true --master yarn --driver-memory 25g --num-executors 50 --total-executor-cores 3 --executor-memory 25g pyspark-shell"
os.environ['PYSPARK_SUBMIT_ARGS'] = PYSPARK_SUBMIT_ARGS
sc = SparkContext(appName="my_app_name")

We use: HDP 2.5; Spark 1.6 (PySpark API); Hive 1.2.1; cluster manager: YARN

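Can you explain why this happens with Spark 1.6, and suggest possible ways to avoid all these connections to the Hive Metastore (other than switching to Spark 2)? Any pointers on where to dig would be much appreciated.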

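I have looked through the Hadoop and Hive configuration properties, but I cannot tell whether changing one or more of them would help us. I have also read about HiveSupport(), but I believe it is only available from version 2.0 onwards.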

  

2019-09-10 10:32:26,662 INFO [pool-7-thread-56114]: HiveMetaStore.audit (HiveMetaStore.java:logAuditEvent(319)) - ugi=@ ip= cmd=get_all_databases
2019-09-10 10:32:26,898 INFO [pool-7-thread-56114]: HiveMetaStore.audit (HiveMetaStore.java:logAuditEvent(319)) - ugi=@ ip= cmd=get_functions: db=db1 pat=*
2019-09-10 10:32:26,902 INFO [pool-7-thread-56114]: HiveMetaStore.audit (HiveMetaStore.java:logAuditEvent(319)) - ugi=@ ip= cmd=get_functions: db=db2 pat=*
2019-09-10 10:32:26,904 INFO [pool-7-thread-56114]: HiveMetaStore.audit (HiveMetaStore.java:logAuditEvent(319)) - ugi=@ ip= cmd=get_functions: db=db3 pat=*
2019-09-10 10:32:26,905 INFO [pool-7-thread-56114]: HiveMetaStore.audit (HiveMetaStore.java:logAuditEvent(319)) - ugi=@ ip= cmd=get_functions: db=db4 pat=*
2019-09-10 10:32:26,907 INFO [pool-7-thread-56114]: HiveMetaStore.audit (HiveMetaStore.java:logAuditEvent(319)) - ugi=@ ip= cmd=get_functions: db=db5 pat=*
2019-09-10 10:32:26,909 INFO [pool-7-thread-56114]: HiveMetaStore.audit (HiveMetaStore.java:logAuditEvent(319)) - ugi=@ ip= cmd=get_functions: db=db6 pat=*
2019-09-10 10:32:26,910 INFO [pool-7-thread-56114]: HiveMetaStore.audit (HiveMetaStore.java:logAuditEvent(319)) - ugi=@ ip= cmd=get_functions: db=db7 pat=*
2019-09-10 10:32:26,912 INFO [pool-7-thread-56114]: HiveMetaStore.audit (HiveMetaStore.java:logAuditEvent(319)) - ugi=@ ip= cmd=get_functions: db=db8 pat=*
2019-09-10 10:32:26,914 INFO [pool-7-thread-56114]: HiveMetaStore.audit (HiveMetaStore.java:logAuditEvent(319)) - ugi=@ ip= cmd=get_functions: db=db9 pat=*

     

...
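To quantify how much metastore traffic a single job causes, the audit log could be summarized with a small script. The following is only a minimal sketch: it assumes one audit entry per line in the `cmd=<name>[: db=<name> ...]` layout seen in the excerpt, and the function name `summarize_audit_lines` and the sample lines are hypothetical, not part of our setup.

```python
import re
from collections import Counter

# Matches the "cmd=..." part of a HiveMetaStore.audit line; the optional
# "db=<name>" is captured for commands such as get_functions.
AUDIT_RE = re.compile(r"cmd=(?P<cmd>\w+)(?::\s*db=(?P<db>\S+))?")


def summarize_audit_lines(lines):
    """Count metastore commands and collect the databases they touched."""
    commands = Counter()
    databases = set()
    for line in lines:
        # Skip anything that is not an audit entry.
        if "HiveMetaStore.audit" not in line:
            continue
        match = AUDIT_RE.search(line)
        if match is None:
            continue
        commands[match.group("cmd")] += 1
        if match.group("db"):
            databases.add(match.group("db"))
    return commands, databases


# Hypothetical sample in the same layout as the excerpt above.
sample = [
    "2019-09-10 10:32:26,662 INFO [pool-7-thread-56114]: HiveMetaStore.audit "
    "(HiveMetaStore.java:logAuditEvent(319)) - ugi=@ ip= cmd=get_all_databases",
    "2019-09-10 10:32:26,898 INFO [pool-7-thread-56114]: HiveMetaStore.audit "
    "(HiveMetaStore.java:logAuditEvent(319)) - ugi=@ ip= cmd=get_functions: db=db1 pat=*",
    "2019-09-10 10:32:26,902 INFO [pool-7-thread-56114]: HiveMetaStore.audit "
    "(HiveMetaStore.java:logAuditEvent(319)) - ugi=@ ip= cmd=get_functions: db=db2 pat=*",
]

commands, databases = summarize_audit_lines(sample)
for name, count in commands.most_common():
    print(name, count)
print(sorted(databases))
```

Running this over the metastore log for the window in which one job starts would show whether each SparkContext triggers one `get_functions` call per database, which is what the excerpt suggests.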

0 answers:

No answers