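I am running a Spark job that queries a Hive table, performs some window-based operations on it, and then does a few joins. The job runs quite slowly even though there are only 36k records. It runs in local mode. I noticed that the Spark job logs repeated connections to Hive; the snippet below is repeated several times. I would like to understand why there are multiple connections and what the input split means. Thanks in advance.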
DAGScheduler: failed: Set()
18/07/20 02:36:35 INFO HadoopRDD: Input split: HBase table split(table name: tblHRData, scan: , start row: , end row: 20180411_execution_executionFile_201804113065, region location: ftrhdp31.fyre.ibm.com)
18/07/20 02:36:35 INFO RecoverableZooKeeper: Process identifier=hconnection-0x8049832 connecting to ZooKeeper ensemble=hdp31.fyre.xx.com:2181,hdp41.fyre.xx.com:2181
18/07/20 02:36:35 INFO ZooKeeper: Initiating client connection, connectString=hdp31.fyre.xx.com:2181,hdp41.fyre.xx.com:2181 sessionTimeout=90000 watcher=org.apache.hadoop.hbase.zookeeper.PendingWatcher@4b69883a
18/07/20 02:36:35 INFO ClientCnxn: Opening socket connection to server ftrhdp31.fyre.ibm.com/91.130.197.145:2181. Will not attempt to authenticate using SASL (unknown error)
18/07/20 02:36:35 INFO ClientCnxn: Socket connection established, initiating session, client: /91.130.197.145:49558, server: ftrhdp31.fyre.ibm.com/91.130.197.145:2181
18/07/20 02:36:35 INFO ClientCnxn: Session establishment complete on server ftrhdp31.fyre.ibm.com/91.130.197.145:2181, sessionid = 0x1648c0c9e5c07aa, negotiated timeout = 60000
18/07/20 02:36:35 INFO TableInputFormatBase: Input split length: 325 M bytes.
18/07/20 02:36:35 INFO CodeGenerator: Code generated in 11.897199 ms