I'm running a series of queries on Hive on Spark from Python, using the impyla package's SQLAlchemy support. SQLAlchemy automatically creates and closes a dbapi cursor for each SQL statement it executes. Because the impyla HiveServer2Cursor implementation closes the underlying Hive session when the cursor closes, every SQL statement ends up running as a separate Spark job. I'd like to avoid the overhead of starting a new Spark job for each SQL statement, while still taking advantage of SQLAlchemy rather than the raw dbapi interface.
Reusing a dbapi cursor directly does work, but I'd still prefer to use a SQLAlchemy engine with its connection pooling and automatic cursor management.
# this version uses raw dbapi and only one cursor and therefore one hive session
from impala.dbapi import connect

con = connect(host='cdh-dn8.ec2.internal', port=10000, kerberos_service_name='hive', auth_mechanism='GSSAPI')
cur = con.cursor()
cur.execute('set hive.execution.engine=spark')
cur.execute("select * from reference.zipcode where zip = '55112'")
rows = cur.fetchall()
# use data from result and execute more queries ...
cur.close()
con.close()
# this version uses sqlalchemy and one cursor per statement executed, resulting in multiple hive sessions
from sqlalchemy import create_engine

sqlalchemyengine = create_engine(
    'impala://cdh-dn8.ec2.internal:10000',
    connect_args={'kerberos_service_name': 'hive', 'auth_mechanism': 'GSSAPI'},
)
conn = sqlalchemyengine.connect()
conn.execute('set hive.execution.engine=spark')
result = conn.execute("select * from reference.zipcode where zip = '55112'")
# use data from result and execute more queries ...
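One middle ground I'm considering: keep the engine's connection pooling, but use SQLAlchemy's raw_connection() to check a dbapi connection out of the pool and manage a single cursor on it myself, so a whole batch of statements shares one Hive session. A minimal sketch, reusing the engine defined above (this workaround is my own idea, not something impyla documents):

# the engine still pools connections, but one manually-managed cursor
# means one hive session (and one spark job context) for the whole batch
raw = sqlalchemyengine.raw_connection()  # dbapi connection from the pool
try:
    cur = raw.cursor()                   # one impyla cursor -> one hive session
    cur.execute('set hive.execution.engine=spark')
    cur.execute("select * from reference.zipcode where zip = '55112'")
    rows = cur.fetchall()
    # use data from result and execute more queries on the same cursor ...
    cur.close()
finally:
    raw.close()                          # returns the connection to the pool

Of course, this gives up the automatic cursor management I was hoping to keep.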
I'm wondering whether impyla has a good reason to open and close a Hive session per cursor, rather than closing the Hive session when the connection is closed.
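For comparison, with most dbapi drivers a session-level setting like this could be applied once per pooled connection using a connect event listener. A sketch, again assuming the engine above:

from sqlalchemy import event

@event.listens_for(sqlalchemyengine, 'connect')
def set_spark_engine(dbapi_conn, connection_record):
    # runs once for each new dbapi connection added to the pool
    cur = dbapi_conn.cursor()
    cur.execute('set hive.execution.engine=spark')
    cur.close()

But because impyla ties the Hive session to the cursor rather than the connection, the setting would die as soon as the listener's cursor closes, which is exactly the behavior I'm asking about.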