How do I access Hive from Python?

Date: 2014-01-26 23:01:46

Tags: python hadoop hive

https://cwiki.apache.org/confluence/display/Hive/HiveClient#HiveClient-Python appears to be outdated.

When I add this to /etc/profile:

export PYTHONPATH=$PYTHONPATH:/usr/lib/hive/lib/py

Then I can do the imports listed in the link, except for from hive import ThriftHive, which actually needs to be:

from hive_service import ThriftHive

Next, the port in the example was 10000, which caused the program to hang when I tried it. The default Hive Thrift port is 9083, which stopped the hanging.

So I set it up like this:

from thrift import Thrift
from thrift.transport import TSocket
from thrift.transport import TTransport
from thrift.protocol import TBinaryProtocol
from hive_service import ThriftHive
try:
    transport = TSocket.TSocket('<node-with-metastore>', 9083)
    transport = TTransport.TBufferedTransport(transport)
    protocol = TBinaryProtocol.TBinaryProtocol(transport)
    client = ThriftHive.Client(protocol)
    transport.open()
    client.execute("CREATE TABLE test(c1 int)")

    transport.close()
except Thrift.TException, tx:
    print '%s' % (tx.message)

I receive the following error:

Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/hive/lib/py/hive_service/ThriftHive.py", line 68, in execute
self.recv_execute()
File "/usr/lib/hive/lib/py/hive_service/ThriftHive.py", line 84, in recv_execute
raise x
thrift.Thrift.TApplicationException: Invalid method name: 'execute'

But inspecting the ThriftHive.py file shows that the execute method does exist within the Client class.

How can I use Python to access Hive?

18 Answers:

Answer 0 (score: 39)

I believe the easiest way is to use PyHive.

To install you'll need these libraries:

pip install sasl
pip install thrift
pip install thrift-sasl
pip install PyHive

Please note that, although you install the library as PyHive, you import the module as pyhive, all lower-case.

If you're on Linux, you may need to install SASL separately before running the above. Install the package libsasl2-dev using apt-get or yum or whatever package manager your distribution uses. For Windows there are some options on GNU.org, and you can download a binary installer. On a Mac, SASL should be available if you've installed the xcode developer tools (xcode-select --install in Terminal).
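For example, on Debian/Ubuntu or a RHEL-style system the install commands would look roughly like the following (package names are those referenced here and in the pyhs2 answer below; adjust for your distribution):

sudo apt-get install libsasl2-dev
sudo yum install cyrus-sasl-devel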

After installation, you can connect to Hive like this:

from pyhive import hive
conn = hive.Connection(host="YOUR_HIVE_HOST", port=PORT, username="YOU")

Now that you have a hive connection, you have options for how to use it. You can just straight-up query:

cursor = conn.cursor()
cursor.execute("SELECT cool_stuff FROM hive_table")
for result in cursor.fetchall():
  use_result(result)

...or use the connection to make a Pandas dataframe:

import pandas as pd
df = pd.read_sql("SELECT cool_stuff FROM hive_table", conn)

Answer 1 (score: 24)

I assert that you are using HiveServer2, which is what makes the code not work.

You can use pyhs2 to access your Hive correctly, with this example code:

import pyhs2

with pyhs2.connect(host='localhost',
               port=10000,
               authMechanism="PLAIN",
               user='root',
               password='test',
               database='default') as conn:
    with conn.cursor() as cur:
        #Show databases
        print cur.getDatabases()

        #Execute query
        cur.execute("select * from table")

        #Return column info from query
        print cur.getSchema()

        #Fetch table results
        for i in cur.fetch():
            print i

Please be aware that you can install python-devel.x86_64 and cyrus-sasl-devel.x86_64 before installing pyhs2 with pip.

Hope this can help you.

Reference: https://cwiki.apache.org/confluence/display/Hive/Setting+Up+HiveServer2#SettingUpHiveServer2-PythonClientDriver

Answer 2 (score: 13)

The python program below should work to access hive tables from python:

import commands

cmd = "hive -S -e 'SELECT * FROM db_name.table_name LIMIT 1;' "

status, output = commands.getstatusoutput(cmd)

if status == 0:
   print output
else:
   print "error"

Answer 3 (score: 6)

You can use the hive library; for that, you want to import the Hive class with from hive import ThriftHive.

Try this example:

import sys

from hive import ThriftHive
from hive.ttypes import HiveServerException

from thrift import Thrift
from thrift.transport import TSocket
from thrift.transport import TTransport
from thrift.protocol import TBinaryProtocol

try:
  transport = TSocket.TSocket('localhost', 10000)
  transport = TTransport.TBufferedTransport(transport)
  protocol = TBinaryProtocol.TBinaryProtocol(transport)
  client = ThriftHive.Client(protocol)
  transport.open()
  client.execute("CREATE TABLE r(a STRING, b INT, c DOUBLE)")
  client.execute("LOAD DATA LOCAL INPATH '/path' INTO TABLE r")
  client.execute("SELECT * FROM r")
  while (1):
    row = client.fetchOne()
    if (row == None):
       break
    print row

  client.execute("SELECT * FROM r")
  print client.fetchAll()
  transport.close()
except Thrift.TException, tx:
  print '%s' % (tx.message)

Answer 4 (score: 5)

To connect using a username/password and specifying ports, the code looks like this:

from pyhive import presto

cursor = presto.connect(host='host.example.com',
                    port=8081,
                    username='USERNAME:PASSWORD').cursor()

sql = 'select * from table limit 10'

cursor.execute(sql)

print(cursor.fetchone())
print(cursor.fetchall())

Answer 5 (score: 4)

The examples above are a bit out of date. Here is a newer example:

import pyhs2 as hive
import getpass
DEFAULT_DB = 'default'
DEFAULT_SERVER = '10.37.40.1'
DEFAULT_PORT = 10000
DEFAULT_DOMAIN = 'PAM01-PRD01.IBM.COM'

u = raw_input('Enter PAM username: ')
s = getpass.getpass()
connection = hive.connect(host=DEFAULT_SERVER, port= DEFAULT_PORT, authMechanism='LDAP', user=u + '@' + DEFAULT_DOMAIN, password=s)
statement = "select * from user_yuti.Temp_CredCard where pir_post_dt = '2014-05-01' limit 100"
cur = connection.cursor()

cur.execute(statement)
df = cur.fetchall() 

In addition to the standard python program, a few libraries need to be installed to allow Python to build the connection to the Hadoop database.

1. Pyhs2, Python Hive Server 2 Client Driver

2. Sasl, Cyrus-SASL bindings for Python

3. Thrift, Python bindings for the Apache Thrift RPC system

4. PyHive, Python interface to Hive

Remember to change the permission of the executable

chmod +x test_hive2.py ./test_hive2.py

Hope it helps you. Reference: https://sites.google.com/site/tingyusz/home/blogs/hiveinpython

Answer 6 (score: 3)

Similar to eycheu's solution, but a little more detailed.

Here is an alternative solution specific to hive2 that does not require PyHive or installing system-wide packages. I am working in a linux environment where I do not have root access, so installing the SASL dependencies mentioned in Tristin's post was not an option for me:

  If you're on Linux, you may need to install SASL separately before running the above. Install the package libsasl2-dev using apt-get or yum or whatever package manager for your distribution.

Specifically, this solution focuses on leveraging the python package JayDeBeApi. In my experience, installing this one extra package on top of a python Anaconda 2.7 install was all I needed. This package leverages java (JDK), which I assume is already set up.

Step 1: Install JayDeBeApi

pip install jaydebeapi

Step 2: Download the appropriate drivers for your environment:

Store all .jar files in a directory. I will refer to this directory as /path/to/jar/files/.

Step 3: Identify your system's authentication mechanism:

In the pyhive solutions listed I've seen PLAIN listed as the authentication mechanism, as well as Kerberos. Note that your jdbc connection URL will depend on the authentication mechanism you are using. I will explain the Kerberos solution without passing a username/password. Here is more information on Kerberos authentication and options.

Create a Kerberos ticket if one has not already been created

$ kinit

Tickets can be viewed via klist.

You are now ready to make the connection via python:

import jaydebeapi
import glob
# Creates a list of jar files in the /path/to/jar/files/ directory
jar_files = glob.glob('/path/to/jar/files/*.jar')

host='localhost'
port='10000'
database='default'

# note: your driver will depend on your environment and drivers you've
# downloaded in step 2
# this is the driver for my environment (jdbc3, hive2, cloudera enterprise)
driver='com.cloudera.hive.jdbc3.HS2Driver'

conn_hive = jaydebeapi.connect(driver,
        'jdbc:hive2://'+host+':' +port+'/'+database+';AuthMech=1;KrbHostFQDN='+host+';KrbServiceName=hive'
                           ,jars=jar_files)

If you only care about reading, then you can read it directly into a pandas dataframe via eycheu's solution:

import pandas as pd
df = pd.read_sql("select * from table", conn_hive)

Otherwise, here is a more versatile communication option:

cursor = conn_hive.cursor()
sql_expression = "select * from table"
cursor.execute(sql_expression)
results = cursor.fetchall()

You can imagine, if you wanted to create a table, you would not need to "fetch" the results, but could instead just submit a create-table query.
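For example, a hedged sketch of submitting a DDL statement over the same conn_hive connection (the table name is a placeholder):

cursor = conn_hive.cursor()
# No fetch is needed for DDL; just execute the statement
cursor.execute("CREATE TABLE my_new_table (id INT, name STRING)")
cursor.close()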

Answer 7 (score: 3)

It is common practice to prohibit users from downloading and installing packages and libraries on cluster nodes. In this case the solutions from @python-starter and @goks work perfectly if hive runs on the same node. Otherwise, the beeline command-line tool can be used instead of hive. See details

#python 2
import commands

cmd = 'beeline -u "jdbc:hive2://node07.foo.bar:10000/...<your connect string>" -e "SELECT * FROM db_name.table_name LIMIT 1;"'

status, output = commands.getstatusoutput(cmd)

if status == 0:
   print output
else:
   print "error"

#python 3
import subprocess

cmd = 'beeline -u "jdbc:hive2://node07.foo.bar:10000/...<your connect string>" -e "SELECT * FROM db_name.table_name LIMIT 1;"'

status, output = subprocess.getstatusoutput(cmd)

if status == 0:
   print(output)
else:
   print("error")

Answer 8 (score: 3)

Similar to @python-starter's solution. But the commands package is not available on python3.x, so the alternative solution is to use subprocess in python3.x:

import subprocess

cmd = "hive -S -e 'SELECT * FROM db_name.table_name LIMIT 1;' "

status, output = subprocess.getstatusoutput(cmd)

if status == 0:
   print(output)
else:
   print("error")

Answer 9 (score: 2)

This can be a quick hack to connect hive and python:

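A minimal sketch of such a hack, assuming PyHive is installed (host and table name are placeholders):

from pyhive import hive

cursor = hive.Connection(host='YOUR_HIVE_HOST', port=10000, username='YOU').cursor()
cursor.execute('SELECT * FROM my_table LIMIT 5')
print(cursor.fetchall())  # fetchall() returns a list of tuples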

Output: a list of tuples

Answer 10 (score: 2)

pyhs2 is no longer maintained. A better alternative is impyla.

Don't be confused that some of the examples below are about Impala; just change the port to 10000 (the default) for HiveServer2, and it'll work the same way as with the Impala examples. It's the same protocol (Thrift) that is used for both Impala and Hive.

https://github.com/cloudera/impyla

It has more features than pyhs2; for example, it has Kerberos authentication, which is a must for us.

from impala.dbapi import connect
conn = connect(host='my.host.com', port=10000)
cursor = conn.cursor()
cursor.execute('SELECT * FROM mytable LIMIT 100')
print cursor.description  # prints the result set's schema
results = cursor.fetchall()

##
cursor.execute('SELECT * FROM mytable LIMIT 100')
for row in cursor:
    process(row)
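Since Kerberos was mentioned above, here is a hedged sketch of a Kerberos connection with impyla (auth_mechanism and kerberos_service_name are impyla connect() options; the host and service name are placeholders, and the sasl/thrift_sasl packages are assumed to be installed):

from impala.dbapi import connect

# Kerberos (GSSAPI) authentication against HiveServer2
conn = connect(host='my.host.com', port=10000,
               auth_mechanism='GSSAPI',
               kerberos_service_name='hive')
cursor = conn.cursor()
cursor.execute('SELECT * FROM mytable LIMIT 100')
print(cursor.fetchall())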

Cloudera is now putting more effort into the hs2 client https://github.com/cloudera/hs2client, which is a C/C++ HiveServer2/Impala client. It might be a better option if you push a lot of data to/from python. (It has Python bindings too - https://github.com/cloudera/hs2client/tree/master/python)

Some more information on impyla:

Answer 11 (score: 2)

This is a generic approach, which is easy for me because I keep connecting to several servers (SQL, Teradata, Hive, etc.) from python. Hence, I use the pyodbc connector. Here are some basic steps to get going with pyodbc (in case you have never used it):

  • Pre-requisite: You should have the relevant ODBC connection set up on your Windows machine before you follow the steps below. In case you don't have it, find the same here

Once that is done: Step 1. pip install: pip install pyodbc (here's the link to download the relevant driver from Microsoft's website)

Step 2. Now, import the same in your python script:

import pyodbc

Step 3. Finally, go ahead and give the connection details as follows:

conn_hive = pyodbc.connect('DSN=YOUR_DSN_NAME;SERVER=YOUR_SERVER_NAME;UID=USER_ID;PWD=PSWD')

The best part of using pyodbc is that I have to import just one package to connect to almost any data source.
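A hedged sketch of querying through the pyodbc connection created above (the table name is a placeholder):

cursor = conn_hive.cursor()
cursor.execute("SELECT * FROM db_name.table_name LIMIT 10")
rows = cursor.fetchall()

# ...or load the result straight into a pandas dataframe
import pandas as pd
df = pd.read_sql("SELECT * FROM db_name.table_name LIMIT 10", conn_hive)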

Answer 12 (score: 1)

You can use the python JayDeBeApi package to create a DB-API connection from the Hive or Impala JDBC driver, and then pass the connection to the pandas.read_sql function to return the data in a pandas dataframe.

import jaydebeapi
# Apparently need to load the jar files for the first time for impala jdbc driver to work 
conn = jaydebeapi.connect('com.cloudera.hive.jdbc41.HS2Driver',
['jdbc:hive2://host:10000/db;AuthMech=1;KrbHostFQDN=xxx.com;KrbServiceName=hive;KrbRealm=xxx.COM', "",""],
jars=['/hadp/opt/jdbc/hive_jdbc_2.5.18.1050/2.5.18.1050 GA/Cloudera_HiveJDBC41_2.5.18.1050/HiveJDBC41.jar',
'/hadp/opt/jdbc/hive_jdbc_2.5.18.1050/2.5.18.1050 GA/Cloudera_HiveJDBC41_2.5.18.1050/TCLIServiceClient.jar',
'/hadp/opt/jdbc/hive_jdbc_2.5.18.1050/2.5.18.1050 GA/Cloudera_HiveJDBC41_2.5.18.1050/commons-codec-1.3.jar',
'/hadp/opt/jdbc/hive_jdbc_2.5.18.1050/2.5.18.1050 GA/Cloudera_HiveJDBC41_2.5.18.1050/commons-logging-1.1.1.jar',
'/hadp/opt/jdbc/hive_jdbc_2.5.18.1050/2.5.18.1050 GA/Cloudera_HiveJDBC41_2.5.18.1050/hive_metastore.jar',
'/hadp/opt/jdbc/hive_jdbc_2.5.18.1050/2.5.18.1050 GA/Cloudera_HiveJDBC41_2.5.18.1050/hive_service.jar',
'/hadp/opt/jdbc/hive_jdbc_2.5.18.1050/2.5.18.1050 GA/Cloudera_HiveJDBC41_2.5.18.1050/httpclient-4.1.3.jar',
'/hadp/opt/jdbc/hive_jdbc_2.5.18.1050/2.5.18.1050 GA/Cloudera_HiveJDBC41_2.5.18.1050/httpcore-4.1.3.jar',
'/hadp/opt/jdbc/hive_jdbc_2.5.18.1050/2.5.18.1050 GA/Cloudera_HiveJDBC41_2.5.18.1050/libfb303-0.9.0.jar',
'/hadp/opt/jdbc/hive_jdbc_2.5.18.1050/2.5.18.1050 GA/Cloudera_HiveJDBC41_2.5.18.1050/libthrift-0.9.0.jar',
'/hadp/opt/jdbc/hive_jdbc_2.5.18.1050/2.5.18.1050 GA/Cloudera_HiveJDBC41_2.5.18.1050/log4j-1.2.14.jar',
'/hadp/opt/jdbc/hive_jdbc_2.5.18.1050/2.5.18.1050 GA/Cloudera_HiveJDBC41_2.5.18.1050/ql.jar',
'/hadp/opt/jdbc/hive_jdbc_2.5.18.1050/2.5.18.1050 GA/Cloudera_HiveJDBC41_2.5.18.1050/slf4j-api-1.5.11.jar',
'/hadp/opt/jdbc/hive_jdbc_2.5.18.1050/2.5.18.1050 GA/Cloudera_HiveJDBC41_2.5.18.1050/slf4j-log4j12-1.5.11.jar',
'/hadp/opt/jdbc/hive_jdbc_2.5.18.1050/2.5.18.1050 GA/Cloudera_HiveJDBC41_2.5.18.1050/zookeeper-3.4.6.jar',
'/hadp/opt/jdbc/impala_jdbc_2.5.35/2.5.35.1055 GA/Cloudera_ImpalaJDBC41_2.5.35/ImpalaJDBC41.jar',
'/hadp/opt/jdbc/impala_jdbc_2.5.35/2.5.35.1055 GA/Cloudera_ImpalaJDBC41_2.5.35/TCLIServiceClient.jar',
'/hadp/opt/jdbc/impala_jdbc_2.5.35/2.5.35.1055 GA/Cloudera_ImpalaJDBC41_2.5.35/commons-codec-1.3.jar',
'/hadp/opt/jdbc/impala_jdbc_2.5.35/2.5.35.1055 GA/Cloudera_ImpalaJDBC41_2.5.35/commons-logging-1.1.1.jar',
'/hadp/opt/jdbc/impala_jdbc_2.5.35/2.5.35.1055 GA/Cloudera_ImpalaJDBC41_2.5.35/hive_metastore.jar',
'/hadp/opt/jdbc/impala_jdbc_2.5.35/2.5.35.1055 GA/Cloudera_ImpalaJDBC41_2.5.35/hive_service.jar',
'/hadp/opt/jdbc/impala_jdbc_2.5.35/2.5.35.1055 GA/Cloudera_ImpalaJDBC41_2.5.35/httpclient-4.1.3.jar',
'/hadp/opt/jdbc/impala_jdbc_2.5.35/2.5.35.1055 GA/Cloudera_ImpalaJDBC41_2.5.35/httpcore-4.1.3.jar',
'/hadp/opt/jdbc/impala_jdbc_2.5.35/2.5.35.1055 GA/Cloudera_ImpalaJDBC41_2.5.35/libfb303-0.9.0.jar',
'/hadp/opt/jdbc/impala_jdbc_2.5.35/2.5.35.1055 GA/Cloudera_ImpalaJDBC41_2.5.35/libthrift-0.9.0.jar',
'/hadp/opt/jdbc/impala_jdbc_2.5.35/2.5.35.1055 GA/Cloudera_ImpalaJDBC41_2.5.35/log4j-1.2.14.jar',
'/hadp/opt/jdbc/impala_jdbc_2.5.35/2.5.35.1055 GA/Cloudera_ImpalaJDBC41_2.5.35/ql.jar',
'/hadp/opt/jdbc/impala_jdbc_2.5.35/2.5.35.1055 GA/Cloudera_ImpalaJDBC41_2.5.35/slf4j-api-1.5.11.jar',
'/hadp/opt/jdbc/impala_jdbc_2.5.35/2.5.35.1055 GA/Cloudera_ImpalaJDBC41_2.5.35/slf4j-log4j12-1.5.11.jar',
'/hadp/opt/jdbc/impala_jdbc_2.5.35/2.5.35.1055 GA/Cloudera_ImpalaJDBC41_2.5.35/zookeeper-3.4.6.jar'
])

# the previous call have initialized the jar files, technically this call needs not include the required jar files
impala_conn = jaydebeapi.connect('com.cloudera.impala.jdbc41.Driver',
['jdbc:impala://host:21050/db;AuthMech=1;KrbHostFQDN=xxx.com;KrbServiceName=impala;KrbRealm=xxx.COM',"",""],
jars=['/hadp/opt/jdbc/hive_jdbc_2.5.18.1050/2.5.18.1050 GA/Cloudera_HiveJDBC41_2.5.18.1050/HiveJDBC41.jar',
'/hadp/opt/jdbc/hive_jdbc_2.5.18.1050/2.5.18.1050 GA/Cloudera_HiveJDBC41_2.5.18.1050/TCLIServiceClient.jar',
'/hadp/opt/jdbc/hive_jdbc_2.5.18.1050/2.5.18.1050 GA/Cloudera_HiveJDBC41_2.5.18.1050/commons-codec-1.3.jar',
'/hadp/opt/jdbc/hive_jdbc_2.5.18.1050/2.5.18.1050 GA/Cloudera_HiveJDBC41_2.5.18.1050/commons-logging-1.1.1.jar',
'/hadp/opt/jdbc/hive_jdbc_2.5.18.1050/2.5.18.1050 GA/Cloudera_HiveJDBC41_2.5.18.1050/hive_metastore.jar',
'/hadp/opt/jdbc/hive_jdbc_2.5.18.1050/2.5.18.1050 GA/Cloudera_HiveJDBC41_2.5.18.1050/hive_service.jar',
'/hadp/opt/jdbc/hive_jdbc_2.5.18.1050/2.5.18.1050 GA/Cloudera_HiveJDBC41_2.5.18.1050/httpclient-4.1.3.jar',
'/hadp/opt/jdbc/hive_jdbc_2.5.18.1050/2.5.18.1050 GA/Cloudera_HiveJDBC41_2.5.18.1050/httpcore-4.1.3.jar',
'/hadp/opt/jdbc/hive_jdbc_2.5.18.1050/2.5.18.1050 GA/Cloudera_HiveJDBC41_2.5.18.1050/libfb303-0.9.0.jar',
'/hadp/opt/jdbc/hive_jdbc_2.5.18.1050/2.5.18.1050 GA/Cloudera_HiveJDBC41_2.5.18.1050/libthrift-0.9.0.jar',
'/hadp/opt/jdbc/hive_jdbc_2.5.18.1050/2.5.18.1050 GA/Cloudera_HiveJDBC41_2.5.18.1050/log4j-1.2.14.jar',
'/hadp/opt/jdbc/hive_jdbc_2.5.18.1050/2.5.18.1050 GA/Cloudera_HiveJDBC41_2.5.18.1050/ql.jar',
'/hadp/opt/jdbc/hive_jdbc_2.5.18.1050/2.5.18.1050 GA/Cloudera_HiveJDBC41_2.5.18.1050/slf4j-api-1.5.11.jar',
'/hadp/opt/jdbc/hive_jdbc_2.5.18.1050/2.5.18.1050 GA/Cloudera_HiveJDBC41_2.5.18.1050/slf4j-log4j12-1.5.11.jar',
'/hadp/opt/jdbc/hive_jdbc_2.5.18.1050/2.5.18.1050 GA/Cloudera_HiveJDBC41_2.5.18.1050/zookeeper-3.4.6.jar',
'/hadp/opt/jdbc/impala_jdbc_2.5.35/2.5.35.1055 GA/Cloudera_ImpalaJDBC41_2.5.35/ImpalaJDBC41.jar',
'/hadp/opt/jdbc/impala_jdbc_2.5.35/2.5.35.1055 GA/Cloudera_ImpalaJDBC41_2.5.35/TCLIServiceClient.jar',
'/hadp/opt/jdbc/impala_jdbc_2.5.35/2.5.35.1055 GA/Cloudera_ImpalaJDBC41_2.5.35/commons-codec-1.3.jar',
'/hadp/opt/jdbc/impala_jdbc_2.5.35/2.5.35.1055 GA/Cloudera_ImpalaJDBC41_2.5.35/commons-logging-1.1.1.jar',
'/hadp/opt/jdbc/impala_jdbc_2.5.35/2.5.35.1055 GA/Cloudera_ImpalaJDBC41_2.5.35/hive_metastore.jar',
'/hadp/opt/jdbc/impala_jdbc_2.5.35/2.5.35.1055 GA/Cloudera_ImpalaJDBC41_2.5.35/hive_service.jar',
'/hadp/opt/jdbc/impala_jdbc_2.5.35/2.5.35.1055 GA/Cloudera_ImpalaJDBC41_2.5.35/httpclient-4.1.3.jar',
'/hadp/opt/jdbc/impala_jdbc_2.5.35/2.5.35.1055 GA/Cloudera_ImpalaJDBC41_2.5.35/httpcore-4.1.3.jar',
'/hadp/opt/jdbc/impala_jdbc_2.5.35/2.5.35.1055 GA/Cloudera_ImpalaJDBC41_2.5.35/libfb303-0.9.0.jar',
'/hadp/opt/jdbc/impala_jdbc_2.5.35/2.5.35.1055 GA/Cloudera_ImpalaJDBC41_2.5.35/libthrift-0.9.0.jar',
'/hadp/opt/jdbc/impala_jdbc_2.5.35/2.5.35.1055 GA/Cloudera_ImpalaJDBC41_2.5.35/log4j-1.2.14.jar',
'/hadp/opt/jdbc/impala_jdbc_2.5.35/2.5.35.1055 GA/Cloudera_ImpalaJDBC41_2.5.35/ql.jar',
'/hadp/opt/jdbc/impala_jdbc_2.5.35/2.5.35.1055 GA/Cloudera_ImpalaJDBC41_2.5.35/slf4j-api-1.5.11.jar',
'/hadp/opt/jdbc/impala_jdbc_2.5.35/2.5.35.1055 GA/Cloudera_ImpalaJDBC41_2.5.35/slf4j-log4j12-1.5.11.jar',
'/hadp/opt/jdbc/impala_jdbc_2.5.35/2.5.35.1055 GA/Cloudera_ImpalaJDBC41_2.5.35/zookeeper-3.4.6.jar'
])

import pandas as pd
df1 = pd.read_sql("SELECT * FROM tablename", conn)
df2 = pd.read_sql("SELECT * FROM tablename", impala_conn)

conn.close()
impala_conn.close()

Answer 13 (score: 1)

The easiest way is to use PyHive.

To install you'll need these libraries:

pip install sasl
pip install thrift
pip install thrift-sasl
pip install PyHive

After installation, you can connect to Hive like this:

from pyhive import hive
conn = hive.Connection(host="YOUR_HIVE_HOST", port=PORT, username="YOU")

Now that you have the hive connection, you have options for how to use it. You can just straight-up query:

cursor = conn.cursor()
cursor.execute("SELECT cool_stuff FROM hive_table")
for result in cursor.fetchall():
  use_result(result)

...or use the connection to make a Pandas dataframe:

import pandas as pd
df = pd.read_sql("SELECT cool_stuff FROM hive_table", conn)

Answer 14 (score: 0)

I solved the same problem as you. Here is my operating environment (System: Linux, Version: python 3.6, Package: PyHive). Please refer to my answer below:

from pyhive import hive
conn = hive.Connection(host='149.129.***.**', port=10000, username='*', database='*',password="*",auth='LDAP')

The key is to add the password and auth parameters, and to set auth to 'LDAP'. With that it works fine; any questions, please let me know.
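A hedged example of running a query over this connection (the table name is a placeholder):

cursor = conn.cursor()
cursor.execute("SELECT * FROM my_table LIMIT 10")
print(cursor.fetchall())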

Answer 15 (score: 0)

By using the Python client driver

pip install pyhs2

then

import pyhs2

with pyhs2.connect(host='localhost',
               port=10000,
               authMechanism="PLAIN",
               user='root',
               password='test',
               database='default') as conn:
    with conn.cursor() as cur:
        #Show databases
        print cur.getDatabases()

        #Execute query
        cur.execute("select * from table")

        #Return column info from query
        print cur.getSchema()

        #Fetch table results
        for i in cur.fetch():
            print i

Reference: https://cwiki.apache.org/confluence/display/Hive/Setting+Up+HiveServer2#SettingUpHiveServer2-PythonClientDriver

Answer 16 (score: 0)

None of the answers demonstrate how to fetch and print the table headers. This is a modified version of the standard example from PyHive, which is widely used and actively maintained.

from pyhive import hive
cursor = hive.connect(host="localhost", 
                      port=10000, 
                      username="shadan", 
                      auth="KERBEROS", 
                      kerberos_service_name="hive"
                      ).cursor()
cursor.execute("SELECT * FROM my_dummy_table LIMIT 10")
columnList = [desc[0] for desc in cursor.description]
headerStr = ",".join(columnList)
headerTuple = tuple(headerStr.split(","))
print(headerTuple)
print(cursor.fetchone())
print(cursor.fetchall())

Answer 17 (score: 0)

The easiest way | Using sqlalchemy

Requirements:

  • pip install pyhive

Code:

import pandas as pd
from sqlalchemy import create_engine

SECRET = {'username':'lol', 'password': 'lol'}
user_name = SECRET.get('username')
passwd = SECRET.get('password')

host_server = 'x.x.x.x'
port = '10000'
database = 'default'
conn = f'hive://{user_name}:{passwd}@{host_server}:{port}/{database}'
engine = create_engine(conn, connect_args={'auth': 'LDAP'})

query = "select * from tablename limit 100"
data = pd.read_sql(query, con=engine)
print(data)