Using the command:
describe formatted my_table partition (my_partition)
we can list the metadata, including the HDFS location, of partition my_partition of my_table. But how can we get a two-column output:
Partition | Location
that lists all partitions of my_table along with their HDFS locations?
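(For reference, the brute-force route is one describe formatted call per partition, driven by show partitions. A minimal Python sketch of that loop, assuming only that the hive CLI is on the PATH; the answers below avoid the per-partition calls:)

import subprocess

def hive(query):
    # Run one HiveQL statement through the CLI and return stdout lines.
    result = subprocess.run(['hive', '-e', query],
                            capture_output=True, text=True)
    return result.stdout.splitlines()

# One 'describe formatted' call per partition: correct, but slow.
for part in hive('show partitions my_table'):
    part = part.strip()
    if not part:
        continue
    # part looks like dt=2017-06-10/type=A -> spec dt='2017-06-10', type='A'
    spec = ', '.join("{}='{}'".format(*kv.split('=', 1))
                     for kv in part.split('/'))
    for line in hive('describe formatted my_table partition ({})'.format(spec)):
        if line.strip().startswith('Location:'):
            print(part, '|', line.split()[-1])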
Answer 0 (score: 3)
Query the metastore.
Hive:
create table mytable (i int) partitioned by (dt date,type varchar(10))
;
alter table mytable add
partition (dt=date '2017-06-10',type='A')
partition (dt=date '2017-06-11',type='A')
partition (dt=date '2017-06-12',type='A')
partition (dt=date '2017-06-10',type='B')
partition (dt=date '2017-06-11',type='B')
partition (dt=date '2017-06-12',type='B')
;
Metastore (MySQL):
select p.part_name
      ,s.location
from metastore.DBS as d
join metastore.TBLS as t on t.db_id = d.db_id
join metastore.PARTITIONS as p on p.tbl_id = t.tbl_id
join metastore.SDS as s on s.sd_id = p.sd_id
where d.name = 'default'
  and t.tbl_name = 'mytable'
;
+----------------------+----------------------------------------------------------------------------------+
| part_name | location |
+----------------------+----------------------------------------------------------------------------------+
| dt=2017-06-10/type=A | hdfs://quickstart.cloudera:8020/user/hive/warehouse/mytable/dt=2017-06-10/type=A |
| dt=2017-06-11/type=A | hdfs://quickstart.cloudera:8020/user/hive/warehouse/mytable/dt=2017-06-11/type=A |
| dt=2017-06-12/type=A | hdfs://quickstart.cloudera:8020/user/hive/warehouse/mytable/dt=2017-06-12/type=A |
| dt=2017-06-10/type=B | hdfs://quickstart.cloudera:8020/user/hive/warehouse/mytable/dt=2017-06-10/type=B |
| dt=2017-06-11/type=B | hdfs://quickstart.cloudera:8020/user/hive/warehouse/mytable/dt=2017-06-11/type=B |
| dt=2017-06-12/type=B | hdfs://quickstart.cloudera:8020/user/hive/warehouse/mytable/dt=2017-06-12/type=B |
+----------------------+----------------------------------------------------------------------------------+
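The same join can be run programmatically. A minimal Python sketch using PyMySQL; the host, user, and password are placeholders, and direct read access to the metastore database is an assumption:

import pymysql  # pip install pymysql

# Placeholders: point these at your Hive metastore database.
conn = pymysql.connect(host='metastore-db.example.com', user='hive',
                       password='***', database='metastore')

SQL = """
select p.PART_NAME, s.LOCATION
from DBS d
join TBLS t on t.DB_ID = d.DB_ID
join PARTITIONS p on p.TBL_ID = t.TBL_ID
join SDS s on s.SD_ID = p.SD_ID
where d.NAME = %s and t.TBL_NAME = %s
"""

with conn.cursor() as cur:
    cur.execute(SQL, ('default', 'mytable'))
    for part_name, location in cur.fetchall():
        print(part_name, '|', location)

conn.close()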
Answer 1 (score: 0)
If you don't need the information in a nice tabular format, and you don't have access to the HMS database, you can run explain extended:
explain extended select * from default.mytable;
Then you can extract the essential information, the partition values and the location, from that output (a short parsing sketch follows the sample run below):
root@ubuntu:/home/sathya# hive -e "explain extended select * from default.mytable;" | grep location
OK
location hdfs://localhost:9000/user/hive/warehouse/mytable/dt=2017-06-10/type=A
location hdfs://localhost:9000/user/hive/warehouse/mytable
location hdfs://localhost:9000/user/hive/warehouse/mytable/dt=2017-06-10/type=B
location hdfs://localhost:9000/user/hive/warehouse/mytable
location hdfs://localhost:9000/user/hive/warehouse/mytable/dt=2017-06-11/type=A
location hdfs://localhost:9000/user/hive/warehouse/mytable
location hdfs://localhost:9000/user/hive/warehouse/mytable/dt=2017-06-11/type=B
location hdfs://localhost:9000/user/hive/warehouse/mytable
location hdfs://localhost:9000/user/hive/warehouse/mytable/dt=2017-06-12/type=A
location hdfs://localhost:9000/user/hive/warehouse/mytable
location hdfs://localhost:9000/user/hive/warehouse/mytable/dt=2017-06-12/type=B
location hdfs://localhost:9000/user/hive/warehouse/mytable
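A small sketch to deduplicate that grep output and keep only the partition-level paths; it assumes the same hive CLI and the default.mytable table from above, and relies on partition directories containing key=value segments while the bare table directory does not:

import subprocess

plan = subprocess.run(
    ['hive', '-e', 'explain extended select * from default.mytable;'],
    capture_output=True, text=True).stdout

seen = set()
for line in plan.splitlines():
    line = line.strip()
    if line.startswith('location'):
        path = line.split()[-1]
        # Keep partition directories (key=value segments); skip the table
        # directory, which the plan repeats once per partition.
        if '=' in path and path not in seen:
            seen.add(path)
            print(path.split('/mytable/', 1)[-1], '|', path)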
Answer 2 (score: 0)
From my point of view, the best solution is to get this information from the Hive metastore via the Thrift protocol.
If you write your code in Python, you can use the hmsclient library:
Hive CLI:
hive> create table test_table_with_partitions(f1 string, f2 int) partitioned by (dt string);
OK
Time taken: 0.127 seconds
hive> alter table test_table_with_partitions add partition(dt=20210504) partition(dt=20210505);
OK
Time taken: 0.152 seconds
Python shell:
>>> from hmsclient import hmsclient
>>> client = hmsclient.HMSClient(host='hive.metastore.location', port=9083)
>>> with client as c:
... all_partitions = c.get_partitions(db_name='default',
... tbl_name='test_table_with_partitions',
... max_parts=24 * 365 * 3)
...
>>> print([{'dt': part.values[0], 'location': part.sd.location} for part in all_partitions])
[{'dt': '20210504',
'location': 'hdfs://hdfs.master.host:8020/user/hive/warehouse/test_table_with_partitions/dt=20210504'},
{'dt': '20210505',
'location': 'hdfs://hdfs.master.host:8020/user/hive/warehouse/test_table_with_partitions/dt=20210505'}]
If you have Airflow installed together with the apache.hive extra, it is very easy to create the hmsclient using data from Airflow Connections:
hive_hook = HiveMetastoreHook()
with hive_hook.metastore as hive_client:
... your code goes here ...
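To make the pattern concrete, a minimal sketch of what the body could look like, reusing the hmsclient calls shown above; the Airflow 2.x provider import path and a configured metastore connection are assumptions:

from airflow.providers.apache.hive.hooks.hive import HiveMetastoreHook

hive_hook = HiveMetastoreHook()  # uses the 'metastore_default' connection
with hive_hook.metastore as hive_client:
    # hive_hook.metastore is an hmsclient client, so the same calls apply.
    parts = hive_client.get_partitions(db_name='default',
                                       tbl_name='test_table_with_partitions',
                                       max_parts=1000)
    print([(p.values, p.sd.location) for p in parts])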