How do I get the minimum and maximum partition values of a Hive table from the CLI?

Asked: 2021-07-29 13:58:37

Tags: bash shell awk hive

I have various tables in Hive, with anywhere from 0 to 4 partition columns.

Below is the HDFS layout of a few such tables, one for each partition count from 0 to 4.

-- type-0 <no partitions>
hdfs://ns/user/abc/warehouse/test_db/test_tbl_0/__SNAPPY.gz


-- type-1 <1 partition column in table  = dt>
hdfs://ns/user/abc/warehouse/test_db_a/test_tbl_1/dt=2020-11-14/__SNAPPY.gz
...
hdfs://ns/user/abc/warehouse/test_db_a/test_tbl_1/dt=2020-11-30/__SNAPPY.gz
hdfs://ns/user/abc/warehouse/test_db_a/test_tbl_1/dt=2020-12-16/__SNAPPY.gz


-- type-2 <2 partition columns in table = dt, hh>
hdfs://ns/user/abc/warehouse/test_db_b/test_tbl_2/dt=2020-11-14/hh=01/__SNAPPY.gz
...
hdfs://ns/user/abc/warehouse/test_db_b/test_tbl_2/dt=2020-11-15/hh=02/__SNAPPY.gz
hdfs://ns/user/abc/warehouse/test_db_b/test_tbl_2/dt=2020-12-19/hh=03/__SNAPPY.gz


-- type-3 <3 partition columns in table = client, dt, hh>
hdfs://ns/user/abc/warehouse/test_db_c/test_tbl_3/client=cobra/dt=2020-11-14/hh=01/__SNAPPY.gz
hdfs://ns/user/abc/warehouse/test_db_c/test_tbl_3/client=cobra/dt=2020-11-29/hh=01/__SNAPPY.gz
...
hdfs://ns/user/abc/warehouse/test_db_c/test_tbl_3/client=cobra/dt=2020-12-20/hh=04/__SNAPPY.gz


-- type-4 <4 partition columns in table = service, geo, dt, hh>
hdfs://ns/user/abc/warehouse/test_db_d/test_tbl_4/service=mobile/geo=us/dt=2020-11-14/hh=01/__SNAPPY.gz   
hdfs://ns/user/abc/warehouse/test_db_d/test_tbl_4/service=mobile/geo=us/dt=2020-11-20/hh=01/__SNAPPY.gz
...
hdfs://ns/user/abc/warehouse/test_db_d/test_tbl_4/service=mobile/geo=us/dt=2020-12-13/hh=21/__SNAPPY.gz

Expected output for types 0 to 4 (added at markp-fuso's request):

DBName     TableName   MIN_PARTITION(s)                           MAX_PARTITION(s)
test_db    test_tbl_0
test_db_a  test_tbl_1  dt=2020-11-14                              dt=2020-12-16
test_db_b  test_tbl_2  dt=2020-11-14/hh=01                        dt=2020-12-19/hh=03
test_db_c  test_tbl_3  client=cobra/dt=2020-11-14/hh=01           client=cobra/dt=2020-12-20/hh=04
test_db_d  test_tbl_4  service=mobile/geo=us/dt=2020-11-14/hh=01  service=mobile/geo=us/dt=2020-12-13/hh=21

Here is what I tried for type-2:

## Get the minimum and maximum partition lines
## (grep -v drops hdfs summary lines such as 'Found 20 items')
hdfs dfs -ls 'hdfs://ns/user/abc/warehouse/test_db_b/test_tbl_2/*' | grep -v '^Found' | sort -k6,7 | awk '{print $8}' | (head -n1 && tail -n1)
/*
hdfs://ns/user/abc/warehouse/test_db_b/test_tbl_2/dt=2019-03-12/hh=00
hdfs://ns/user/abc/warehouse/test_db_b/test_tbl_2/dt=2021-07-28/hh=22
*/

## Here I try to simplify the output further
hdfs dfs -ls 'hdfs://ns/user/abc/warehouse/test_db_b/test_tbl_2/*' | grep -v '^Found' | sort -k6,7 | awk '{print $8}' | (head -n1 && tail -n1) | awk -F'/' '{print $(NF-2),$(NF-1),$NF}' | sed ':a;N;$!ba;s/\n/ /g'
/*
test_tbl_2 dt=2019-03-12 hh=00  test_tbl_2 dt=2021-07-28 hh=22
*/

As shown above, I get output in the following format:

TableName  MIN_PARTITION(s)  TableName  MAX_PARTITION(s)   

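I have also only tested the approach above on a table with 2 partition columns. Is there a generic bash hack that gives me the format below, regardless of how many partition columns a table has?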

DBName  TableName  MIN_PARTITION(s) MAX_PARTITION(s)  

1 answer:

Answer 0 (score: 1)

Update: the question has been updated with more sample input and the matching (desired) output.

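Assumptions:

  • Input for a given db/table pair is on consecutive lines, so output can be generated as soon as the input for that pair is exhausted (otherwise we would need to store all the data in memory, e.g. in arrays, and print everything once the whole input stream has been consumed)
  • The output format has 4 columns: DBName TableName MinPartition MaxPartition
  • If a db/table pair has only one line of input, the min and max columns will contain the same value
  • With / as the field separator, the "last" field is ignored (__SNAPPY.gz in the sample input)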
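
If the first assumption does not hold (rows for one db/table pair are scattered through the input), the array-based alternative mentioned in that bullet could look roughly like this. It is only a sketch: the function name `minmax_partitions` and the array names are made up for illustration, and it keeps the same field positions ($7 = db, $8 = table) as the sample paths.

```shell
#!/usr/bin/env bash
# Hypothetical helper (name is an assumption, not part of the original answer):
# reads HDFS paths on stdin and prints one "db table min max" line per
# db/table pair, even when the rows of a pair are not on consecutive lines.
minmax_partitions() {
    awk -F'/' '
    /^hdfs/ {
        key = $7 OFS $8                  # db name and table name
        if (!(key in seen)) {            # remember first-seen order for the output
            seen[key] = 1
            order[++n] = key
        }
        if (NF < 10) next                # no partition directories on this path
        part = $9 ""                     # appending "" forces string comparison
        for (i = 10; i < NF; i++)        # rejoin partition fields, skip the file name
            part = part FS $i
        if (!(key in minp) || part < minp[key]) minp[key] = part
        if (!(key in maxp) || part > maxp[key]) maxp[key] = part
    }
    END {
        for (j = 1; j <= n; j++) {
            key = order[j]
            if (key in minp) print key, minp[key], maxp[key]
            else             print key
        }
    }'
}
```

It would be called as `minmax_partitions < hdfs.input`. The trade-off is memory: one min and one max string per db/table pair are held until the end of the input.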

Sample input for demonstration purposes:

$ cat hdfs.input
# no min/max for test_db/test_tbl_0

hdfs://ns/user/abc/warehouse/test_db/test_tbl_0/__SNAPPY.gz

hdfs://ns/user/abc/warehouse/test_db_a/test_tbl_1/dt=2020-11-14/__SNAPPY.gz
hdfs://ns/user/abc/warehouse/test_db_a/test_tbl_1/dt=2020-11-30/__SNAPPY.gz
hdfs://ns/user/abc/warehouse/test_db_a/test_tbl_1/dt=2020-12-16/__SNAPPY.gz

hdfs://ns/user/abc/warehouse/test_db_b/test_tbl_2/dt=2020-11-14/hh=01/__SNAPPY.gz
hdfs://ns/user/abc/warehouse/test_db_b/test_tbl_2/dt=2020-11-15/hh=02/__SNAPPY.gz
hdfs://ns/user/abc/warehouse/test_db_b/test_tbl_2/dt=2020-12-19/hh=03/__SNAPPY.gz

hdfs://ns/user/abc/warehouse/test_db_c/test_tbl_3/client=cobra/dt=2020-11-14/hh=01/__SNAPPY.gz
hdfs://ns/user/abc/warehouse/test_db_c/test_tbl_3/client=cobra/dt=2020-11-29/hh=01/__SNAPPY.gz
hdfs://ns/user/abc/warehouse/test_db_c/test_tbl_3/client=cobra/dt=2020-12-20/hh=04/__SNAPPY.gz

hdfs://ns/user/abc/warehouse/test_db_d/test_tbl_4/service=mobile/geo=us/dt=2020-11-14/hh=01/__SNAPPY.gz
hdfs://ns/user/abc/warehouse/test_db_d/test_tbl_4/service=mobile/geo=us/dt=2020-11-20/hh=01/__SNAPPY.gz
hdfs://ns/user/abc/warehouse/test_db_d/test_tbl_4/service=mobile/geo=us/dt=2020-12-13/hh=21/__SNAPPY.gz

hdfs://ns/user/abc/warehouse/test_db_b/test_tbl_2/dt=2019-03-12/hh=00/__SNAPPY.gz
hdfs://ns/user/abc/warehouse/test_db_b/test_tbl_2/dt=2021-07-28/hh=22/__SNAPPY.gz

# min=max for test_db_b/test_tbl_7

hdfs://ns/user/abc/warehouse/test_db_b/test_tbl_7/dt=2021-07-28/hh=22/__SNAPPY.gz

One awk idea:

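awk -F'/' '
# print the finished db/table pair (if any) and reset the running min/max
function printline() {
        if ( dbname != "") print dbname, tabname, minpart, maxpart
        minpart = maxpart = ""
}
# with "/" as separator: $7 = db name, $8 = table name, $9..$(NF-1) = partition dirs
/^hdfs/ { if ( $7 != dbname || $8 != tabname )   # new db/table pair: flush the previous one
             printline()
          dbname = $7
          tabname = $8
          if ( $10 == "" ) {                      # no partition dirs ($9 is the file name)
             minpart = maxpart = ""
             next
          }
          pfx = ""
          currpart = ""
          for (i=9; i<NF; i++) {                  # rejoin the partition fields, skipping the file name
              currpart = currpart pfx $i
              pfx=FS
          }
          # string comparison; it orders correctly here because the sample values are zero-padded
          minpart = ( (minpart == "") || (currpart < minpart) ) ? currpart : minpart
          maxpart = ( (maxpart == "") || (currpart > maxpart) ) ? currpart : maxpart
        }
END     { printline() }                           # flush the last pair
' hdfs.input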

This produces:

test_db test_tbl_0
test_db_a test_tbl_1 dt=2020-11-14 dt=2020-12-16
test_db_b test_tbl_2 dt=2020-11-14/hh=01 dt=2020-12-19/hh=03
test_db_c test_tbl_3 client=cobra/dt=2020-11-14/hh=01 client=cobra/dt=2020-12-20/hh=04
test_db_d test_tbl_4 service=mobile/geo=us/dt=2020-11-14/hh=01 service=mobile/geo=us/dt=2020-12-13/hh=21
test_db_b test_tbl_2 dt=2019-03-12/hh=00 dt=2021-07-28/hh=22
test_db_b test_tbl_7 dt=2021-07-28/hh=22 dt=2021-07-28/hh=22
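
To feed a live listing into the script instead of `hdfs.input`, something like the filter below might work. It is a sketch under assumptions I have not verified against a cluster: that `hdfs dfs -ls -R` prints the full path as the 8th column (as in the output shown in the question), that regular files have a permission string starting with `-`, and that paths contain no whitespace. The name `ls_to_paths` is made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical filter (name is an assumption): keeps only regular-file entries
# from `hdfs dfs -ls -R` style output and prints their paths, one per line.
# Summary lines ("Found N items") and directory entries ("d...") are dropped.
ls_to_paths() {
    awk '$1 ~ /^-/ && NF >= 8 { print $8 }'
}
```

A possible pipeline would then be `hdfs dfs -ls -R hdfs://ns/user/abc/warehouse | ls_to_paths | sort`, piped into the awk script above in place of the `hdfs.input` argument. The `sort` puts all rows of a db/table pair on consecutive lines, which is what the first assumption requires.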