I have various tables in Hive, with anywhere from 0 to at most 4 partition columns.
Below are the HDFS representations of a few tables whose partition columns range from 0 to 4.
-- type-0 <no partitions>
hdfs://ns/user/abc/warehouse/test_db/test_tbl_0/__SNAPPY.gz
-- type-1 <1 partition column in table = dt>
hdfs://ns/user/abc/warehouse/test_db_a/test_tbl_1/dt=2020-11-14/__SNAPPY.gz
...
hdfs://ns/user/abc/warehouse/test_db_a/test_tbl_1/dt=2020-11-30/__SNAPPY.gz
hdfs://ns/user/abc/warehouse/test_db_a/test_tbl_1/dt=2020-12-16/__SNAPPY.gz
-- type-2 <2 partition columns in table = dt, hh>
hdfs://ns/user/abc/warehouse/test_db_b/test_tbl_2/dt=2020-11-14/hh=01/__SNAPPY.gz
...
hdfs://ns/user/abc/warehouse/test_db_b/test_tbl_2/dt=2020-11-15/hh=02/__SNAPPY.gz
hdfs://ns/user/abc/warehouse/test_db_b/test_tbl_2/dt=2020-12-19/hh=03/__SNAPPY.gz
-- type-3 <3 partition columns in table = client, dt, hh>
hdfs://ns/user/abc/warehouse/test_db_c/test_tbl_3/client=cobra/dt=2020-11-14/hh=01/__SNAPPY.gz
hdfs://ns/user/abc/warehouse/test_db_c/test_tbl_3/client=cobra/dt=2020-11-29/hh=01/__SNAPPY.gz
...
hdfs://ns/user/abc/warehouse/test_db_c/test_tbl_3/client=cobra/dt=2020-12-20/hh=04/__SNAPPY.gz
-- type-4 <4 partition columns in table = service, geo, dt, hh>
hdfs://ns/user/abc/warehouse/test_db_d/test_tbl_4/service=mobile/geo=us/dt=2020-11-14/hh=01/__SNAPPY.gz
hdfs://ns/user/abc/warehouse/test_db_d/test_tbl_4/service=mobile/geo=us/dt=2020-11-20/hh=01/__SNAPPY.gz
...
hdfs://ns/user/abc/warehouse/test_db_d/test_tbl_4/service=mobile/geo=us/dt=2020-12-13/hh=21/__SNAPPY.gz
Expected output for types 0 through 4 (as requested by markp-fuso):
DBName TableName MIN_PARTITION(s) MAX_PARTITION(s)
test_db test_tbl_0
test_db_a test_tbl_1 dt=2020-11-14 dt=2020-12-16
test_db_b test_tbl_2 dt=2020-11-14/hh=01 dt=2020-12-19/hh=03
test_db_c test_tbl_3 client=cobra/dt=2020-11-14/hh=01 client=cobra/dt=2020-12-20/hh=04
test_db_d test_tbl_4 service=mobile/geo=us/dt=2020-11-14/hh=01 service=mobile/geo=us/dt=2020-12-13/hh=21
Below is what I tried for type-2.
## Getting the minimum and maximum partition lines;
### here I am removing hdfs output lines like 'Found 20 items'
hdfs dfs -ls 'hdfs://ns/user/abc/warehouse/test_db_b/test_tbl_2/*' | grep -v '^Found' | sort -k6,7 | awk '{print $8}' | (head -n1 && tail -n1)
/*
hdfs://ns/user/abc/warehouse/test_db_b/test_tbl_2/dt=2019-03-12/hh=00
hdfs://ns/user/abc/warehouse/test_db_b/test_tbl_2/dt=2021-07-28/hh=22
*/
## Here I am further trying to simplify the output
hdfs dfs -ls 'hdfs://ns/user/abc/warehouse/test_db_b/test_tbl_2/*' | grep -v '^Found' | sort -k6,7 | awk '{print $8}' | (head -n1 && tail -n1) | awk -F'/' '{print $(NF-2),$(NF-1),$NF}' | sed ':a;N;$!ba;s/\n/ /g'
/*
test_tbl_2 dt=2019-03-12 hh=00 test_tbl_2 dt=2021-07-28 hh=22
*/
As seen above, I get output in the following format:
TableName MIN_PARTITION(s) TableName MAX_PARTITION(s)
Also, I have only tested the approach above on a table with 2 partition columns. Is there a generic
bash hack that gives me the following format, regardless of how many partition columns a table has?
DBName TableName MIN_PARTITION(s) MAX_PARTITION(s)
1 Answer:
Answer (score: 1):
UPDATE: the question has been updated with more sample input and the matching (desired) output.
Assumptions:
- input for a given db/table pair is on consecutive lines, so we can generate output once we have exhausted the input for that pair (otherwise we would need to hold all of the data in memory, e.g. in arrays, and print all output only after the entire input stream is exhausted)
- the output format has 4 columns:
DBName TableName MinPartition MaxPartition
- if a db/table pair has only one line of input, the min and max columns will contain the same value
- using / as the field delimiter, the 'last' field (__SNAPPY.gz in the sample input) is ignored (see the field-layout check below)
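To make the field positions explicit, here is a quick illustrative check (my addition, not part of the original answer) that splits one of the sample URIs on /: the DBName lands in field 7, the TableName in field 8, the partition directories in fields 9 through NF-1, and the trailing data file in field NF.
$ echo 'hdfs://ns/user/abc/warehouse/test_db_b/test_tbl_2/dt=2020-11-14/hh=01/__SNAPPY.gz' |
  awk -F'/' '{print "db="$7, "table="$8, "partitions="$9"/"$10, "ignored="$NF}'
db=test_db_b table=test_tbl_2 partitions=dt=2020-11-14/hh=01 ignored=__SNAPPY.gz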
Sample input for demo purposes:
$ cat hdfs.input
# no min/max for test_db/test_tbl_0
hdfs://ns/user/abc/warehouse/test_db/test_tbl_0/__SNAPPY.gz
hdfs://ns/user/abc/warehouse/test_db_a/test_tbl_1/dt=2020-11-14/__SNAPPY.gz
hdfs://ns/user/abc/warehouse/test_db_a/test_tbl_1/dt=2020-11-30/__SNAPPY.gz
hdfs://ns/user/abc/warehouse/test_db_a/test_tbl_1/dt=2020-12-16/__SNAPPY.gz
hdfs://ns/user/abc/warehouse/test_db_b/test_tbl_2/dt=2020-11-14/hh=01/__SNAPPY.gz
hdfs://ns/user/abc/warehouse/test_db_b/test_tbl_2/dt=2020-11-15/hh=02/__SNAPPY.gz
hdfs://ns/user/abc/warehouse/test_db_b/test_tbl_2/dt=2020-12-19/hh=03/__SNAPPY.gz
hdfs://ns/user/abc/warehouse/test_db_c/test_tbl_3/client=cobra/dt=2020-11-14/hh=01/__SNAPPY.gz
hdfs://ns/user/abc/warehouse/test_db_c/test_tbl_3/client=cobra/dt=2020-11-29/hh=01/__SNAPPY.gz
hdfs://ns/user/abc/warehouse/test_db_c/test_tbl_3/client=cobra/dt=2020-12-20/hh=04/__SNAPPY.gz
hdfs://ns/user/abc/warehouse/test_db_d/test_tbl_4/service=mobile/geo=us/dt=2020-11-14/hh=01/__SNAPPY.gz
hdfs://ns/user/abc/warehouse/test_db_d/test_tbl_4/service=mobile/geo=us/dt=2020-11-20/hh=01/__SNAPPY.gz
hdfs://ns/user/abc/warehouse/test_db_d/test_tbl_4/service=mobile/geo=us/dt=2020-12-13/hh=21/__SNAPPY.gz
hdfs://ns/user/abc/warehouse/test_db_b/test_tbl_2/dt=2019-03-12/hh=00/__SNAPPY.gz
hdfs://ns/user/abc/warehouse/test_db_b/test_tbl_2/dt=2021-07-28/hh=22/__SNAPPY.gz
# min=max for test_db_b/test_tbl_7
hdfs://ns/user/abc/warehouse/test_db_b/test_tbl_7/dt=2021-07-28/hh=22/__SNAPPY.gz
One awk idea:
awk -F'/' '
# print the current db/table pair (if we have one) and reset the min/max trackers
function printline() {
    if ( dbname != "" ) print dbname, tabname, minpart, maxpart
    minpart = maxpart = ""
}

/^hdfs/ { if ( $7 != dbname || $8 != tabname )        # new db/table pair?
              printline()                             # flush the previous pair first

          dbname  = $7
          tabname = $8

          if ( $10 == "" ) {                          # unpartitioned table: field 9 is already the data file
              minpart = maxpart = ""
              next
          }

          # rebuild the partition path from fields 9 .. NF-1 (skip the trailing data file)
          pfx      = ""
          currpart = ""
          for (i=9; i<NF; i++) {
              currpart = currpart pfx $i
              pfx      = FS
          }

          # keep the lexically smallest/largest partition strings seen so far for this pair
          minpart = ( (minpart == "") || (currpart < minpart) ) ? currpart : minpart
          maxpart = ( (maxpart == "") || (currpart > maxpart) ) ? currpart : maxpart
        }

END     { printline() }                               # flush the last pair
' hdfs.input
This produces:
test_db test_tbl_0
test_db_a test_tbl_1 dt=2020-11-14 dt=2020-12-16
test_db_b test_tbl_2 dt=2020-11-14/hh=01 dt=2020-12-19/hh=03
test_db_c test_tbl_3 client=cobra/dt=2020-11-14/hh=01 client=cobra/dt=2020-12-20/hh=04
test_db_d test_tbl_4 service=mobile/geo=us/dt=2020-11-14/hh=01 service=mobile/geo=us/dt=2020-12-13/hh=21
test_db_b test_tbl_2 dt=2019-03-12/hh=00 dt=2021-07-28/hh=22
test_db_b test_tbl_7 dt=2021-07-28/hh=22 dt=2021-07-28/hh=22
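As a usage sketch (my addition, not from the original answer): assuming the body of the awk program above is saved to a file, say min_max_partitions.awk, it could also be fed a live recursive listing instead of the prepared hdfs.input file. The data-file filter (here .gz, as in the samples) and the extra sort (to put each db/table pair on consecutive lines, per the first assumption) are assumptions about the environment:
$ hdfs dfs -ls -R 'hdfs://ns/user/abc/warehouse/' |
    grep -v '^Found' |      # drop 'Found N items' header lines
    awk '{print $NF}' |     # keep only the path column of the listing
    grep '\.gz$' |          # keep data files only (assumes .gz data files, as in the samples)
    sort |                  # group each db/table pair on consecutive lines
    awk -F'/' -f min_max_partitions.awk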