Question

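I am using Hive to run an ETL process over raw data on S3. I generate structured output data and sort it before loading it into another database (Redshift). The data needs to be loaded into Redshift in sorted order, in manageable chunks of, say, 500 million to 1 billion rows each, out of a total data set of 10 billion records.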

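I am looking for a way to have Hive sort the data and then break it into smaller, manageable chunks that can each be uploaded separately, in sorted order. So far I have not been able to come up with an approach that does this. Because I use the "Order By" clause, the number of reducers in Hive is forced to 1, so I get one humongous file! I cannot move a file that large out of S3 to decompress, split, recompress, and reload it, because I have nowhere to do all of that.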

Using "Cluster By" produces chunks that are sorted internally, but it does not guarantee any ordering between the chunks.

The sort-by key is a composite alphanumeric key, and its distinct count is far too large to partition on.

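The problem with Cluster By / Distribute By:

As I understand it, the problem with the Cluster By and Distribute By options is that rows are distributed according to a hash of the distribution key. If x < y, there is no guarantee that hash(x) < hash(y). As a result, the data is not sorted across the generated files.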
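The point about hashing can be illustrated with a toy script. This is only a sketch: `cksum` stands in for whatever hash function Hive applies (an assumption made for illustration), the bucket count of 4 is arbitrary, and the keys are made up.

```shell
#!/bin/sh
# Toy model of hash-based distribution: bucket = checksum(key) mod 4.
# cksum is a stand-in for Hive's real hash function (assumption).
bucket_of() {
    # cksum prints "<crc> <byte count>"; keep only the CRC
    crc=$(printf '%s' "$1" | cksum | cut -d' ' -f1)
    echo $((crc % 4))
}

# Keys are listed in sorted order; print the bucket each one lands in
for key in a01 a02 b17 c03 d44 e09 f10 g25; do
    printf '%s -> bucket %s\n' "$key" "$(bucket_of "$key")"
done
```

Running it typically shows bucket numbers that jump around rather than rising with the key, so each bucket holds keys from all over the key range.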

Answer 1

Loading to S3 from Hive: you can give Hive an external location with LOCATION 's3://<bucket>/etc' (with Order By, Hive produces one large file), and Hive will then write the output directly to S3.

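Manual loading: with Sort By, the data within each reducer is sorted. Have you tried combining it with Distribute By, so that the data is distributed on some key while it is being sorted?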

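The distribution key should be chosen so that all records that belong in one bucket end up together and none of them land in any other file.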
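The idea in this answer can be sketched outside Hive. The script below is a toy simulation in shell, not Hive itself: it routes each key to a chunk by key range instead of by hash, sorts each chunk on its own, and checks that the chunks concatenate into a globally sorted stream. The sample keys and the range boundaries (a-h, i-p, q-z) are assumptions made for illustration, and how a range-valued distribution key maps onto Hive reducers and output files would still need to be verified separately.

```shell
#!/bin/sh
set -eu
work=$(mktemp -d)

# Unsorted sample keys (made up for illustration)
printf '%s\n' m42 a07 z13 c99 q01 b55 t30 h08 > "$work/input.txt"

# "Distribute by": send each key to a chunk chosen by the range of its
# first character, so chunk ranges never overlap
LC_ALL=C awk -v dir="$work" '{
    c = substr($0, 1, 1)
    if (c < "i")      n = 0
    else if (c < "q") n = 1
    else              n = 2
    print > (dir "/chunk_" n ".txt")
}' "$work/input.txt"

# "Sort by": sort every chunk independently
for f in "$work"/chunk_*.txt; do
    LC_ALL=C sort -o "$f" "$f"
done

# Concatenating the chunks in chunk order yields a globally sorted stream
cat "$work/chunk_0.txt" "$work/chunk_1.txt" "$work/chunk_2.txt" > "$work/all.txt"
LC_ALL=C sort -c "$work/all.txt" && echo "globally sorted"
```

Because every key in chunk 0 is smaller than every key in chunk 1, and so on, each chunk could be uploaded on its own, in chunk order.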

Answer 2

You could try bucketing the table, which creates a number of partitions of roughly equal size that are easier to work with.

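 -- Bucketed copy of the data; the CLUSTERED BY column list needs parentheses
 create table mytable (
 record_ID string,
 var1 double
 )
 clustered by (record_ID) into 100 buckets;

 -- Needed on older Hive versions; from Hive 2.0 on, bucketing is always enforced
 set hive.enforce.bucketing = true;

 -- Populate the bucketed table from the source table
 from my_other_table
 insert overwrite table mytable
 select *;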

Alternatively, you could generate a random number and partition on that. This is just as easy with the rand() UDF in Hive.

Answer 3

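One possible solution is to run the split command (from bash) on the fully sorted output to break it into smaller files.

The following is taken from the man page: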

NAME
       split - split a file into pieces

SYNOPSIS
       split [OPTION]... [INPUT [PREFIX]]

DESCRIPTION
       Output  fixed-size  pieces  of  INPUT  to PREFIXaa, PREFIXab, ...; default size is 1000 lines, and default PREFIX is 'x'.  With no INPUT, or when INPUT is -, read
       standard input.

       Mandatory arguments to long options are mandatory for short options too.

       -a, --suffix-length=N
              use suffixes of length N (default 2)

       -b, --bytes=SIZE
              put SIZE bytes per output file

       -C, --line-bytes=SIZE
              put at most SIZE bytes of lines per output file

       -d, --numeric-suffixes
              use numeric suffixes instead of alphabetic

       -l, --lines=NUMBER
              put NUMBER lines per output file

       --verbose
              print a diagnostic just before each output file is opened

       --help display this help and exit

       --version
              output version information and exit

       SIZE may be (or may be an integer optionally followed by) one of following: KB 1000, K 1024, MB 1000*1000, M 1024*1024, and so on for G, T, P, E, Z, Y.

So, something like:

split -l 5000000000 filename

might work.
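As a rough sketch of how this could work without keeping every uncompressed chunk on disk, split can read from standard input and pipe each chunk through a command. The example below assumes GNU coreutils (`--filter` is a GNU extension, and split sets `$FILE` to the chunk name); the tiny 10-line file and the `chunk_` prefix are placeholders for illustration.

```shell
#!/bin/sh
set -eu
work=$(mktemp -d)
cd "$work"

# Stand-in for the sorted Hive output (the real file would be far larger)
seq -w 1 10 > sorted.txt

# Read from stdin, 4 lines per chunk, numeric suffixes,
# and gzip each chunk as it is written
split -l 4 -d --filter='gzip > "$FILE.gz"' - chunk_ < sorted.txt

ls chunk_*.gz

# Chunk names sort lexically, so concatenating them restores the order
gzip -dc chunk_*.gz | cmp - sorted.txt && echo "order preserved"
```

If the AWS CLI is available, the same pattern might be extended to stream to and from S3 with `aws s3 cp`, which accepts `-` for standard input and output, so that no chunk has to be stored locally; that part is untested here.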

HIVE - Splitting a large ordered query result set into multiple sequential files