Question

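I am writing automated scripts that run a few queries in Hive, and we found that from time to time we need to clear the data from a table and insert new data. We are wondering which approach is faster. Is it this: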

DROP TABLE SOME_TABLE;
CREATE TABLE SOME_TABLE (STUFFS);
INSERT INTO TABLE SOME_TABLE
    SELECT * FROM OTHER_TABLE;

Or is it faster to do this:

INSERT OVERWRITE TABLE SOME_TABLE
    SELECT * FROM OTHER_TABLE;

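The overhead of running the queries is not an issue, since we also have the creation script. The question is: with billions of rows, is DROP TABLE SOME_TABLE; CREATE TABLE SOME_TABLE (STUFFS); INSERT INTO TABLE SOME_TABLE SELECT * FROM OTHER_TABLE; faster than INSERT OVERWRITE?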

Answer 1

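For maximum speed, I would suggest: 1) first issue hadoop fs -rm -r -skipTrash table_dir/* to remove the old data quickly without moving the files to the trash, because INSERT OVERWRITE moves all the old files to the trash, and for a very large table that takes a long time. Then 2) run the INSERT OVERWRITE command. This is also faster because you do not need to drop and re-create the table.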
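The two steps above could be scripted roughly as follows. This is a minimal sketch, not a tested production script: the helper names run and refresh_table, the DRY_RUN switch, and the example warehouse path are assumptions made for illustration. Only the hadoop fs and INSERT OVERWRITE commands come from the answer itself. Because -skipTrash deletes data permanently, the sketch only prints the commands unless DRY_RUN=0 is set.

```shell
#!/usr/bin/env bash
# Sketch only: helper names, DRY_RUN, and the example path are hypothetical.
set -eu

# DRY_RUN=1 (the default here) prints each command instead of running it.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Usage: refresh_table TARGET_TABLE SOURCE_TABLE TABLE_DIR
refresh_table() {
  local target="$1" src="$2" table_dir="$3"
  # Step 1: delete the old files directly, bypassing the trash.
  # The glob is quoted so the local shell does not expand it;
  # hadoop fs expands it against HDFS.
  run hadoop fs -rm -r -skipTrash "${table_dir}/*"
  # Step 2: reload the data; the table definition is left untouched.
  run hive -e "INSERT OVERWRITE TABLE ${target} SELECT * FROM ${src};"
}

# Example call with the placeholder table names from the question.
refresh_table SOME_TABLE OTHER_TABLE /user/hive/warehouse/some_table
```

Run it once as shown to review the printed commands, then rerun with DRY_RUN=0 to execute them. The table directory must be the table's actual HDFS location, which may differ from the path used here.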

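Update:

As of Hive 2.3.0 (HIVE-15880), if the table has TBLPROPERTIES ("auto.purge"="true"), the previous data of the table is not moved to the trash when an INSERT OVERWRITE query is run against it. This functionality applies only to managed tables. So INSERT OVERWRITE with auto purge will be faster than rm -skipTrash + INSERT OVERWRITE or DROP + CREATE + INSERT, because it is a single Hive-only command.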
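As an illustration of the auto.purge route, the property can be set with ALTER TABLE before the overwrite. This is a sketch under assumptions: the function name purge_overwrite and the DRY_RUN switch are hypothetical, the table names are the placeholders from the question, and it presumes Hive 2.3.0 or later and a managed table, as stated above.

```shell
#!/usr/bin/env bash
# Sketch only: purge_overwrite and DRY_RUN are hypothetical names.
set -eu

# DRY_RUN=1 (the default here) prints the hive command instead of running it.
DRY_RUN="${DRY_RUN:-1}"

# Usage: purge_overwrite TARGET_TABLE SOURCE_TABLE
purge_overwrite() {
  local target="$1" src="$2"
  # Enable auto.purge, then overwrite; old files skip the trash.
  local sql="ALTER TABLE ${target} SET TBLPROPERTIES ('auto.purge'='true'); INSERT OVERWRITE TABLE ${target} SELECT * FROM ${src};"
  if [ "$DRY_RUN" = "1" ]; then
    printf 'hive -e "%s"\n' "$sql"
  else
    hive -e "$sql"
  fi
}

# Example call with the placeholder table names from the question.
purge_overwrite SOME_TABLE OTHER_TABLE
```

The table property persists, so the ALTER TABLE statement only needs to run once; later refreshes can issue the INSERT OVERWRITE alone.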

Answer 2

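One edge consideration is that if the schema changes, INSERT OVERWRITE will fail, while DROP + CREATE + INSERT will not. This is unlikely to matter in most scenarios, but it may be worth keeping in mind if you are prototyping the workflow or the table schema.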

HIVE - INSERT OVERWRITE vs DROP TABLE + CREATE TABLE + INSERT INTO
