Question

I need to process several months of data at once. Is there an option to point an external table at multiple folders? For example: Create external table logdata(col1 string, col2 string........) location s3://logdata/april, s3://logdata/march

Answer 1

Short answer: no. The location of a Hive external table must be unique at creation time; the metastore needs this to know where your table lives.

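That said, you can probably get what you want by using partitions: you can specify a location for each partition, which seems to be what you ultimately want since you are splitting by month.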

So create your table like this:

create external table logdata(col1 string, col2 string) partitioned by (month string) location 's3://logdata'

Then you can add a partition like this:

alter table logdata add partition(month='april') location 's3://logdata/april'

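Do this for every month. You can then query the table for whichever partitions you want, and Hive will only look at the directories whose data you actually need (for example, if you only process April and June, Hive will not read the directories of the other months).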
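As a minimal sketch of the steps above, the statements below register a second month and then query only selected partitions. The march path comes from the question; the `if not exists` clause and the `show partitions` check are additions for illustration, not part of the original answer.

```sql
-- Register another month; assumes the s3://logdata/march directory from the question exists.
alter table logdata add if not exists partition (month='march') location 's3://logdata/march';

-- List the partitions the metastore currently knows about.
show partitions logdata;

-- Filtering on the partition column lets Hive prune to the matching directories only.
select col1, col2
from logdata
where month in ('march', 'april');
```

The filter on `month` is what limits the scan: partitions that do not match the predicate should not be read.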

Answer 2

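I tested your scenario. I think you can achieve this by using multiple load inpath statements to pull in data from multiple locations. Below are the steps I followed in my test.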

hive> create external table xxx (uid int, name string, dept string) row format delimited fields terminated by '\t' stored as textfile;
hive> load data inpath '/input/tmp/user_bckt' into table xxx;
hive> load data inpath '/input/user_bckt' into table xxx;
hive> select count(*) from xxx;
10
hive> select * from xxx;
1   ankur   abinitio
2   lokesh  cloud
3   yadav   network
4   sahu    td
5   ankit   data
1   ankur   abinitio
2   lokesh  cloud
3   yadav   network
4   sahu    td
5   ankit   data

Let me know if this doesn't work for you.

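Edit: I just checked, and the data is being moved into the Hive warehouse, whereas the whole idea of an external table is that the data stays in its original location. See below: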

hduser@hadoopnn:~$ hls /input/tmp
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

14/10/05 14:47:18 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 2 items
-rw-r--r--   1 hduser hadoop         93 2014-10-04 18:54 /input/tmp/dept_bckt
-rw-r--r--   1 hduser hadoop         71 2014-10-04 18:54 /input/tmp/user_bckt
hduser@hadoopnn:~$ hcp /input/tmp/user_bckt /input/user_bckt
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

14/10/05 14:47:44 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
hduser@hadoopnn:~$ logout
Connection to nn closed.
hduser@hadoopdn2:~$ hls /input/tmp/
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

14/10/05 15:05:47 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 1 items
-rw-r--r--   1 hduser hadoop         93 2014-10-04 18:54 /input/tmp/dept_bckt
hduser@hadoopdn2:~$ hls /hive/wh/xxx
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

14/10/05 15:21:54 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 2 items
-rw-r--r--   1 hduser hadoop         71 2014-10-04 18:54 /hive/wh/xxx/user_bckt
-rw-r--r--   1 hduser hadoop         71 2014-10-05 14:47 /hive/wh/xxx/user_bckt_copy_1

I am currently looking into this and will get back once I'm done.
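One way to see why the files ended up under /hive/wh/xxx is to inspect the table metadata. The sketch below is a suggestion rather than part of the original test; the table name xxx_in_place and the directory /data/users are hypothetical.

```sql
-- Prints the table's Location and Table Type (EXTERNAL_TABLE or MANAGED_TABLE).
-- The table above was created without a location clause, so its location
-- is a directory under the warehouse.
describe formatted xxx;

-- Hypothetical alternative: give the external table an explicit location so
-- that files already in that directory are read in place.
-- '/data/users' is an assumed path, not one from the test above.
create external table xxx_in_place (uid int, name string, dept string)
row format delimited fields terminated by '\t'
stored as textfile
location '/data/users';
```

Note that load data inpath moves the source files into the table's location either way, so it is not a way to make one table read from several directories in place.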

Answer 3

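No, the location must be a single directory. You can alter the location so that it points to multiple directories, but querying the table will then fail with an error.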

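Example: 1. Change the table's location as shown below. I entered two HDFS directories separated by ':', and also tried ',' and ';'. The statement succeeded.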

hive> alter table ext set location 'hdfs:///solytr:/ext';
OK
Time taken: 0.086 seconds

However, querying the table then fails.

hive> select * from ext;
OK
Failed with exception java.io.IOException:java.lang.IllegalArgumentException: Pathname /solytr:/ext from hdfs:/solytr:/ext is not a valid DFS filename.
Time taken: 0.057 seconds

Answer 4

Take a look at SymlinkTextInputFormat / https://issues.apache.org/jira/browse/HIVE-1272. I think this can solve your problem. You just need to maintain a separate text file that lists all the locations!
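A rough sketch of how a SymlinkTextInputFormat table might be declared follows. The table name logdata_symlink and the directory s3://logdata/symlinks are hypothetical; that directory is assumed to hold one or more plain text files, each listing one data path per line. Whether this works against S3 depends on your Hive version and filesystem setup, so treat it as a starting point to test.

```sql
-- The table location points at the directory of symlink (manifest) files,
-- not at the data itself.
create external table logdata_symlink (col1 string, col2 string)
stored as
  inputformat 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
  outputformat 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
location 's3://logdata/symlinks';

-- Queries read the paths listed in the symlink files.
select count(*) from logdata_symlink;
```

Adding or dropping a month then means editing the text file rather than altering the table.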

See also https://issues.apache.org/jira/browse/HIVE-951, which is unresolved but would be a solution!

Can I point multiple locations to the same Hive external table?
