Question

I've run into some trouble.

I first tried to use Sqoop to import data from Postgres directly into Hive, with the following command:

sqoop import --direct --connect jdbc:postgresql://xx.xxx.xxx.xx:yyyy/database --username username -P --table mytable -- --schema myschema  --hive-import  --hive-overwrite --hive-table mytable --verbose

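This transferred the data to my HDFS path, but running "SHOW TABLES;" in Hive (and SHOW SCHEMAS; and SHOW DATABASES;, just in case) showed nothing.

So I tried a different approach: import the data straight into HDFS, then create a Hive table on top of it.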

sqoop import --target-dir /path/to/mytable/ --fields-terminated-by , --escaped-by \\ --enclosed-by '\"' --connect jdbc:postgresql://xx.xxx.xxx.xx:yyyy/database --username username -P --split-by id --query 'select * from mytable  where $CONDITIONS limit 100' -- --schema myschema  --verbose

The output file looks like "col1","col2","col3"...

The statement I now use to create the Hive table is this:

create external table tbl_staging_sms(
col1 int,
col2 bigint,
col3 text,
col4 text,
..
..
..
)
 ROW FORMAT DELIMITED FIELDS TERMINATED BY '\054'
location '/path/to/mytable';

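The main problem is that some columns hold free-flowing text that can contain any sequence of characters, including the delimiter, which defaults to a comma. (Picking a rare delimiter when importing from Postgres with Sqoop isn't enough, because this is user input and users can type anything they want.) How can I run Hive queries over the Postgres data in HDFS without having to worry about corrupted columns?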

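I know the delimiter in the example I gave still won't work as expected, because the double quotes will be treated as part of the column, but I'm not sure what else to do.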
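
To make the problem concrete, here is a minimal sketch with a made-up row (the values and the helper function names are hypothetical, not taken from my data). It splits the row on every comma and ignores the quotes, which is roughly what I understand a plain comma-delimited reader to do:

```shell
#!/bin/sh
# Hypothetical sample row in the same shape as the Sqoop output:
# every field enclosed in double quotes, fields separated by commas.
# The third field is free text that itself contains a comma.
row='"1","42","hello, world","ok"'

# Count fields by splitting on every comma, ignoring the quotes.
count_naive_fields() {
    printf '%s\n' "$1" | awk -F',' '{ print NF }'
}

# Return the third field as that same plain comma split sees it.
naive_third_field() {
    printf '%s\n' "$1" | awk -F',' '{ print $3 }'
}

count_naive_fields "$row"   # prints 5, although the row has 4 logical columns
naive_third_field "$row"    # prints "hello  (opening quote kept, text cut off)
```

The embedded comma shifts every later column by one, and the quote characters stay in the values.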

Any ideas would be great.

Thanks in advance.

Sqoop, HDFS, Hive, and Postgres with free-form text

0 answers: