Question

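I am exploring Sqoop to send data from Hive to an RDBMS. I don't want to send the same data again and again. I need to identify the changes in HDFS and send only the data that has changed since the last export. What is the best way to implement this kind of incremental export logic? I see that sqoop import has incremental options, but I don't see any for export.

Any suggestions are much appreciated.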

Answer 1

You can create a new table or view in Hive (TABLE_NAME_CHANGED) that holds only the changed records, and use it as the source for the export to the RDBMS.
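As a rough sketch of that idea, the helper below only prints the HiveQL that would rebuild such a table. Everything specific in it is an assumption for illustration: the `last_updated` column, the `hiveSchema.hiveTable` source, and the function name `changed_table_sql` are hypothetical, and how you decide which rows count as "changed" depends on your data. A table is used rather than a view because `sqoop export --export-dir` reads files from HDFS.

```shell
# Hypothetical helper: print HiveQL that rebuilds a table holding only the rows
# changed since the last export. hiveSchema.hiveTable and last_updated are
# placeholder names, not taken from a real schema.
changed_table_sql() {
  last_export="$1"   # timestamp of the previous successful export
  cat <<EOF
DROP TABLE IF EXISTS hiveSchema.TABLE_NAME_CHANGED;
CREATE TABLE hiveSchema.TABLE_NAME_CHANGED
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
AS SELECT * FROM hiveSchema.hiveTable
WHERE last_updated > '${last_export}';
EOF
}

# Possible usage (assumes beeline is installed and ConnString is set):
#   beeline -u "${ConnString}" -e "$(changed_table_sql '2020-01-01 00:00:00')"
```

The comma-delimited text format is chosen so that the table's HDFS directory could then be passed to `sqoop export` with `--fields-terminated-by ","`.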

Answer 2

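If you have a timestamp field in Hive that identifies the delta, you can implement an incremental export as follows.

Before each export, check the max timestamp already present in the RDBMS and use it to create the export file.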

## Check the max timestamp in the RDBMS
# You may need to adjust the awk line number (NR==6) to match the output your sqoop version produces
mxdt=$(sqoop eval --connect 'jdbc:oracle:thin:@HOST:PORT/SSID' --username hadoop --password hadoop --query "select max(timestamp_field) from schema.table" | awk 'NR==6{print;exit}' | sed 's/|//g' | sed 's/[^[:print:]]//g' | sed 's/ //g')

# Based on the mxdt variable, create a file from beeline/hive as below
# (overwrite with > so rows from earlier runs are not exported again; quote ${mxdt} in the query if the value is not numeric)
beeline -u "${ConnString}" --outputformat=csv2 --showHeader=false --silent=true --nullemptystring=true --incremental=true -e "select * from hiveSchema.hiveTable where timestamp > ${mxdt}" > /SomeLocalPath/FileName.csv

# Copy the file to HDFS (-f overwrites the copy left by a previous run)

hdfs dfs -put -f /SomeLocalPath/FileName.csv /tmp/

# Now use the file in HDFS to do the sqoop export
sqoop export --connect 'jdbc:oracle:thin:@HOST:PORT/SSID' --username hadoop --password hadoop --export-dir '/tmp/FileName.csv' --table RDBMSSCHEMA.RDBMSTABLE --fields-terminated-by "," --lines-terminated-by "\n" -m 1 --columns "col1,col2"
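The `NR==6` step in the max-timestamp command depends on how many log lines sqoop prints before the result table, which can vary. A less position-dependent sketch is shown below; it assumes `sqoop eval` prints its result as a pipe-bordered table with one header row followed by the data row. The function name `extract_scalar` is hypothetical, and unlike the original pipeline it keeps the space inside a value such as `2020-01-01 00:00:00.0`.

```shell
# Hypothetical helper: read `sqoop eval` output on stdin and print the value of
# a single-column, single-row result. Assumes table rows start with a pipe and
# that the first such row is the column header.
extract_scalar() {
  awk '
    /^\|/ {
      rows++
      if (rows == 2) {                  # row 1 is the header, row 2 the value
        gsub(/\|/, "")                  # drop the pipe borders
        gsub(/^[ \t]+|[ \t\r]+$/, "")   # trim surrounding whitespace
        print
        exit
      }
    }
  '
}

# Possible usage (same placeholders as in the answer):
#   mxdt=$(sqoop eval --connect 'jdbc:oracle:thin:@HOST:PORT/SSID' --username hadoop --password hadoop \
#            --query "select max(timestamp_field) from schema.table" | extract_scalar)
```

If the target table is empty, the data row would contain whatever sqoop prints for NULL, so the calling script should check for that before building the Hive query.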
