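I am using the Oozie Sqoop action to import data into a data lake. I need one HDFS folder per table of the source database, and I have more than 300 tables.
I could hardcode all 300 Sqoop actions in a single workflow, but then the workflow is too large for the Oozie configuration: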
Error submitting job /user/me/workflow.xml
E0736: Workflow definition length [107,123] exceeded maximum allowed length [100,000]
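Having such a large file is not a good idea anyway: it slows the system down (the definition is stored in a database) and it is hard to maintain.
The question is: how can I call a sub-workflow for each table name?
The equivalent shell script would be something like: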
while read -r TABLE; do
    sqoop import --connect "${CONNECT}" --username "${USERNAME}" --password "${PASSWORD}" --table "${TABLE}" --target-dir "${HDFS_LOCATION}/${TABLE}" --num-mappers "${NUM_MAPPERS}"
done < tables.data
where tables.data contains a list of table names, which is a subset of the tables in the source database. For example:
TABLE_ONE
TABLE_TWO
TABLE_SIX
TABLE_TEN
Here is the sub-workflow I would like to call for each table:
<workflow-app name="sub-workflow-import-table" xmlns="uri:oozie:workflow:0.5">
    <start to="sqoop-import"/>
    <action name="sqoop-import">
        <sqoop xmlns="uri:oozie:sqoop-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <command>import --connect ${CONNECT} --username ${USERNAME} --password ${PASSWORD} --table ${TABLE} --target-dir ${HDFS_LOCATION}/${TABLE} --num-mappers ${NUM_MAPPERS}</command>
        </sqoop>
        <ok to="end"/>
        <error to="log-and-kill"/>
    </action>
    <end name="end"/>
    <kill name="log-and-kill">
        <message>Workflow failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
</workflow-app>
Let me know if you need more details. Thanks! David
Answer 0 (score: 3)
There's sadly no way to do this nicely in Oozie - you'd need to hardcode all 300 Sqoop actions into an Oozie XML. This is because Oozie deals with directed acyclic graphs, which means loops (like your shell script) don't have an Oozie equivalent.
However, I don't think Oozie is the right tool here. Oozie requires one container per action to use as a launcher, which means your cluster will need to allocate 300 additional containers over the course of a single run. This can effectively deadlock a cluster, because you end up in situations where the launchers hold the containers that the actual jobs need in order to run! I've worked on a large cluster with > 1000 tables and we used Bash there to avoid this issue.
If you do want to go ahead with this in Oozie, you can't avoid generating a workflow with 300 actions. I would do it as 300 actions rather than 300 calls to sub-workflows which each call one action, else you're going to generate even more overhead. You can either create this file manually, or preferably write some code to generate the Oozie workflow XML file given a list of tables. The latter is more flexible as it allows tables to be included or excluded on a per-run basis.
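A minimal sketch of the generator approach, in POSIX shell. It reads table names from standard input and prints a single workflow in which each Sqoop action transitions to the next one. The function names `emit_action` and `generate_workflow`, the workflow name `import-all-tables` and the `import-<TABLE>` node names are made up for illustration. The EL properties are assumed to come from the job properties as in the question, with `NUM_MAPPERS` spelled with an underscore, and the `<command>` leaves out the leading `sqoop`, as the examples in the Oozie Sqoop action documentation do. Table names are assumed to be valid Oozie node names.

```shell
#!/bin/sh
# Hypothetical helper: print one Sqoop action for table $1.
# On success the action transitions to the node named $2.
emit_action() {
    cat <<EOF
    <action name="import-$1">
        <sqoop xmlns="uri:oozie:sqoop-action:0.2">
            <job-tracker>\${jobTracker}</job-tracker>
            <name-node>\${nameNode}</name-node>
            <command>import --connect \${CONNECT} --username \${USERNAME} --password \${PASSWORD} --table $1 --target-dir \${HDFS_LOCATION}/$1 --num-mappers \${NUM_MAPPERS}</command>
        </sqoop>
        <ok to="$2"/>
        <error to="log-and-kill"/>
    </action>
EOF
}

# Hypothetical helper: read table names (one per line) from stdin and
# print a workflow that runs the imports one after another.
generate_workflow() {
    prev=""
    while IFS= read -r table || [ -n "$table" ]; do
        [ -n "$table" ] || continue          # skip blank lines
        if [ -z "$prev" ]; then
            # First table: open the workflow and point <start> at it.
            printf '<workflow-app name="import-all-tables" xmlns="uri:oozie:workflow:0.5">\n'
            printf '    <start to="import-%s"/>\n' "$table"
        else
            # The previous table's action hands over to the current one.
            emit_action "$prev" "import-$table"
        fi
        prev=$table
    done
    if [ -z "$prev" ]; then
        echo "no table names on stdin" >&2
        return 1
    fi
    # The last action goes to the end node.
    emit_action "$prev" "end"
    cat <<'EOF'
    <kill name="log-and-kill">
        <message>Workflow failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
EOF
}
```

Something like `generate_workflow < tables.data > workflow.xml` would produce the file. With 300 tables the result is still subject to the definition length limit from the question (as far as I know this is the `oozie.service.WorkflowAppService.WorkflowDefinitionMaxLength` setting in oozie-site.xml), so either that limit has to be raised or the table list has to be split.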
But as I initially said, I'd stick to Bash for this one unless you have a very very good reason.
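For the Bash route, a rough sketch of a loop that keeps going when one import fails and reports the failed tables at the end. `import_one` and `run_for_each_table` are hypothetical names. The environment variables are the ones from the question's script, with `NUM_MAPPERS` written with an underscore, since `${NUM-MAPPERS}` means "NUM, or the literal MAPPERS if NUM is unset" in shell.

```shell
#!/bin/sh
# Hypothetical helper: import a single table ($1). CONNECT, USERNAME,
# PASSWORD, HDFS_LOCATION and NUM_MAPPERS are expected in the environment.
import_one() {
    sqoop import --connect "$CONNECT" --username "$USERNAME" --password "$PASSWORD" \
        --table "$1" --target-dir "$HDFS_LOCATION/$1" --num-mappers "$NUM_MAPPERS"
}

# Hypothetical helper: run a per-table command ($2 onward) for every table
# name listed in file $1. A failure does not stop the loop; the failed
# tables are printed to stderr at the end and the function returns 1.
run_for_each_table() {
    list=$1
    shift
    failed=""
    while IFS= read -r table || [ -n "$table" ]; do
        [ -n "$table" ] || continue          # skip blank lines
        # </dev/null keeps the command from consuming the rest of the list
        "$@" "$table" </dev/null || failed="$failed $table"
    done < "$list"
    if [ -n "$failed" ]; then
        echo "failed tables:$failed" >&2
        return 1
    fi
}
```

It would be called as `run_for_each_table tables.data import_one`. Because the imports run one at a time, only one Sqoop job is competing for containers at any moment.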
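Answer 1 (score: 0)
My suggestion is to create one workflow per 50 table imports, which gives you 6 of them (7 if you have more than 300 tables). Call all of them as sub-workflows of a main (parent) workflow. That way you have a single point of control, and the individual workflows can easily be scheduled on their own.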
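As a small sketch of the batching step, the table list can be cut into files of 50 names each with the standard `split` utility, and each batch file can then drive one workflow definition. The function name `split_tables` and the `batch_` file prefix are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: split the table list $1 into files of at most
# $3 lines (default 50) under directory $2, named batch_aa, batch_ab, ...
split_tables() {
    list=$1
    outdir=$2
    size=${3:-50}
    mkdir -p "$outdir" || return 1
    split -l "$size" "$list" "$outdir/batch_"
}
```

For example, `split_tables tables.data batches` would leave 6 files in `batches/` for a list of 300 tables.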