Oozie - Setting the strategy on DistCp through action configuration

Time: 2016-06-22 18:35:31

Tags: hadoop oozie distcp

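I have a workflow with a distCp action, and it runs fairly well. However, I am now trying to change the copy strategy and cannot do so through the action arguments. The documentation on this topic is rather thin, and looking at the source code of the distCp action executor did not help.

When running distCp from the command line, I can set the copy strategy with the command-line argument -strategy {uniformsize|dynamic}.

Using that logic, I tried to do the same in the Oozie action.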

<action name="distcp-run" retry-max="3" retry-interval="1">
    <distcp xmlns="uri:oozie:distcp-action:0.2">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <configuration>
            <property>
                <name>mapreduce.job.queuename</name>
                <value>${poolName}</value>
            </property>
        </configuration>
        <arg>-Dmapreduce.job.queuename=${poolName}</arg>
        <arg>-Dmapreduce.job.name=distcp-s3-${wf:id()}</arg>
        <arg>-update</arg>
        <arg>-strategy dynamic</arg>
        <arg>${region}/d=${day2HoursAgo}/h=${hour2HoursAgo}</arg>
        <arg>${region2}/d=${day2HoursAgo}/h=${hour2HoursAgo}</arg>
        <arg>${region3}/d=${day2HoursAgo}/h=${hour2HoursAgo}</arg>
        <arg>${nameNode}${rawPath}/${partitionDate}</arg>
    </distcp>
    <ok to="join-distcp-steps"/>
    <error to="error-report"/>
</action>

However, the action fails when executed.

From stdout:

...>>> Invoking Main class now >>>

Fetching child yarn jobs
tag id : oozie-1d1fa70383587ae625b6495e30a315f7
Child yarn jobs are found - 
Main class        : org.apache.hadoop.tools.DistCp
Arguments         :
                    -Dmapreduce.job.queuename=merged
                    -Dmapreduce.job.name=distcp-s3-0000019-160622133128476-oozie-oozi-W
                    -update
                    -strategy dynamic
                    s3a://myfirstregion/d=21/h=17,s3a://mysecondregion/d=21/h=17,s3a://ttv-logs-eu/tsv/clickstream-clean/y=2016/m=06/d=21/h=17,s3a://mythirdregion/d=21/h=17
                    hdfs://myurl:8020/data/raw/2016062117
found Distcp v2 Constructor
                    public org.apache.hadoop.tools.DistCp(org.apache.hadoop.conf.Configuration,org.apache.hadoop.tools.DistCpOptions) throws java.lang.Exception

<<< Invocation of Main class completed <<<

Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.DistcpMain], main() threw exception, Returned value from distcp is non-zero (-1)
java.lang.RuntimeException: Returned value from distcp is non-zero (-1)
    at org.apache.oozie.action.hadoop.DistcpMain.run(DistcpMain.java:66)...

Looking at the syslog, it seems to pick up "-strategy dynamic" and try to put it into the source paths array:

2016-06-22 14:11:18,617 INFO [uber-SubtaskRunner] org.apache.hadoop.tools.DistCp: Input Options: DistCpOptions{atomicCommit=false, syncFolder=true, deleteMissing=false, ignoreFailures=false, maxMaps=20, sslConfigurationFile='null', copyStrategy='uniformsize', sourceFileListing=null, sourcePaths=[-strategy dynamic, s3a://myfirstregion/d=21/h=17,s3a:/mysecondregion/d=21/h=17,s3a:/ttv-logs-eu/tsv/clickstream-clean/y=2016/m=06/d=21/h=17,s3a:/mythirdregion/d=21/h=17], targetPath=hdfs://myurl:8020/data/raw/2016062117, targetPathExists=true, preserveRawXattrs=false, filtersFile='null'}
2016-06-22 14:11:18,624 INFO [uber-SubtaskRunner] org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at sandbox/10.191.5.128:8032
2016-06-22 14:11:18,655 ERROR [uber-SubtaskRunner] org.apache.hadoop.tools.DistCp: Invalid input: 
org.apache.hadoop.tools.CopyListing$InvalidInputException: -strategy dynamic doesn't exist

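So DistCpOptions does have a copyStrategy, but it is set to the default value, uniformsize. I tried moving the argument to the front, but then both -Dmapreduce arguments ended up in the source paths (although -update did not).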
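To illustrate what the log seems to show, here is a minimal sketch of an argument parser that treats each argv element as one token. It is a toy stand-in, not DistCp's real option parser; the class ArgSplitDemo, the Parsed holder, and the example paths are hypothetical. The point is only that a single element "-strategy dynamic" never matches the flag "-strategy", so it falls through to the positional paths, which matches the sourcePaths entry in the syslog.

```java
import java.util.ArrayList;
import java.util.List;

public class ArgSplitDemo {
    /** Hypothetical holder for parsed options; not the real DistCpOptions. */
    static final class Parsed {
        String strategy = "uniformsize"; // default value seen in the log
        final List<String> paths = new ArrayList<>();
    }

    /** Toy parser: "-strategy" consumes the next token, "-update" is a flag, the rest are paths. */
    static Parsed parse(String... argv) {
        Parsed p = new Parsed();
        for (int i = 0; i < argv.length; i++) {
            String a = argv[i];
            if (a.equals("-strategy") && i + 1 < argv.length) {
                p.strategy = argv[++i];
            } else if (a.equals("-update")) {
                // boolean flag, nothing to store in this sketch
            } else {
                p.paths.add(a); // anything unrecognized is treated as a path
            }
        }
        return p;
    }

    public static void main(String[] args) {
        // Flag and value fused into one element, as in <arg>-strategy dynamic</arg>
        Parsed fused = parse("-update", "-strategy dynamic", "s3a://src", "hdfs://dst");
        // prints: uniformsize [-strategy dynamic, s3a://src, hdfs://dst]
        System.out.println(fused.strategy + " " + fused.paths);

        // Flag and value as two separate elements
        Parsed split = parse("-update", "-strategy", "dynamic", "s3a://src", "hdfs://dst");
        // prints: dynamic [s3a://src, hdfs://dst]
        System.out.println(split.strategy + " " + split.paths);
    }
}
```

Whether giving the flag and its value separate arg elements resolves the problem inside the Oozie action is not confirmed anywhere in this thread, so treat that as something to test rather than a known fix.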

How can I set the copy strategy to dynamic through the Oozie workflow configuration?

Thanks.

1 answer:

Answer 0 (score: 1)

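Looking at the code, it does not seem possible to set the strategy through configuration. Instead of the distcp action, you could use a map-reduce action, which you can configure however you need.

The Oozie MapReduce Cookbook has examples.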

Looking at the DistCp code, the relevant part is around line 237 in createJob():

Job job = Job.getInstance(getConf());
job.setJobName(jobName);
job.setInputFormatClass(DistCpUtils.getStrategy(getConf(), inputOptions));
job.setJarByClass(CopyMapper.class);
configureOutputFormat(job);
job.setMapperClass(CopyMapper.class);
job.setNumReduceTasks(0);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setOutputFormatClass(CopyOutputFormat.class);
job.getConfiguration().set(JobContext.MAP_SPECULATIVE, "false");
job.getConfiguration().set(JobContext.NUM_MAPS, String.valueOf(inputOptions.getMaxMaps()));

The code above is not everything you need; you will have to read the DistCp source to work out the rest.

So you would need to configure all the properties yourself in a map-reduce action. That way you can set the InputFormatClass, which is where the strategy setting is used.

You can see the available properties for the InputFormatClass in the distcp properties file.

The one you need is org.apache.hadoop.tools.mapred.lib.DynamicInputFormat.
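As a rough sketch of what "configuring the properties yourself" could look like, the snippet below maps the createJob() setters onto the Hadoop property names I believe they correspond to and prints them as property elements for a map-reduce action's configuration block. The class DistCpActionPropsSketch and its methods are hypothetical, the property-name mapping and the two new-api switches are my assumptions and should be checked against your Hadoop and Oozie versions, and the list is deliberately incomplete: it leaves out the copy listing, the target path, and whatever else DistCp sets up outside createJob().

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class DistCpActionPropsSketch {
    /** Properties assumed to mirror the createJob() calls, with the dynamic input format. */
    static Map<String, String> dynamicStrategyProps(int maxMaps) {
        Map<String, String> p = new LinkedHashMap<>();
        // Assumption: needed so the map-reduce action uses the new-API classes below
        p.put("mapred.mapper.new-api", "true");
        p.put("mapred.reducer.new-api", "true");
        // setInputFormatClass(...): this is where the copy strategy is chosen
        p.put("mapreduce.job.inputformat.class",
              "org.apache.hadoop.tools.mapred.lib.DynamicInputFormat");
        // setMapperClass(CopyMapper.class) and setNumReduceTasks(0)
        p.put("mapreduce.job.map.class", "org.apache.hadoop.tools.mapred.CopyMapper");
        p.put("mapreduce.job.reduces", "0");
        // setMapOutputKeyClass / setMapOutputValueClass
        p.put("mapreduce.map.output.key.class", "org.apache.hadoop.io.Text");
        p.put("mapreduce.map.output.value.class", "org.apache.hadoop.io.Text");
        // setOutputFormatClass(CopyOutputFormat.class)
        p.put("mapreduce.job.outputformat.class",
              "org.apache.hadoop.tools.mapred.CopyOutputFormat");
        // JobContext.MAP_SPECULATIVE and JobContext.NUM_MAPS
        p.put("mapreduce.map.speculative", "false");
        p.put("mapreduce.job.maps", String.valueOf(maxMaps));
        return p;
    }

    /** Renders the properties as <property> elements (values here need no XML escaping). */
    static String toPropertyXml(Map<String, String> props) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : props.entrySet()) {
            sb.append("<property>\n")
              .append("    <name>").append(e.getKey()).append("</name>\n")
              .append("    <value>").append(e.getValue()).append("</value>\n")
              .append("</property>\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // 20 matches maxMaps in the logged DistCpOptions
        System.out.print(toPropertyXml(dynamicStrategyProps(20)));
    }
}
```

Generating the fragment from one map keeps the property list in a single place while you work out which additional settings the DistCp source requires.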