I am using the code at [1] to launch a MapReduce job from Python. The problem is that I get the correct output data in the stderr field [3] instead of in the stdout field [2]. Why is the correct data going to stderr? Am I using Popen.communicate correctly? Is there a better way to launch a Java execution from Python (and not Jython)?
[1] The snippet I use to launch the job in Hadoop:
import shlex
import subprocess

command = "/home/xubuntu/Programs/hadoop/bin/hadoop jar /home/xubuntu/Programs/hadoop/medusa-java.jar mywordcount -Dfile.path=/home/xubuntu/Programs/medusa-2.0/temp/1443004585/job.attributes /input1 /output1"

def launch():  # this block runs inside a function in my code
    try:
        process = subprocess.Popen(shlex.split(command),
                                   stdout=subprocess.PIPE,
                                   stderr=subprocess.PIPE)
        out, err = process.communicate()
        print("Out %s" % out)
        print("Error %s" % err)
        if len(err) > 0:  # there is an exception
            # print("Going to launch exception")
            raise ValueError("Exception:\n" + err)
    except ValueError as e:
        return e.message
    return out
[2] Output in stdoutdata:
[2015-09-23 07:16:13,220: WARNING/Worker-17] Out My Setup
My get job name
My get job name
My get job name
org.apache.hadoop.mapreduce.lib.partition.HashPartitioner
---> Job 0: /input1, : /output1-1443006949
10.10.5.192
10.10.5.192:8032
[3] Output in the stderrdata field:
[2015-09-23 07:16:13,221: WARNING/Worker-17] Error 15/09/23 07:15:52 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/09/23 07:15:53 INFO client.RMProxy: Connecting to ResourceManager at /10.10.5.192:8032
15/09/23 07:15:54 WARN mapreduce.JobSubmitter: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
15/09/23 07:15:54 INFO input.FileInputFormat: Total input paths to process : 4
15/09/23 07:15:54 INFO mapreduce.JobSubmitter: number of splits:4
15/09/23 07:15:54 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1442999930174_0009
15/09/23 07:15:54 INFO impl.YarnClientImpl: Submitted application application_1442999930174_0009
15/09/23 07:15:54 INFO mapreduce.Job: The url to track the job: http://hadoop-coc-1:9046/proxy/application_1442999930174_0009/
15/09/23 07:15:54 INFO mapreduce.Job: Running job: job_1442999930174_0009
15/09/23 07:16:00 INFO mapreduce.Job: Job job_1442999930174_0009 running in uber mode : false
15/09/23 07:16:00 INFO mapreduce.Job: map 0% reduce 0%
15/09/23 07:16:13 INFO mapreduce.Job: map 100% reduce 0%
15/09/23 07:16:13 INFO mapreduce.Job: Job job_1442999930174_0009 completed successfully
15/09/23 07:16:13 INFO mapreduce.Job: Counters: 30
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=423900
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=472
HDFS: Number of bytes written=148
HDFS: Number of read operations=20
HDFS: Number of large read operations=0
HDFS: Number of write operations=8
Job Counters
Launched map tasks=4
Data-local map tasks=4
Total time spent by all maps in occupied slots (ms)=41232
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=41232
Total vcore-seconds taken by all map tasks=41232
Total megabyte-seconds taken by all map tasks=42221568
Map-Reduce Framework
Map input records=34
Map output records=34
Input split bytes=406
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=532
CPU time spent (ms)=1320
Physical memory (bytes) snapshot=245039104
Virtual memory (bytes) snapshot=1272741888
Total committed heap usage (bytes)=65273856
File Input Format Counters
Answer 0 (score: 1)
Hadoop (or rather Log4j) simply logs all [INFO] messages to stderr. From their configuration entry:

By default, Hadoop logs messages to Log4j. Log4j is configured via log4j.properties on the classpath. This file defines both what is logged and where. For applications, the default root logger is "INFO,console", which logs all messages at level INFO and above to the console's stderr. Servers log to the "INFO,DRFA", which logs to a file that is rolled daily. Log files are named $HADOOP_LOG_DIR/hadoop-$HADOOP_IDENT_STRING-<server>.log.
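Given that, a non-empty stderr does not necessarily mean the job failed; it is mostly Log4j chatter. On the Python side you could key the error check off the process's exit status instead of the length of stderr. A minimal sketch under that assumption (run_job and its command argument are placeholder names, not from your code):

import shlex
import subprocess

def run_job(command):
    # Hadoop's INFO-level logging lands on stderr, so capture it
    # separately and rely on the exit code to detect real failures.
    process = subprocess.Popen(shlex.split(command),
                               stdout=subprocess.PIPE,
                               stderr=subprocess.PIPE)
    out, err = process.communicate()
    if process.returncode != 0:  # non-zero exit code signals an actual failure
        raise ValueError("Hadoop job failed:\n" + err)
    return out  # the job's own stdout; the Log4j messages stay in err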
I have never tried redirecting the logs to stdout, so I can't really help there, but a promising answer from another user suggests:
// Answer by Rajkumar Singh
// To get your stdout and log messages on the console you can use the
// Apache Commons Logging framework in your mapper and reducer.
import java.io.IOException;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Concrete type parameters filled in here for illustration.
public class MyMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
    public static final Log log = LogFactory.getLog(MyMapper.class);

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Log to the task's stdout file
        System.out.println("Map key " + key);
        // Log to the syslog file
        log.info("Map key " + key);
        if (log.isDebugEnabled()) {
            log.debug("Map key " + key);
        }
        context.write(key, value);
    }
}
I suggest giving it a try.
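One caveat: in a distributed job, the System.out.println calls above end up in the per-task stdout file inside the container's log directory (browsable through the job's tracking URL), not in the stdout of the hadoop client process that your Python script captures.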