How do I get the Hadoop input file name from the mapper?

Asked: 2013-09-13 16:58:11

Tags: ruby hadoop mapper

Hadoop Streaming makes the input file name available to each map task through an environment variable.

Python:

os.environ["map.input.file"]

Java:

System.getenv("map.input.file")

What about Ruby?

mapper.rb
#!/usr/bin/env ruby

STDIN.each_line do |line|
  line.split.each do |word|
    word = word[/[a-zA-Z0-9]+/] # keep the alphanumeric part; nil if none
    next if word.nil?           # guard: gsub/join on nil would raise
    puts [word, 1].join("\t")
  end
end

puts ENV['map.input.file']

3 Answers:

Answer 0 (score: 0)

How about:

ENV['map.input.file']

Ruby also lets you assign to the ENV hash easily:

ENV['map.input.file'] = '/path/to/file'
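As a quick illustration (plain Ruby, no Hadoop required), ENV behaves like a Hash of strings, so reading, assigning, and falling back to a default all work as you would expect:

```ruby
# ENV maps string keys to string values; missing keys return nil.
ENV['map.input.file'] = '/path/to/file'    # assignment works like a Hash
puts ENV['map.input.file']                 # => /path/to/file
puts ENV.fetch('no.such.key', 'default')   # fetch with a fallback value
```

Note that assigning the value yourself, as in the answer above, only sets it for your own process; it does not make Hadoop tell you which file the task is reading.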

Answer 1 (score: 0)

All JobConf variables are exported as environment variables by hadoop-streaming. The variable names are made "safe" by converting any character other than 0-9, A-Z, a-z to _.

所以map.input.file => map_input_file

Try: puts ENV['map_input_file']
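The renaming rule described above can be sketched in plain Ruby (the helper name `env_safe` is mine, not part of any Hadoop API):

```ruby
# Hadoop Streaming exports JobConf properties as environment variables,
# replacing every character outside [0-9A-Za-z] with an underscore.
def env_safe(name)
  name.gsub(/[^0-9A-Za-z]/, '_')
end

puts env_safe('map.input.file')           # => map_input_file
puts env_safe('mapreduce.task.partition') # => mapreduce_task_partition

# So inside a streaming mapper you would read:
#   ENV[env_safe('map.input.file')]   i.e.  ENV['map_input_file']
```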

Answer 2 (score: 0)

Using the OP's input, I tried the mapper:

hadoop fs -rmr /user/itsjeevs/wc && 
hadoop jar $STRMJAR  -files /home/jejoseph/wc_mapper.py,/home/jejoseph/wc_reducer.py \
    -mapper wc_mapper.py  \
    -reducer wc_reducer.py \
    -numReduceTasks 10  \
    -input "/data/*"  \
    -output wc

with a standard wordcount reducer. It failed with the errors below:

 16/03/10 15:21:32 INFO mapreduce.Job: Task Id : attempt_1455931799889_822384_m_000043_0, Status : FAILED
Error: java.io.IOException: Stream closed
    at java.lang.ProcessBuilder$NullOutputStream.write(ProcessBuilder.java:434)
    at java.io.OutputStream.write(OutputStream.java:116)
    at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
    at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
    at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
    at java.io.DataOutputStream.write(DataOutputStream.java:107)
    at org.apache.hadoop.streaming.io.TextInputWriter.writeUTF8(TextInputWriter.java:72)
    at org.apache.hadoop.streaming.io.TextInputWriter.writeValue(TextInputWriter.java:51)
    at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:106)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
    at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:450)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)

16/03/10 15:21:32 INFO mapreduce.Job: Task Id : attempt_1455931799889_822384_m_000077_0, Status : FAILED
Error: java.io.IOException: Broken pipe
    at java.io.FileOutputStream.writeBytes(Native Method)
    at java.io.FileOutputStream.write(FileOutputStream.java:345)
    at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
    at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
    at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
    at java.io.DataOutputStream.write(DataOutputStream.java:107)
    at org.apache.hadoop.streaming.io.TextInputWriter.writeUTF8(TextInputWriter.java:72)
    at org.apache.hadoop.streaming.io.TextInputWriter.writeValue(TextInputWriter.java:51)
    at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:106)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
    at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:450)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)


Not sure what is going on.