I'm using Hadoop 0.20.2 with the old API. I'm trying to send blocks of data to the mapper rather than one line at a time (my records span multiple lines). I tried using NLineInputFormat to set how many lines the mapper gets at once, but the mapper still receives only one line at a time. I'm fairly sure I have the right code. Is there any reason this doesn't work?
For reference:
JobConf conf = new JobConf(WordCount.class);
conf.setInt("mapred.line.input.format.linespermap", 2);
conf.setInputFormat(NLineInputFormat.class);
Basically, I'm using the example code from http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html#Example%3A+WordCount+v1.0 and only changed the TextInputFormat.
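So the driver is essentially the tutorial's WordCount v1.0 with those two lines swapped in; roughly this (Map and Reduce are the tutorial's classes):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.mapred.lib.NLineInputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(Map.class);        // Map and Reduce as in the tutorial
        conf.setReducerClass(Reduce.class);
        // The two changes: NLineInputFormat instead of TextInputFormat,
        // and 2 lines per map.
        conf.setInt("mapred.line.input.format.linespermap", 2);
        conf.setInputFormat(NLineInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}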
Thanks in advance
Answer 0 (score: 4)
NLineInputFormat is designed to ensure that the mappers all receive the same number of input records (apart from the last split of each file).
So by setting that property to 2, each mapper should receive (at most) 2 input pairs, delivered one at a time, rather than 2 input lines combined into a single record (which is what I think you're looking for).
You should be able to confirm this by looking at the counters for each map task: "Map input records" should report 2 for most of the mappers.
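To see the distinction in code, here is a minimal sketch (old API; the class name is made up) of a mapper that counts its map() invocations. With linespermap set to 2, each task should report 2 calls, and each call still carries a single line:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Sketch: each map() call still receives ONE line as its value;
// NLineInputFormat only caps how many such calls each task gets.
public class RecordCountingMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, LongWritable> {
    private int recordsSeen = 0;

    @Override
    public void map(LongWritable key, Text value,
            OutputCollector<Text, LongWritable> output, Reporter reporter)
            throws IOException {
        recordsSeen++;  // increments once per line, even with linespermap = 2
        reporter.incrCounter("debug", "map input records seen", 1);
    }

    @Override
    public void close() {
        System.out.println("map() was called " + recordsSeen + " times in this task");
    }
}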
Answer 1 (score: 0)
I recently solved this by creating my own InputFormat that overrides NLineInputFormat and implements a custom MultiLineRecordReader instead of the default LineReader.
I chose to extend NLineInputFormat because I wanted the guarantee of exactly N lines per split.
The record reader is taken almost as-is from http://bigdatacircus.com/2012/08/01/wordcount-with-custom-record-reader-of-textinputformat/. The only things I modified are the maxLineLength property, which now uses the new API, and the NLINESTOPROCESS value, which is read from NLineInputFormat's getNumLinesPerSplit() instead of being hardcoded (more flexible).
Here is the result:
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.util.LineReader;

public class MultiLineInputFormat extends NLineInputFormat {

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit genericSplit, TaskAttemptContext context) {
        context.setStatus(genericSplit.toString());
        return new MultiLineRecordReader();
    }

    public static class MultiLineRecordReader extends RecordReader<LongWritable, Text> {
        private int NLINESTOPROCESS;   // number of lines concatenated into each record
        private LineReader in;
        private LongWritable key;
        private Text value = new Text();
        private long start = 0;
        private long end = 0;
        private long pos = 0;
        private int maxLineLength;

        @Override
        public void close() throws IOException {
            if (in != null) {
                in.close();
            }
        }

        @Override
        public LongWritable getCurrentKey() throws IOException, InterruptedException {
            return key;
        }

        @Override
        public Text getCurrentValue() throws IOException, InterruptedException {
            return value;
        }

        @Override
        public float getProgress() throws IOException, InterruptedException {
            if (start == end) {
                return 0.0f;
            } else {
                return Math.min(1.0f, (pos - start) / (float) (end - start));
            }
        }

        @Override
        public void initialize(InputSplit genericSplit, TaskAttemptContext context) throws IOException, InterruptedException {
            NLINESTOPROCESS = getNumLinesPerSplit(context);
            FileSplit split = (FileSplit) genericSplit;
            final Path file = split.getPath();
            Configuration conf = context.getConfiguration();
            this.maxLineLength = conf.getInt("mapreduce.input.linerecordreader.line.maxlength", Integer.MAX_VALUE);
            FileSystem fs = file.getFileSystem(conf);
            start = split.getStart();
            end = start + split.getLength();
            boolean skipFirstLine = false;
            FSDataInputStream filein = fs.open(split.getPath());
            if (start != 0) {
                // Not at the start of the file: back up one byte and discard the
                // partial first line, which belongs to the previous split.
                skipFirstLine = true;
                --start;
                filein.seek(start);
            }
            in = new LineReader(filein, conf);
            if (skipFirstLine) {
                start += in.readLine(new Text(), 0, (int) Math.min((long) Integer.MAX_VALUE, end - start));
            }
            this.pos = start;
        }

        @Override
        public boolean nextKeyValue() throws IOException, InterruptedException {
            if (key == null) {
                key = new LongWritable();
            }
            key.set(pos);
            if (value == null) {
                value = new Text();
            }
            value.clear();
            final Text endline = new Text("\n");
            int newSize = 0;
            // Concatenate up to NLINESTOPROCESS lines into a single value.
            for (int i = 0; i < NLINESTOPROCESS; i++) {
                Text v = new Text();
                while (pos < end) {
                    newSize = in.readLine(v, maxLineLength,
                            Math.max((int) Math.min(Integer.MAX_VALUE, end - pos), maxLineLength));
                    if (newSize == 0) {
                        break;   // end of stream, nothing was read
                    }
                    pos += newSize;
                    // readLine strips the line terminator, so re-append one.
                    value.append(v.getBytes(), 0, v.getLength());
                    value.append(endline.getBytes(), 0, endline.getLength());
                    if (newSize < maxLineLength) {
                        break;   // a complete line was consumed
                    }
                }
            }
            if (newSize == 0) {
                key = null;
                value = null;
                return false;
            }
            return true;
        }
    }
}
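To use it, the job setup would look something like this (a sketch; the job name and input path are placeholders, and it assumes a Hadoop release that ships the new-API NLineInputFormat). Besides the imports above, it needs org.apache.hadoop.mapreduce.Job and org.apache.hadoop.mapreduce.lib.input.FileInputFormat:

// Sketch: wiring the custom input format into a new-API job.
Configuration conf = new Configuration();
Job job = new Job(conf, "multiline-example");          // job name is a placeholder
job.setJarByClass(MultiLineInputFormat.class);
job.setInputFormatClass(MultiLineInputFormat.class);
NLineInputFormat.setNumLinesPerSplit(job, 2);          // read back by getNumLinesPerSplit() above
FileInputFormat.addInputPath(job, new Path("input"));  // path is a placeholder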