Question

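I wrote a simple map task for Hadoop 0.20.2. The input dataset consists of 44 files, each about 3-5 MB. Every line of every file has the format int,int. The input format is the default TextInputFormat, and the mapper's job is to parse the input Text into integers.

After the job ran, the Hadoop framework statistics showed that the number of map input records was only 44. I tried debugging and found that the records passed to the map method are just the first line of each file.

Does anyone know what the problem is and where I can find a solution?

Thanks in advance.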

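Edit 1

The input data was generated by a different map-reduce job whose output format is TextOutputFormat<NullWritable, IntXInt>. The toString() method of IntXInt should produce a string of the form int,int.

Edit 2

My mapper looks like this:

static class MyMapper extends MapReduceBase implements Mapper<LongWritable, Text, IntWritable, IntWritable> {
    public void map(LongWritable key, Text value, OutputCollector<IntWritable, IntWritable> output, Reporter reporter) throws IOException {
        // Each input line is expected to look like "int,int"
        String[] s = value.toString().split(",");
        IntXInt x = new IntXInt(s[0], s[1]);
        output.collect(x.firstInt(), x.secondInt());
    }
}

Edit 3

I just checked: the mapper really does read only 1 line per file, rather than receiving the whole file as a single Text value.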

Answer 1

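The InputFormat defines how data is read from a file into the Mapper instances. The default TextInputFormat reads text files line by line. The key it emits for each record is the byte offset of the line that was read (as a LongWritable), and the value is the content of the line up to the terminating '\n' character (as a Text object). If you have multi-line records, each separated by a $ character, you should write your own InputFormat that breaks the file into records split on that character.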
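To make the key/value description above concrete, here is a minimal plain-Java sketch that splits a byte buffer into (byte offset, record text) pairs on a chosen delimiter. It is only an illustration and does not use any Hadoop classes: RecordSplitSketch, Record, and split are hypothetical names, and the real TextInputFormat additionally deals with input splits and other line terminators, which this sketch ignores.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class RecordSplitSketch {
    /** One record: the byte offset where it starts (the key) and its text (the value). */
    static final class Record {
        final long offset;
        final String value;

        Record(long offset, String value) {
            this.offset = offset;
            this.value = value;
        }
    }

    /** Splits data into records that end at each delimiter byte; the delimiter is dropped. */
    static List<Record> split(byte[] data, byte delimiter) {
        List<Record> records = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < data.length; i++) {
            if (data[i] == delimiter) {
                records.add(new Record(start,
                        new String(data, start, i - start, StandardCharsets.UTF_8)));
                start = i + 1;
            }
        }
        // Keep a trailing record that has no final delimiter.
        if (start < data.length) {
            records.add(new Record(start,
                    new String(data, start, data.length - start, StandardCharsets.UTF_8)));
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] lines = "1,2\n30,40\n5,6\n".getBytes(StandardCharsets.UTF_8);
        for (Record r : split(lines, (byte) '\n')) {
            System.out.println(r.offset + "\t" + r.value);
        }
    }
}
```

Passing (byte) '$' instead of (byte) '\n' gives the record boundaries that a custom, $-delimited format would need. If a file contained no delimiter at all, this sketch would return the whole buffer as a single record.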

Answer 2

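I suspect your mapper is receiving all of the text as a single input and printing the output. Could you show your Mapper class declaration and your map function declaration? That is: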

static class MyMapper extends Mapper <LongWritable,Text,Text,Text>{ 
    public void map (LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        //do your mapping here

    }
}

I am wondering whether something is different in that line.
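For the "do your mapping here" part, the per-line work described in the question is parsing an int,int line. A minimal sketch of that step in plain Java, without Hadoop types, might look like the following. IntPairParseSketch and parsePair are hypothetical names, and this is not the asker's IntXInt class.

```java
public class IntPairParseSketch {
    /** Parses a line of the form "int,int" into a two-element array. */
    static int[] parsePair(String line) {
        // The -1 limit keeps trailing empty fields, so "1,2," is rejected below.
        String[] parts = line.trim().split(",", -1);
        if (parts.length != 2) {
            throw new IllegalArgumentException("expected int,int but got: " + line);
        }
        return new int[] {
            Integer.parseInt(parts[0].trim()),
            Integer.parseInt(parts[1].trim())
        };
    }

    public static void main(String[] args) {
        int[] pair = parsePair("12,34");
        System.out.println(pair[0] + " " + pair[1]); // prints "12 34"
    }
}
```

Inside a real mapper, the line would come from value.toString(), and the two integers would be wrapped in IntWritable before being emitted.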

Hadoop TextInputFormat reads only one line per file
