I'm working with Hadoop MapReduce and I have a question.
Currently, my mapper's input KV type is LongWritable, LongWritable, and its output KV type is also LongWritable, LongWritable. The InputFileFormat is SequenceFileInputFormat.
Basically, what I want to do is convert a txt file into SequenceFileFormat so that I can use it with my mapper.
The input file looks like this:
1\t2 (key = 1, value = 2)
2\t3 (key = 2, value = 3)
and so on...
I looked at this post, How to convert .txt file to Hadoop's sequence file format, but it appears that TextInputFormat only supports Key = LongWritable and Value = Text.
Is there any way to read the txt file and produce a sequence file with KV = LongWritable, LongWritable?
Answer 0 (score: 7)
Sure, basically the same way I described in the other thread you linked. But you have to implement your own Mapper.
A quick sketch for you:
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class LongLongMapper extends
    Mapper<LongWritable, Text, LongWritable, LongWritable> {

  @Override
  protected void map(LongWritable key, Text value,
      Mapper<LongWritable, Text, LongWritable, LongWritable>.Context context)
      throws IOException, InterruptedException {
    // assuming that your line contains key and value separated by \t
    String[] split = value.toString().split("\t");
    context.write(new LongWritable(Long.valueOf(split[0])),
        new LongWritable(Long.valueOf(split[1])));
  }

  public static void main(String[] args) throws IOException,
      InterruptedException, ClassNotFoundException {
    Configuration conf = new Configuration();
    Job job = new Job(conf);
    job.setJobName("Convert Text");
    job.setJarByClass(LongLongMapper.class);
    // use our own mapper, not the identity Mapper
    job.setMapperClass(LongLongMapper.class);
    job.setReducerClass(Reducer.class);
    // increase if you need sorting or a special number of files
    job.setNumReduceTasks(0);
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(LongWritable.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    job.setInputFormatClass(TextInputFormat.class);
    FileInputFormat.addInputPath(job, new Path("/input"));
    FileOutputFormat.setOutputPath(job, new Path("/output"));
    // submit and wait for completion
    job.waitForCompletion(true);
  }
}
Each call to the mapper's map function receives one line of the input as its value, so we just split it on the delimiter (tab) and parse each part into a long.
That's it.
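The parsing step itself can be exercised outside Hadoop. A minimal sketch (the class and method names here are illustrative, not part of the Hadoop API):

```java
public class ParseDemo {
  // Splits a tab-separated line into its numeric key and value,
  // mirroring what the mapper does with each input line.
  static long[] parseLine(String line) {
    String[] split = line.split("\t");
    return new long[] { Long.parseLong(split[0]), Long.parseLong(split[1]) };
  }

  public static void main(String[] args) {
    long[] kv = parseLine("1\t2");
    System.out.println("key=" + kv[0] + " value=" + kv[1]); // prints key=1 value=2
  }
}
```

Note that Long.parseLong will throw a NumberFormatException on malformed lines, so in a real job you may want to validate or skip bad records.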