How can I print only some tokens of a line in MapReduce?

Time: 2018-10-10 09:31:45

Tags: java dictionary hadoop split stringtokenizer

I am writing a map function. I have a text file like this:

364.2   366.6   365.2   0   0   1   10421
364.2   366.6   365.2   0   0   1   10422

I want to output columns 1 and 3 only. This is my code, but it shows all of the lines.

public static class SumMap extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text str = new Text();

    @Override
    protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        StringTokenizer lineIter = new StringTokenizer(value.toString(), "\\r?\\n");
        while (lineIter.hasMoreTokens()) {
            StringTokenizer tokenIter = new StringTokenizer(lineIter.nextToken(), "\\s+");
            while (tokenIter.hasMoreTokens()) {
                String v1 = tokenIter.nextToken();
                String v2 = tokenIter.nextToken();
                String c1 = tokenIter.nextToken();
                String c2 = tokenIter.nextToken();
                str.set(v1+c1);
                context.write(str, one);
            }

        }
    }
}

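In this code, the input should first be split into lines ("\\r?\\n"), and then each line should be split into its numbers, strings, or other tokens by ("\\s+"). Finally, v1+c1 is written out. How should I change my code?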

2 Answers:

Answer 0: (score: 0)

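The problem is a mismatch between the number of tokens on a line and the number of tokens you read. Each line has 7 columns, but your inner while loop reads only 4 tokens per iteration. What you need to do is consume all the tokens of a line in a single pass. Since you only need columns 1 and 3, retrieve those and store them separately.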

public static class SumMap extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text str = new Text();

    @Override
    protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        // With TextInputFormat, value already holds exactly one line, so no line splitting is needed.
        // StringTokenizer takes a set of delimiter characters, not a regex;
        // the one-argument constructor splits on whitespace by default.
        StringTokenizer tokenIter = new StringTokenizer(value.toString());
        // Skip lines that do not have all 7 columns, so nextToken() cannot run out of tokens.
        if (tokenIter.countTokens() == 7) {
            String c1 = tokenIter.nextToken();
            tokenIter.nextToken(); // column 2 is not needed
            String c3 = tokenIter.nextToken();
            str.set(c1 + c3);
            context.write(str, one);
        }
    }
}

The main method:

    public static void main(String[] args) throws FileNotFoundException, IOException, InterruptedException, ClassNotFoundException {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "sum");
        job.setJarByClass(SumMR.class);
        job.setMapperClass(SumMap.class);
//        job.setCombinerClass(IntSumReducer.class);
//        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        TextInputFormat.addInputPath(job, new Path(args[1]));
        FileOutputFormat.setOutputPath(job, new Path(args[2]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }

This is the modified code. Let me know if there are any problems!
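One detail worth checking in the question's code is the delimiter argument. StringTokenizer treats its second argument as a set of individual delimiter characters, not as a regular expression, so "\\s+" splits on the characters backslash, 's' and '+', rather than on whitespace. The small standalone sketch below (the class and helper names are made up for illustration, and it does not need Hadoop) compares that delimiter with the default whitespace delimiters on one sample line from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class TokenizerDelimiterDemo {

    // Hypothetical helper: collects every token StringTokenizer produces.
    // Passing null for delims selects the default whitespace delimiters.
    static List<String> tokens(String line, String delims) {
        StringTokenizer st = (delims == null)
                ? new StringTokenizer(line)
                : new StringTokenizer(line, delims);
        List<String> out = new ArrayList<>();
        while (st.hasMoreTokens()) {
            out.add(st.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "364.2   366.6   365.2   0   0   1   10421";

        // "\\s+" is read as the three delimiter characters '\', 's' and '+'.
        // None of them occurs in the line, so the whole line is one token.
        System.out.println(tokens(line, "\\s+").size()); // prints 1

        // The default delimiters are whitespace, which gives the 7 columns.
        System.out.println(tokens(line, null).size());   // prints 7
    }
}
```

If that is what happens in the mapper, the first nextToken() call returns the entire line and the second one has nothing left to return, which would be consistent with the whole line showing up in the output.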

Answer 1: (score: 0)

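If you use TextInputFormat, the map key is the position of the line in the file (strictly speaking, its byte offset rather than a line number) and the value is the content of that line. You do not need to split the input into lines yourself. Just split each line into columns: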

@Override
protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {
    String[] vals = value.toString().split("\\s+");
    if (vals.length == 7) {
        context.write(new Text(vals[0] + vals[2]), one);
    }

}
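To try the splitting logic outside of Hadoop, it can be pulled into a plain method, roughly as below. This is only a sketch: the class name ColumnPicker and the method pickFirstAndThird are invented for illustration, and the trim() call is an extra assumption added so that a line with leading whitespace does not produce an empty first element from split.

```java
import java.util.Optional;

public class ColumnPicker {

    // Hypothetical helper mirroring the map() body above: returns columns 1 and 3
    // concatenated, or an empty Optional when the line does not have exactly 7 columns.
    static Optional<String> pickFirstAndThird(String line) {
        // String.split takes a regex, so "\\s+" really means "one or more whitespace characters" here.
        String[] vals = line.trim().split("\\s+");
        if (vals.length != 7) {
            return Optional.empty();
        }
        return Optional.of(vals[0] + vals[2]);
    }

    public static void main(String[] args) {
        String good = "364.2   366.6   365.2   0   0   1   10421";
        String bad = "364.2   366.6";

        System.out.println(pickFirstAndThird(good).orElse("<skipped>")); // prints 364.2365.2
        System.out.println(pickFirstAndThird(bad).orElse("<skipped>"));  // prints <skipped>
    }
}
```

Note that, as in the answer's code, the two columns are joined without a separator; if they need to stay distinguishable in the output, a tab or space between vals[0] and vals[2] may be preferable.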