Question

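Every day, other people drop thousands of files into a directory. Each file is roughly 400MB to 1GB in size.

I want to count the total number of lines across all files in the directory.

I plan to use MapReduce, as shown below.

Mapper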

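public static class LineMapper
        extends Mapper<Object, Text, Text, IntWritable>{

    private final static IntWritable one = new IntWritable(1);
    // context.write needs a Text key here, not a String literal
    private final static Text staticKey = new Text("static_key");

    public void map(Object key, Text value, Context context
    ) throws IOException, InterruptedException {

        // emit one record per input line under the single shared key
        context.write(staticKey, one);
    }
}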

Reducer

public static class IntSumReducer
        extends Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
    ) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}

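Thinking it over, though, I suspect the reducer node will be overwhelmed, because there is only one key and every record goes to the same reducer.

Is there a way to avoid this?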

Answer 1

If you want to use MapReduce for this, your best option is a counter. Change your mapper to the following and set the number of reducers to 0.

public static class LineMapper extends Mapper<Object, Text, Text, IntWritable> {

    enum MyCounters {
        TOTAL_COUNT;
    }

    @Override
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        // Called once per input line; nothing is written, so there is no shuffle and no reducer.
        context.getCounter(MyCounters.TOTAL_COUNT).increment(1L);
    }
}
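For completeness, here is a rough sketch of a driver that could go with the counter mapper. It is not part of the original answer: the class name `LineCountDriver`, the helper `buildJob`, and the job name are made up for illustration, and it assumes the Hadoop MapReduce client libraries are on the classpath and that the input directory is passed as the first argument. It sets the reducer count to 0, discards map output with `NullOutputFormat`, and reads the counter once the job has finished.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

// Hypothetical driver class; the name is an assumption for this sketch.
public class LineCountDriver {

    // Same counter-based mapper as in the answer, nested so the file is self-contained.
    public static class LineMapper extends Mapper<Object, Text, Text, IntWritable> {
        enum MyCounters { TOTAL_COUNT }

        @Override
        public void map(Object key, Text value, Context context) {
            // The default TextInputFormat delivers one record per line.
            context.getCounter(MyCounters.TOTAL_COUNT).increment(1L);
        }
    }

    // Hypothetical helper: configures a map-only job over inputDir.
    public static Job buildJob(Configuration conf, String inputDir) throws IOException {
        Job job = Job.getInstance(conf, "line count");
        job.setJarByClass(LineCountDriver.class);
        job.setMapperClass(LineMapper.class);
        job.setNumReduceTasks(0);                          // map-only: no shuffle, no reducer
        job.setOutputFormatClass(NullOutputFormat.class);  // the mapper emits nothing to store
        FileInputFormat.addInputPath(job, new Path(inputDir));
        return job;
    }

    public static void main(String[] args) throws Exception {
        Job job = buildJob(new Configuration(), args[0]);
        if (!job.waitForCompletion(true)) {
            System.exit(1);
        }
        // The framework sums each task's counter value into a job-level total.
        long lines = job.getCounters()
                .findCounter(LineMapper.MyCounters.TOTAL_COUNT)
                .getValue();
        System.out.println("Total lines: " + lines);
    }
}
```

Because each map task keeps its own tally and the framework adds the tallies together, no single node has to receive one record per input line.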

Answer 2

You can do this more quickly with Hive. One possible approach is outlined below:

Create an HDFS directory to hold the data

$ hadoop fs -mkdir /hive-data
$ hadoop fs -mkdir /hive-data/linecount

Create the Hive table

hive> CREATE EXTERNAL TABLE linecount
(
  line string
)
LOCATION
  'hdfs:///hive-data/linecount';

Load the data files into HDFS

$ hadoop fs -put a.txt hdfs:///hive-data/linecount
$ hadoop fs -put b.txt hdfs:///hive-data/linecount
$ hadoop fs -put c.txt hdfs:///hive-data/linecount

Query the count through Hive

hive> select count(*) from linecount;
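If the count has to run every day from a Java program rather than from the Hive shell, one option is to issue the same query over JDBC. The sketch below is an assumption, not part of the original answer: it presumes a HiveServer2 instance reachable at the placeholder URL `jdbc:hive2://localhost:10000/default`, the Hive JDBC driver on the runtime classpath, and the `linecount` table created above. The class and method names are hypothetical.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical helper class; names are made up for this sketch.
public class HiveLineCount {

    // Builds the same query as in the answer. The table name is concatenated
    // directly, so it must come from trusted code, never from user input.
    public static String countQuery(String table) {
        return "SELECT COUNT(*) FROM " + table;
    }

    // Runs the count over an already-open connection and returns the single result.
    public static long countLines(Connection conn, String table) throws SQLException {
        try (Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(countQuery(table))) {
            if (!rs.next()) {
                throw new SQLException("COUNT(*) returned no rows");
            }
            return rs.getLong(1);
        }
    }

    public static void main(String[] args) throws SQLException {
        // Placeholder URL; host, port and database are assumptions.
        String url = args.length > 0 ? args[0] : "jdbc:hive2://localhost:10000/default";
        try (Connection conn = DriverManager.getConnection(url)) {
            System.out.println("Total lines: " + countLines(conn, "linecount"));
        }
    }
}
```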
