Best way to sort data in the Mapper when processing a large dataset with Hadoop

Asked: 2018-07-12 07:33:41

Tags: java hadoop mapreduce hadoop2

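I am trying to find the top ten movies in a huge dataset using Hadoop, with the MapReduce approach. I currently sort the data with a local collection, a TreeMap, but I have read that this approach is not recommended. What is the proper way to sort data in the Mapper when processing large amounts of data? My Mapper and Reducer code follows.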

Mapper code

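import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class HighestViewedMoviesMapper extends Mapper<Object, Text, NullWritable, Text> {
    // Local top 10 for this map task, keyed by view count (ascending order).
    // Caveat: records with the same view count overwrite each other here.
    private TreeMap<Integer, Text> highestView = new TreeMap<Integer, Text>();

    @Override
    public void map( Object key, Text values, Context context ) throws IOException, InterruptedException {
        // Each input line is expected to look like "<movie>::<views>".
        String data = values.toString();
        String[] field = data.split( "::", -1 );
        if ( field.length == 2 ) {
            int views = Integer.parseInt( field[1] );
            highestView.put( views, new Text( field[0] + "::" + field[1] ) );
            // Evict the smallest view count once more than 10 entries are held.
            if ( highestView.size() > 10 ) {
                highestView.remove( highestView.firstKey() );
            }
        }
    }

    @Override
    protected void cleanup( Context context ) throws IOException, InterruptedException {
        // Emit the local top 10 once, after every record of the split has been seen.
        for ( Map.Entry<Integer, Text> entry : highestView.entrySet() ) {
            context.write( NullWritable.get(), entry.getValue() );
        }
    }
}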

Reducer code

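import java.io.IOException;
import java.util.TreeMap;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class HighestViewMoviesReducer extends Reducer<NullWritable, Text, NullWritable, Text> {
    // Global top 10, merged from the local top 10 of every map task.
    private TreeMap<Integer, Text> highestView = new TreeMap<Integer, Text>();

    @Override
    public void reduce( NullWritable key, Iterable<Text> values, Context context )
        throws IOException, InterruptedException {
        for ( Text value : values ) {
            String data = value.toString();
            String[] field = data.split( "::", -1 );
            if ( field.length == 2 ) {
                // Copy the Text: Hadoop reuses the same instance across iterations.
                highestView.put( Integer.parseInt( field[1] ), new Text( value ) );
                if ( highestView.size() > 10 ) {
                    highestView.remove( highestView.firstKey() );
                }
            }
        }
        // Write the result from the highest view count to the lowest.
        for ( Text t : highestView.descendingMap().values() ) {
            context.write( NullWritable.get(), t );
        }
    }
}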

Can someone tell me the best way to do this? Thanks in advance.
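For reference, here is a minimal plain-Java sketch of the bounded top-N idea used in the Mapper and Reducer above, built on a PriorityQueue instead of a TreeMap keyed by view count, so that movies sharing a view count are not overwritten. The class TopN, its nested Entry type, and the methods offer and descending are hypothetical names chosen for illustration and are not Hadoop APIs. The sketch assumes records of the form movie::views, as in the code above, and it is not a claim about what the recommended Hadoop approach is.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical helper: keeps only the N entries with the highest view counts. */
public class TopN {
    /** One record: a movie identifier and its view count. */
    static final class Entry {
        final String movie;
        final int views;

        Entry(String movie, int views) {
            this.movie = movie;
            this.views = views;
        }

        @Override
        public String toString() {
            return movie + "::" + views;
        }
    }

    // Ascending by views; ties ordered by movie so the result is deterministic.
    private static final Comparator<Entry> BY_VIEWS =
        Comparator.comparingInt((Entry e) -> e.views).thenComparing(e -> e.movie);

    private final int limit;
    // Min-heap: the smallest retained entry sits at the head and is evicted first.
    private final PriorityQueue<Entry> heap = new PriorityQueue<>(BY_VIEWS);

    public TopN(int limit) {
        this.limit = limit;
    }

    /** Adds a record, then evicts the smallest one if more than N are held. */
    public void offer(String movie, int views) {
        heap.add(new Entry(movie, views));
        if (heap.size() > limit) {
            heap.poll();
        }
    }

    /** Returns the retained records as "movie::views", highest view count first. */
    public List<String> descending() {
        List<Entry> sorted = new ArrayList<>(heap);
        sorted.sort(BY_VIEWS.reversed());
        List<String> out = new ArrayList<>();
        for (Entry e : sorted) {
            out.add(e.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        TopN top = new TopN(2);
        top.offer("A", 5);
        top.offer("B", 5); // same view count as "A"; both are kept
        top.offer("C", 1);
        System.out.println(top.descending());
    }
}
```

In a job shaped like the one above, each map task could hold one such collector and emit its contents in cleanup, and the single reducer could merge the per-task results the same way. Memory stays bounded at N entries per task regardless of input size.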

0 answers:

No answers