My mapper emits:
Mapper: KEY, VALUE(Timestamp, someOtherAttributes)
My reducer receives:
Reducer: KEY, Iterable<VALUE(Timestamp, someOtherAttributes)>
I would like the Iterable<VALUE(Timestamp, someOtherAttributes)> to arrive sorted by the Timestamp attribute. Is it possible to implement this?
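I want to avoid sorting manually in the Reducer code. As described at http://cornercases.wordpress.com/2011/08/18/hadoop-object-reuse-pitfall-all-my-reducer-values-are-the-same/, I would have to "deep copy" every object from the Iterable, which causes a huge memory overhead. :(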
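Answer 0 (score: 6)
This is relatively easy: you need to write a comparator class for your VALUE class.
Take a close look at http://vangjee.wordpress.com/2012/03/20/secondary-sorting-aka-sorting-values-in-hadoops-mapreduce-programming-paradigm/, especially the section on the solution for secondary sorting.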
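To make the secondary-sort idea more concrete, here is a minimal plain-Java sketch that only simulates what the shuffle phase does; it is not code from the linked article and it does not use the Hadoop API. The names CompositeKey, Record, SORT, GROUP and shuffle are hypothetical. The assumption is that the timestamp is moved into a composite key, that one comparator orders by natural key and then timestamp, and that a second comparator groups by natural key only. In a real job those roles would be registered through Job.setSortComparatorClass and Job.setGroupingComparatorClass, together with a partitioner on the natural key.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SecondarySortSketch {
    // Hypothetical composite key: the natural key plus the timestamp taken from the value.
    static final class CompositeKey {
        final String naturalKey;
        final long timestamp;
        CompositeKey(String naturalKey, long timestamp) {
            this.naturalKey = naturalKey;
            this.timestamp = timestamp;
        }
    }

    // One simulated map output record: composite key plus the remaining attributes.
    static final class Record {
        final CompositeKey key;
        final String attributes;
        Record(String naturalKey, long timestamp, String attributes) {
            this.key = new CompositeKey(naturalKey, timestamp);
            this.attributes = attributes;
        }
    }

    // Sort comparator role: natural key first, then timestamp ascending.
    static final Comparator<CompositeKey> SORT = (a, b) -> {
        int c = a.naturalKey.compareTo(b.naturalKey);
        return c != 0 ? c : Long.compare(a.timestamp, b.timestamp);
    };

    // Grouping comparator role: keys with the same natural key share one reduce call.
    static final Comparator<CompositeKey> GROUP =
            (a, b) -> a.naturalKey.compareTo(b.naturalKey);

    // Simulates the shuffle: sort all records with SORT, then cut groups with GROUP.
    static Map<String, List<String>> shuffle(List<Record> mapOutput) {
        List<Record> sorted = new ArrayList<>(mapOutput);
        sorted.sort((r1, r2) -> SORT.compare(r1.key, r2.key));
        Map<String, List<String>> groups = new LinkedHashMap<>();
        CompositeKey current = null;
        for (Record r : sorted) {
            if (current == null || GROUP.compare(current, r.key) != 0) {
                current = r.key;
                groups.put(r.key.naturalKey, new ArrayList<>());
            }
            groups.get(r.key.naturalKey).add(r.key.timestamp + ":" + r.attributes);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Record> mapOutput = new ArrayList<>();
        mapOutput.add(new Record("a", 30L, "x"));
        mapOutput.add(new Record("b", 10L, "y"));
        mapOutput.add(new Record("a", 10L, "z"));
        mapOutput.add(new Record("a", 20L, "w"));
        // prints {a=[10:z, 20:w, 30:x], b=[10:y]}
        System.out.println(shuffle(mapOutput));
    }
}
```

The point of the design is that the values reach each group already ordered by timestamp, so the reducer can stream through them without buffering or deep-copying anything.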
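Answer 1 (score: -1)
You need to write a comparator for your VALUE class. The reducer below copies the values into a list of strings and sorts them in memory by the leading timestamp field: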
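// Imports needed by the enclosing Reducer<Text, Text, Text, Text> class:
import java.io.IOException;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.TimeZone;
import org.apache.hadoop.io.Text;

@Override
protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
    // Each value is expected to look like "yyyy-MM-dd HH:mm:ss,otherAttributes".
    final SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
    sdf.setTimeZone(TimeZone.getTimeZone("UTC"));
    // Hadoop reuses the Text instance between iterations, so copy each value out as a String.
    List<String> list = new ArrayList<String>();
    for (Text val : values) {
        list.add(val.toString());
    }
    Collections.sort(list, new Comparator<String>() {
        @Override
        public int compare(String s1, String s2) {
            // Compare as long: casting getTime() to int and subtracting can overflow.
            return Long.compare(toMillis(s1), toMillis(s2));
        }
        private long toMillis(String s) {
            try {
                return sdf.parse(s.split(",")[0]).getTime();
            } catch (ParseException e) {
                e.printStackTrace();
                return 0L; // unparseable timestamps are treated as the epoch
            }
        }
    });
    for (String s : list) {
        context.write(key, new Text(s));
    }
}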
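One detail worth checking in any comparator of this kind is how the epoch-millisecond values are compared. The following standalone sketch (the class and method names are hypothetical, chosen only for illustration) shows that narrowing getTime() results to int and subtracting them can report the wrong order, while Long.compare does not have that problem.

```java
class TimestampCompareSketch {
    // Narrow both values to int and subtract: truncation and overflow can flip the sign.
    static int naiveCompare(long millis1, long millis2) {
        int time1 = (int) millis1;
        int time2 = (int) millis2;
        return time1 - time2;
    }

    // Compare the full 64-bit values without any arithmetic.
    static int safeCompare(long millis1, long millis2) {
        return Long.compare(millis1, millis2);
    }

    public static void main(String[] args) {
        long earlier = 0L;              // the epoch itself
        long later = 3_000_000_000L;    // roughly 35 days after the epoch
        // The naive version returns a positive number, claiming earlier > later.
        System.out.println(naiveCompare(earlier, later));
        // The safe version returns a negative number, as expected.
        System.out.println(safeCompare(earlier, later));
    }
}
```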