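My dataset has the following format:
userID mediaID rating
I want to count, across all users, the co-occurrences of any pair of mediaIDs that were both rated above a threshold. To do this, I followed several examples and implemented a composite key. I wrote a PairKey class that stores a unique pair, implements compareTo, and overrides hashCode and equals: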
public static class PairKey implements WritableComparable<PairKey> {
    private Integer lowID;
    private Integer highID;

    public PairKey() {
        this.lowID = -1;
        this.highID = -1;
    }

    public PairKey(Integer one, Integer two) {
        //should be impossible
        if (one.equals(two)) {
            throw new IllegalArgumentException("Cannot have a pair key with identical IDs");
        }
        if (one < two) {
            lowID = one;
            highID = two;
        }
        else {
            lowID = two;
            highID = one;
        }
    }

    public Integer getLowID() {
        return lowID;
    }

    public Integer getHighID() {
        return highID;
    }

    public void setLowID(Integer _lowID) {
        lowID = _lowID;
    }

    public void setHighID(Integer _highID) {
        highID = _highID;
    }

    @Override
    public int compareTo(PairKey other) {
        int _lowCompare = lowID.compareTo(other.getLowID());
        if (_lowCompare != 0) {
            return _lowCompare;
        }
        int _highCompare = highID.compareTo(other.getHighID());
        return _highCompare;
    }

    @Override
    public void write(DataOutput dataOutput) throws IOException {
        dataOutput.writeInt(lowID.intValue());
        dataOutput.writeInt(highID.intValue());
    }

    @Override
    public void readFields(DataInput dataInput) throws IOException {
        lowID = new Integer(dataInput.readInt());
        highID = new Integer(dataInput.readInt());
    }

    @Override
    public String toString() {
        return "<" + lowID + ", " + highID + ">";
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if ( o == null || this.getClass() != o.getClass()) {
            return false;
        }
        PairKey other = (PairKey) o;
        //compare fields
        if (this.lowID != null ? this.lowID.equals(other.getLowID()) == false : other.getLowID() != null) return false;
        if (this.highID != null ? this.highID.equals(other.getHighID()) == false : other.getHighID() != null) return false;
        return true;
    }

    @Override
    public int hashCode() {
        int _lowHash = this.lowID.hashCode();
        int _highHash = this.highID.hashCode();
        return 163 * (_lowHash ) + _highHash;
    }
}
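One way to check a key like this in isolation is to serialize it and read it back, then confirm the copy is equal to the original. The sketch below is a Hadoop-free stand-in, not the PairKey class itself: SimplePair and roundTrip are hypothetical names, and the class only mirrors the write/readFields method shapes using plain java.io streams so that it runs without the Hadoop libraries.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for PairKey; it does not implement Hadoop's WritableComparable.
final class SimplePair implements Comparable<SimplePair> {
    private int lowID = -1;
    private int highID = -1;

    SimplePair() {}

    SimplePair(int one, int two) {
        if (one == two) {
            throw new IllegalArgumentException("Cannot have a pair key with identical IDs");
        }
        // Normalize the order so that (2, 1) and (1, 2) are the same key.
        lowID = Math.min(one, two);
        highID = Math.max(one, two);
    }

    void write(DataOutput out) throws IOException {
        out.writeInt(lowID);
        out.writeInt(highID);
    }

    void readFields(DataInput in) throws IOException {
        lowID = in.readInt();
        highID = in.readInt();
    }

    @Override
    public int compareTo(SimplePair other) {
        int c = Integer.compare(lowID, other.lowID);
        return c != 0 ? c : Integer.compare(highID, other.highID);
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof SimplePair)) return false;
        SimplePair other = (SimplePair) o;
        return lowID == other.lowID && highID == other.highID;
    }

    @Override
    public int hashCode() {
        return 163 * lowID + highID;
    }

    @Override
    public String toString() {
        return "<" + lowID + ", " + highID + ">";
    }

    // Serializes the key to bytes and reads it back into a fresh instance.
    static SimplePair roundTrip(SimplePair key) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            key.write(new DataOutputStream(bytes));
            SimplePair copy = new SimplePair();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        SimplePair original = new SimplePair(2, 1);
        SimplePair copy = roundTrip(original);
        System.out.println(original + " equals copy: " + original.equals(copy));
        System.out.println("compareTo: " + original.compareTo(copy));
    }
}
```

If the copy is equal, hashes the same, and compares as 0, the key itself is probably not the reason identical pairs end up in separate groups.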
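Here is my mapper code. For each user, I collect every movieID whose rating passes the threshold into a set, then emit every possible pair from that set: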
public static class PairMapper extends Mapper<Text, Text, PairKey, IntWritable> {
    private Map<Integer, SortedSet<Integer>> temp = new HashMap<Integer, SortedSet<Integer>>();
    private IntWritable one = new IntWritable(1);
    private PairKey _key = new PairKey();

    public void map(Text key, Text value, Context context) throws IOException, InterruptedException {
        Integer userID = new Integer(key.toString());
        String[] vals = value.toString().split("\t");
        String _movieID = vals[0];
        String _rating = vals[1];
        Integer movieID = new Integer(_movieID);
        Integer rating = new Integer(_rating);
        if (rating > 3) {
            SortedSet candidates = temp.get(userID);
            if (candidates == null) {
                candidates = new TreeSet<Integer>();
            }
            candidates.add(movieID);
            temp.put(userID, candidates);
        }
    }//map

    public void cleanup(Context context) throws IOException, InterruptedException {
        for (Map.Entry<Integer, SortedSet<Integer>> e : temp.entrySet()) {
            SortedSet<Integer> _set = e.getValue();
            Integer [] arr = _set.toArray(new Integer[_set.size()]);
            for (int i = 0 ; i < arr.length-1 ; i++) {
                for (int j = i+1 ; j < arr.length ; j++) {
                    _key.setLowID(arr[i]);
                    _key.setHighID(arr[j]);
                    context.write(_key, one);
                }//for j
            }//for i
        }
    }//cleanup
}//PairMapper
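The nested loop in cleanup can be tried outside Hadoop. The following is a minimal sketch with hypothetical names (PairEnumeration, pairsOf); it returns the pairs as strings rather than writing them to a Context. Because a TreeSet iterates in ascending order, arr[i] is always smaller than arr[j] when i < j, which is why calling setLowID and setHighID directly produces a correctly ordered key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical helper that mirrors the pair enumeration in PairMapper.cleanup.
final class PairEnumeration {
    static List<String> pairsOf(SortedSet<Integer> ids) {
        Integer[] arr = ids.toArray(new Integer[0]);
        List<String> pairs = new ArrayList<>();
        for (int i = 0; i < arr.length - 1; i++) {
            for (int j = i + 1; j < arr.length; j++) {
                // arr is sorted ascending, so arr[i] is the low ID and arr[j] the high ID.
                pairs.add("<" + arr[i] + ", " + arr[j] + ">");
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Insertion order does not matter; the TreeSet sorts the IDs.
        SortedSet<Integer> liked = new TreeSet<>(Arrays.asList(7, 2, 5));
        System.out.println(pairsOf(liked)); // prints [<2, 5>, <2, 7>, <5, 7>]
    }
}
```

A set of n movie IDs yields n(n-1)/2 pairs, so the number of emitted records grows quadratically with the number of movies a single user rated above the threshold.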
Here is my reducer:
public static class PairReducer extends Reducer<PairKey, Iterable<IntWritable>, Text, IntWritable> {
    public void reduce(PairKey key, Iterable<IntWritable> vals, Context context) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : vals) {
            sum+= val.get();
        }//for
        IntWritable result = new IntWritable(sum);
        context.write(new Text(key.toString()), result);
    } //reduce
}
Here is the main method of my driver:
public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
        System.err.println("Usage: moviepairs <in> <out>");
        System.exit(2);
    }
    //CONFIGURE THE JOB
    Job job = new Job(conf, "movie pairs");
    job.setJarByClass(MoviePairs.class);
    job.setSortComparatorClass(CompositeKeyComparator.class);
    job.setPartitionerClass(NaturalKeyPartitioner.class);
    job.setGroupingComparatorClass(NaturalKeyGroupingComparator.class);
    //map-reduce classes
    job.setMapperClass(PairMapper.class);
    job.setCombinerClass(PairReducer.class);
    job.setReducerClass(PairReducer.class);
    //key-val classes
    job.setMapOutputKeyClass(PairKey.class);
    job.setMapOutputValueClass(IntWritable.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setInputFormatClass(KeyValueTextInputFormat.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true)? 0 :1);
}
I expected to get this in my reducer:
pair <1,2>: [1,1,1]
Instead, the reducer does not seem to recognize that the pairs are equal. It outputs:
pair<1,2>: [1]
pair<1,2>: [1]
pair<1,2>: [1]
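I'm not sure what I'm missing. As you can see, I have tried a few things, such as adding a custom sort comparator (which I don't believe I need), a grouping comparator, and a custom partitioner, but I thought that simply overriding hashCode/equals should take care of this? (Not sure.) All the examples I found online seem to follow this approach, and they all seem to work.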
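For context on the hashCode question: by default Hadoop uses hashCode only to pick the reducer partition, while grouping within a reducer goes through the key's comparator (compareTo, unless a custom comparator is set), not equals. The sketch below reproduces the formula used by Hadoop's default HashPartitioner in a standalone method; PartitionSketch and partitionFor are hypothetical names, and the Integer key simply stands in for a PairKey whose hash is 163 * 1 + 2.

```java
// Hypothetical standalone version of the default partitioning formula.
final class PartitionSketch {
    static int partitionFor(Object key, int numReduceTasks) {
        // Masking with Integer.MAX_VALUE clears the sign bit so the result is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        Integer pairHash = 163 * 1 + 2; // hash of the pair <1, 2> under PairKey.hashCode
        System.out.println(partitionFor(pairHash, 4));
    }
}
```

Equal keys with equal hash codes always land in the same partition, so a consistent hashCode is necessary for correct grouping but does not by itself merge the values.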
Answer 0 (score: 0)
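As with so many beginner problems, the cause turned out to be trivial: I messed up the type parameters of the Reducer class.
Instead of <KEYIN, VALUEIN, ...>
I had declared <KEYIN, Iterable<VALUEIN>, ...>
With that declaration, my reduce(PairKey, Iterable<IntWritable>, Context) method did not override Reducer.reduce, so Hadoop ran the default implementation, which writes each value out unchanged.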
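The effect can be reproduced without Hadoop. The sketch below uses MiniReducer, a hypothetical stand-in that keeps only the part of Reducer that matters here (a default reduce that passes values through); WrongReducer repeats the mistake from the question and RightReducer declares the value type directly. All class names are illustrative assumptions, not Hadoop APIs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for Hadoop's Reducer (not the real class): it keeps only
// the default reduce(), which passes every value through unchanged.
class MiniReducer<KEYIN, VALUEIN> {
    protected final List<String> output = new ArrayList<>();

    protected void reduce(KEYIN key, Iterable<VALUEIN> values) {
        for (VALUEIN value : values) {
            output.add(key + ": [" + value + "]");
        }
    }

    // Plays the role of the framework, which calls reduce() once per key group
    // without knowing the subclass's type arguments.
    @SuppressWarnings("unchecked")
    List<String> run(Object key, Iterable<?> values) {
        reduce((KEYIN) key, (Iterable<VALUEIN>) values);
        return output;
    }
}

// The mistake: VALUEIN is Iterable<Integer>, so the inherited method expects
// Iterable<Iterable<Integer>>. The method below is a separate overload that
// nothing ever calls.
class WrongReducer extends MiniReducer<String, Iterable<Integer>> {
    public void reduce(String key, Iterable<Integer> vals) {
        int sum = 0;
        for (int val : vals) sum += val;
        output.add(key + ": " + sum);
    }
}

// The fix: VALUEIN is Integer, and @Override makes the compiler verify the match.
class RightReducer extends MiniReducer<String, Integer> {
    @Override
    public void reduce(String key, Iterable<Integer> vals) {
        int sum = 0;
        for (int val : vals) sum += val;
        output.add(key + ": " + sum);
    }
}

class ReducerGenericsDemo {
    public static void main(String[] args) {
        List<Integer> ones = Arrays.asList(1, 1, 1);
        System.out.println(new WrongReducer().run("<1, 2>", ones)); // three separate entries
        System.out.println(new RightReducer().run("<1, 2>", ones)); // one summed entry
    }
}
```

Adding @Override to WrongReducer.reduce would turn the silent mismatch into a compile error. In the actual job, the corresponding change would presumably be to declare PairReducer as extending Reducer<PairKey, IntWritable, Text, IntWritable>. Separately, a combiner's output types are expected to match the map output types, so a reducer that emits Text keys is likely unsuitable for setCombinerClass when the map output key is PairKey.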