Question

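I am working on a Hadoop project, and after many visits to various blogs and a lot of documentation reading, I realized I need to use the secondary sort feature provided by the Hadoop framework.

My input format is of the form: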

DESC(String) Price(Integer) and some other Text

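I want the values in the reducer to arrive in descending order of Price. Also, for comparing DESC I have a method that takes two strings and a percentage; if the similarity between the two strings is equal to or greater than the percentage, I should consider them equal.

The problem is that after the reduce job finishes, I can see some DESC values that are similar to other strings, yet they end up in different groups.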

Here is the compareTo method of my composite key:

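public int compareTo(VendorKey o) {
    // 0 when the tokens are similar enough, otherwise always 1
    int result = compare(token, o.token, ":") >= percentage ? 0 : 1;
    if (result == 0) {
        // similar tokens: order by pid, descending
        return pid > o.pid ? -1 : pid < o.pid ? 1 : 0;
    }
    return result;
}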

And here is the compare method of the grouping comparator:

public int compare(WritableComparable a, WritableComparable b) {
    VendorKey one = (VendorKey) a;
    VendorKey two = (VendorKey) b;
    int result = ClusterUtil.compare(one.getToken(), two.getToken(), ":") >= one.getPercentage() ? 0 : 1;
    // if (result != 0)
    // return two.getToken().compareTo(one.getToken());
    return result;
}

Answer 1

Your compareTo method seems to violate the common contract, which requires sgn(x.compareTo(y)) to be equal to -sgn(y.compareTo(x)).
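To make the violation concrete, here is a minimal sketch in plain Java. SimilarityKey is a hypothetical stand-in for the question's VendorKey, and the similar() helper is an assumption (case-insensitive equality), since the real similarity measure is not shown. Only the shape of compareTo is taken from the question: it returns 0 or 1 on the token and never -1.

```java
/** Hypothetical stand-in for the question's VendorKey; only the comparison shape is kept. */
final class SimilarityKey implements Comparable<SimilarityKey> {
    final String token;
    final int pid;

    SimilarityKey(String token, int pid) {
        this.token = token;
        this.pid = pid;
    }

    // Assumption: tokens count as similar when equal ignoring case.
    // The question's real similarity method is not shown.
    static boolean similar(String a, String b) {
        return a.equalsIgnoreCase(b);
    }

    // Same shape as the question's compareTo: 0 or 1 on the token, never -1.
    @Override
    public int compareTo(SimilarityKey o) {
        int result = similar(token, o.token) ? 0 : 1;
        if (result == 0) {
            return Integer.compare(o.pid, pid); // descending pid
        }
        return result;
    }
}

class ContractCheck {
    // True when sgn(x.compareTo(y)) == -sgn(y.compareTo(x)) holds for this pair.
    static boolean antisymmetric(SimilarityKey x, SimilarityKey y) {
        return Integer.signum(x.compareTo(y)) == -Integer.signum(y.compareTo(x));
    }

    public static void main(String[] args) {
        SimilarityKey x = new SimilarityKey("apple", 1);
        SimilarityKey y = new SimilarityKey("banana", 2);
        System.out.println(x.compareTo(y));      // 1
        System.out.println(y.compareTo(x));      // 1
        System.out.println(antisymmetric(x, y)); // false
    }
}
```

Both directions report "greater than", so a sort that relies on this comparison has no consistent order for dissimilar keys, and similar keys are not guaranteed to end up next to each other.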

Answer 2

After defining the custom Writable, supply a basic partitioner that takes the composite key and a NullWritable value.

After that, specify the key sort comparator; grouping is then done by comparing two compositeKeyWritable variables.
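The roles of the sort comparator and the grouping comparator can be sketched without any Hadoop classes. This is only an illustration under assumptions: CompositeKey, SORT, GROUP and pricesPerGroup are hypothetical names, and desc is compared by exact equality rather than by the question's similarity method. In a real job the two comparators would presumably be registered through Job.setSortComparatorClass and Job.setGroupingComparatorClass.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Hypothetical composite key: the natural key (desc) plus the secondary field (price). */
final class CompositeKey {
    final String desc;
    final int price;

    CompositeKey(String desc, int price) {
        this.desc = desc;
        this.price = price;
    }
}

class SecondarySortSketch {
    // Sort comparator: natural key first, then price in descending order.
    static final Comparator<CompositeKey> SORT = (a, b) -> {
        int byDesc = a.desc.compareTo(b.desc);
        return byDesc != 0 ? byDesc : Integer.compare(b.price, a.price);
    };

    // Grouping comparator: natural key only, so all prices of one desc share a group.
    static final Comparator<CompositeKey> GROUP = (a, b) -> a.desc.compareTo(b.desc);

    // Sorts the keys, then starts a new group whenever GROUP says the key changed.
    static List<List<Integer>> pricesPerGroup(List<CompositeKey> keys) {
        List<CompositeKey> sorted = new ArrayList<>(keys);
        sorted.sort(SORT);
        List<List<Integer>> groups = new ArrayList<>();
        CompositeKey previous = null;
        for (CompositeKey key : sorted) {
            if (previous == null || GROUP.compare(previous, key) != 0) {
                groups.add(new ArrayList<>());
            }
            groups.get(groups.size() - 1).add(key.price);
            previous = key;
        }
        return groups;
    }

    public static void main(String[] args) {
        List<CompositeKey> keys = Arrays.asList(
            new CompositeKey("pen", 5), new CompositeKey("ink", 7),
            new CompositeKey("pen", 9), new CompositeKey("ink", 3));
        System.out.println(pricesPerGroup(keys)); // [[7, 3], [9, 5]]
    }
}
```

Grouping only looks at neighbouring keys in sorted order, which is why the sort comparator has to place every key of a group next to each other first.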

Answer 3

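There are three steps in the shuffle phase: partitioning, sorting, and grouping. I guess you have multiple reducers, and your similar results are processed by different reducers because they fall into different partitions.

You can set the number of reducers to 1, or set a custom partitioner for your job that extends org.apache.hadoop.mapreduce.Partitioner.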
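A small plain-Java sketch of why this can happen. It assumes the key's hashCode depends on the token string, which the question does not show, and it uses the usual hash-partitioning arithmetic (non-negative hash modulo the number of reducers). The prefixPartition method is a purely hypothetical alternative that only helps if similar tokens always share their first character.

```java
class PartitionSketch {
    // Hash partitioning: non-negative hash modulo the number of reducers.
    static int hashPartition(String token, int numReduceTasks) {
        return (token.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Hypothetical custom rule: partition on the first character only.
    // Assumption: similar tokens start with the same character.
    static int prefixPartition(String token, int numReduceTasks) {
        if (token.isEmpty()) {
            return 0;
        }
        return Character.toLowerCase(token.charAt(0)) % numReduceTasks;
    }

    public static void main(String[] args) {
        String a = "item-a";
        String b = "item-b";
        // With two reducers the two near-identical tokens land in different partitions.
        System.out.println(hashPartition(a, 2) != hashPartition(b, 2)); // true
        // With a single reducer everything shares partition 0.
        System.out.println(hashPartition(a, 1) == hashPartition(b, 1)); // true
        // The prefix rule keeps them together.
        System.out.println(prefixPartition(a, 2) == prefixPartition(b, 2)); // true
    }
}
```

Whatever rule the custom partitioner uses, it has to send every pair of keys that the grouping comparator treats as equal to the same partition; otherwise the grouping step never sees them together.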

Secondary sort in Hadoop
