My job consists of only a mapper, PrepareData, which converts text data into a SequenceFile with VLongWritable as the key and DoubleArrayWritable as the value.
When I run it on 455000x90 rows of data (~384 MB), for example:
13.124,123.12,12.12,... 1.12
23.12,1.5,12.6,... 6.123
...
in local mode it takes 52-53 seconds on average.
But when I run it on a real cluster of these two machines (Athlon 64 X2 Dual Core 5600+ and 3700+), it takes 81 seconds in the best case.
The job is executed with 4 mappers (block size ~96 MB) and 2 reducers.
The cluster runs Hadoop 0.21.0 and is configured with JVM reuse.
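For reference, a job like this would be wired up roughly as follows (a minimal driver sketch, assuming the new mapreduce API of Hadoop 0.21; PrepareDataDriver and the input/output paths are made-up names, not taken from my actual code):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.VLongWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class PrepareDataDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("dataDimSize", 90);          // number of columns per row, read by the mapper

        Job job = new Job(conf, "PrepareData");  // Job.getInstance() did not exist yet in 0.21
        job.setJarByClass(PrepareDataDriver.class);
        job.setMapperClass(PrepareDataMapper.class);
        job.setNumReduceTasks(2);                // identity reducers, as in the setup above
        job.setOutputKeyClass(VLongWritable.class);
        job.setOutputValueClass(DoubleArrayWritable.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}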
Mapper:
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.VLongWritable;
import org.apache.hadoop.mapreduce.Mapper;

public class PrepareDataMapper
        extends Mapper<LongWritable, Text, VLongWritable, DoubleArrayWritable> {

    private int size;
    // Writable instances are reused across map() calls to avoid per-record allocation
    private DoubleWritable[] doubleArray;
    private DoubleArrayWritable mapperOutArray = new DoubleArrayWritable();
    private VLongWritable mapOutKey = new VLongWritable();

    @Override
    protected void setup(Context context) throws IOException {
        Configuration conf = context.getConfiguration();
        size = conf.getInt("dataDimSize", 0);
        doubleArray = new DoubleWritable[size];
        for (int i = 0; i < size; i++) {
            doubleArray[i] = new DoubleWritable();
        }
    }

    @Override
    public void map(LongWritable key, Text row, Context context)
            throws IOException, InterruptedException {
        String[] fields = row.toString().split(",");
        for (int i = 0; i < size; i++) {
            doubleArray[i].set(Double.valueOf(fields[i]));
        }
        mapperOutArray.set(doubleArray);
        // the byte offset of the input line becomes the record key
        mapOutKey.set(key.get());
        context.write(mapOutKey, mapperOutArray);
    }
}
DoubleArrayWritable:
import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.DoubleWritable;

public class DoubleArrayWritable extends ArrayWritable {

    public DoubleArrayWritable() {
        super(DoubleWritable.class);
    }

    public DoubleArrayWritable(DoubleWritable[] values) {
        super(DoubleWritable.class, values);
    }

    public void set(DoubleWritable[] values) {
        super.set(values);
    }

    public DoubleWritable get(int idx) {
        return (DoubleWritable) get()[idx];
    }

    // copies elements [from, to] (inclusive) into a plain double[]
    public double[] getVector(int from, int to) {
        int sz = to - from + 1;
        double[] vector = new double[sz];
        for (int i = from; i <= to; i++) {
            vector[i - from] = get(i).get();
        }
        return vector;
    }
}
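For completeness, reading the resulting SequenceFile back would look roughly like this (a sketch only, assuming the SequenceFile.Reader(fs, path, conf) constructor still present in 0.21; ReadBack and the path are placeholder names):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.VLongWritable;

public class ReadBack {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path(args[0]);            // e.g. one part file of the job output

        VLongWritable key = new VLongWritable();
        DoubleArrayWritable value = new DoubleArrayWritable();

        SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
        try {
            while (reader.next(key, value)) {
                // getVector(0, 89) would return the full 90-column row
                double[] firstThree = value.getVector(0, 2);
                System.out.println(key.get() + " -> "
                        + firstThree[0] + "," + firstThree[1] + "," + firstThree[2]);
            }
        } finally {
            reader.close();
        }
    }
}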
Answer (score: 2):
My guess is that the difference comes from job startup time. In local mode it is a few seconds, whereas on a cluster it is usually tens of seconds.
To verify this hypothesis, you could feed in more data and check whether the cluster then outperforms the single node.
Another possible reason is that you may not have enough mappers to fully utilize your hardware. I would suggest trying a number of mappers equal to 2x the number of cores, as in the sketch below.
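Roughly the knobs I mean (the helper method and property name are from memory of the 0.20/0.21 line, so check them against your version; SplitTuning is just an illustrative name):

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitTuning {
    // A smaller max split size means more map tasks for the same input:
    // with ~48 MB splits, the 384 MB input yields ~8 mappers instead of 4.
    public static void useSmallerSplits(Job job) {
        FileInputFormat.setMaxInputSplitSize(job, 48L * 1024 * 1024);
    }
}

// Per-node concurrency is a separate knob, set in mapred-site.xml on each tasktracker:
//   mapred.tasktracker.map.tasks.maximum = 4   (i.e. 2x cores on a dual-core machine)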