My job consists of only a mapper, PrepareData, which converts text data into a SequenceFile with VLongWritable as the key and DoubleArrayWritable as the value.
When I run it on 455000x90 rows of data (~384 MB), for example:
13.124,123.12,12.12,... 1.12
23.12,1.5,12.6,... 6.123
...
in local mode it takes 52-53 seconds on average.
But when I run it on a real cluster of these two machines (Athlon 64 X2 Dual Core 5600+ and 3700+), it takes 81 seconds in the best case.
The job is executed with 4 mappers (block size ~96 MB) and 2 reducers.
The cluster runs Hadoop 0.21.0 and is configured with JVM reuse.
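For reference, a job like this would be wired up roughly as follows (a minimal driver sketch, assuming the new mapreduce API of Hadoop 0.21; PrepareDataDriver and the input/output paths are made-up names, not taken from my actual code):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.VLongWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class PrepareDataDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("dataDimSize", 90);          // number of columns per row, read by the mapper

        Job job = new Job(conf, "PrepareData");  // Job.getInstance() did not exist yet in 0.21
        job.setJarByClass(PrepareDataDriver.class);
        job.setMapperClass(PrepareDataMapper.class);
        job.setNumReduceTasks(2);                // identity reducers, as in the setup above
        job.setOutputKeyClass(VLongWritable.class);
        job.setOutputValueClass(DoubleArrayWritable.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}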
Mapper:
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.VLongWritable;
import org.apache.hadoop.mapreduce.Mapper;

public class PrepareDataMapper
        extends Mapper<LongWritable, Text, VLongWritable, DoubleArrayWritable> {

    private int size;
    // Writable instances are reused across map() calls to avoid per-record allocation
    private DoubleWritable[] doubleArray;
    private DoubleArrayWritable mapperOutArray = new DoubleArrayWritable();
    private VLongWritable mapOutKey = new VLongWritable();

    @Override
    protected void setup(Context context) throws IOException {
        Configuration conf = context.getConfiguration();
        size = conf.getInt("dataDimSize", 0);
        doubleArray = new DoubleWritable[size];
        for (int i = 0; i < size; i++) {
            doubleArray[i] = new DoubleWritable();
        }
    }

    @Override
    public void map(LongWritable key, Text row, Context context)
            throws IOException, InterruptedException {
        String[] fields = row.toString().split(",");
        for (int i = 0; i < size; i++) {
            doubleArray[i].set(Double.valueOf(fields[i]));
        }
        mapperOutArray.set(doubleArray);
        // the byte offset of the input line becomes the record key
        mapOutKey.set(key.get());
        context.write(mapOutKey, mapperOutArray);
    }
}
DoubleArrayWritable:
import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.DoubleWritable;

public class DoubleArrayWritable extends ArrayWritable {

    public DoubleArrayWritable() {
        super(DoubleWritable.class);
    }

    public DoubleArrayWritable(DoubleWritable[] values) {
        super(DoubleWritable.class, values);
    }

    public void set(DoubleWritable[] values) {
        super.set(values);
    }

    public DoubleWritable get(int idx) {
        return (DoubleWritable) get()[idx];
    }

    // copies elements [from, to] (inclusive) into a plain double[]
    public double[] getVector(int from, int to) {
        int sz = to - from + 1;
        double[] vector = new double[sz];
        for (int i = from; i <= to; i++) {
            vector[i - from] = get(i).get();
        }
        return vector;
    }
}
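For completeness, reading the resulting SequenceFile back would look roughly like this (a sketch only, assuming the SequenceFile.Reader(fs, path, conf) constructor still present in 0.21; ReadBack and the path are placeholder names):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.VLongWritable;

public class ReadBack {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path(args[0]);            // e.g. one part file of the job output

        VLongWritable key = new VLongWritable();
        DoubleArrayWritable value = new DoubleArrayWritable();

        SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
        try {
            while (reader.next(key, value)) {
                // getVector(0, 89) would return the full 90-column row
                double[] firstThree = value.getVector(0, 2);
                System.out.println(key.get() + " -> "
                        + firstThree[0] + "," + firstThree[1] + "," + firstThree[2]);
            }
        } finally {
            reader.close();
        }
    }
}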
Answer (score: 2):
My guess is that the difference comes from job startup time. In local mode it is a few seconds, whereas on a cluster it is usually tens of seconds.
To verify this hypothesis, you could feed in more data and check whether the cluster then outperforms the single node.
Another possible reason is that you may not have enough mappers to fully utilize your hardware. I would suggest trying a number of mappers equal to 2x the number of cores, as in the sketch below.
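Roughly the knobs I mean (the helper method and property name are from memory of the 0.20/0.21 line, so check them against your version; SplitTuning is just an illustrative name):

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitTuning {
    // A smaller max split size means more map tasks for the same input:
    // with ~48 MB splits, the 384 MB input yields ~8 mappers instead of 4.
    public static void useSmallerSplits(Job job) {
        FileInputFormat.setMaxInputSplitSize(job, 48L * 1024 * 1024);
    }
}

// Per-node concurrency is a separate knob, set in mapred-site.xml on each tasktracker:
//   mapred.tasktracker.map.tasks.maximum = 4   (i.e. 2x cores on a dual-core machine)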