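To my surprise, when "optimizing" I get a longer time (10 ms) by pre-generating the results in an array than by doing the multiplications as in the original (8 ms). Is this just a Java quirk, or is it general to the PC architecture? I have a Core i5 760 with Java 7 on Windows 8 64-bit.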
public class Test {
    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        long sum = 0;
        int[] sqr = new int[1000];
        for (int a = 1; a < 1000; a++) { sqr[a] = a * a; }
        for (int b = 1; b < 1000; b++)
            // for (int a = 1; a < 1000; a++) { sum += a * a + b * b; }
            for (int a = 1; a < 1000; a++) { sum += sqr[a] + sqr[b]; }
        System.out.println(System.currentTimeMillis() - start + "ms");
        System.out.println(sum);
    }
}
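Answer 0 (score: 12)
Konrad Rudolph commented on the issues with the benchmarking. So I am ignoring the benchmark and focusing on the question:
Is multiplication faster than array access?
Yes, most likely. It used to be the other way around 20 or 30 years ago.
Roughly speaking, you can do an integer multiplication in 3 cycles (pessimistically, if you don't get vector instructions), while a memory access costs you 4 cycles if it is served straight from the L1 cache, and it goes straight downhill from there. For reference, see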
Intel 64 and IA-32 Architectures Optimization Reference Manual
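Herb Sutter's talk on the subject: Machine Architecture: Things Your Programming Language Never Told You
One Java-specific issue was pointed out by Ingo in the comments below: in Java you also get bounds checking, which makes the already slower array access even slower...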
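To make the bounds-checking point concrete, here is a minimal sketch of what an array read conceptually involves in Java. The class BoundsCheckSketch and the helper checkedGet are hypothetical names used only for illustration; the JVM performs this check implicitly, and a JIT compiler may remove it when it can prove the index is in range.

```java
public class BoundsCheckSketch {

    // Hypothetical helper: spells out the range check that Java performs
    // implicitly on every a[i] before the element is actually loaded.
    static int checkedGet(int[] a, int i) {
        if (i < 0 || i >= a.length) {
            throw new ArrayIndexOutOfBoundsException(i);
        }
        return a[i];
    }

    public static void main(String[] args) {
        int[] sqr = new int[1000];
        for (int a = 1; a < 1000; a++) {
            sqr[a] = a * a;
        }

        // In-range read: behaves exactly like sqr[7].
        System.out.println(checkedGet(sqr, 7)); // prints 49

        // Out-of-range read: the built-in access throws as well.
        try {
            int unused = sqr[1000];
            System.out.println("no exception: " + unused);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("index 1000 rejected");
        }
    }
}
```

The comparison and branch are extra work that the plain multiplication a * a never has to do.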
Answer 1 (score: 2)
A more sensible benchmark would be:
import java.math.BigDecimal;
import java.util.Random;

public abstract class Benchmark {

    final String name;

    public Benchmark(String name) {
        this.name = name;
    }

    abstract int run(int iterations) throws Throwable;

    // Roughly doubles the iteration count until a single run takes at least
    // one second, then returns the average time per iteration in ns.
    private BigDecimal time() {
        try {
            int nextI = 1;
            int i;
            long duration;
            do {
                i = nextI;
                long start = System.nanoTime();
                run(i);
                duration = System.nanoTime() - start;
                nextI = (i << 1) | 1;
            } while (duration < 1000000000 && nextI > 0);
            return new BigDecimal((duration) * 1000 / i).movePointLeft(3);
        } catch (Throwable e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public String toString() {
        return name + "\t" + time() + " ns";
    }

    // Fisher-Yates shuffle, used to build a random access order.
    private static void shuffle(int[] a) {
        Random chaos = new Random();
        for (int i = a.length; i > 0; i--) {
            int r = chaos.nextInt(i);
            int t = a[r];
            a[r] = a[i - 1];
            a[i - 1] = t;
        }
    }

    public static void main(String[] args) throws Exception {
        final int[] table = new int[1000];
        final int[] permutation = new int[1000];

        for (int i = 0; i < table.length; i++) {
            table[i] = i * i;
            permutation[i] = i;
        }
        shuffle(permutation);

        Benchmark[] marks = {
            new Benchmark("sequential multiply") {
                @Override
                int run(int iterations) throws Throwable {
                    int sum = 0;
                    for (int j = 0; j < iterations; j++) {
                        for (int i = 0; i < table.length; i++) {
                            sum += i * i;
                        }
                    }
                    return sum;
                }
            },
            new Benchmark("sequential lookup") {
                @Override
                int run(int iterations) throws Throwable {
                    int sum = 0;
                    for (int j = 0; j < iterations; j++) {
                        for (int i = 0; i < table.length; i++) {
                            sum += table[i];
                        }
                    }
                    return sum;
                }
            },
            new Benchmark("random order multiply") {
                @Override
                int run(int iterations) throws Throwable {
                    int sum = 0;
                    for (int j = 0; j < iterations; j++) {
                        for (int i = 0; i < table.length; i++) {
                            sum += permutation[i] * permutation[i];
                        }
                    }
                    return sum;
                }
            },
            new Benchmark("random order lookup") {
                @Override
                int run(int iterations) throws Throwable {
                    int sum = 0;
                    for (int j = 0; j < iterations; j++) {
                        for (int i = 0; i < table.length; i++) {
                            sum += table[permutation[i]];
                        }
                    }
                    return sum;
                }
            }
        };

        for (Benchmark mark : marks) {
            System.out.println(mark);
        }
    }
}
On my Intel Core Duo (yes, it's old) this prints:
sequential multiply 2218.666 ns
sequential lookup 1081.220 ns
random order multiply 2416.923 ns
random order lookup 2351.293 ns
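So, if I access the lookup array sequentially (minimizing the number of cache misses) and allow the HotSpot JVM to optimize away the bounds checks on the array accesses, the lookup comes out ahead for an array of 1000 elements. If we access the array in random order, that advantage disappears. Also, if the table is larger, the lookup gets slower. For instance, for 10000 elements, I get: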
sequential multiply 23192.236 ns
sequential lookup 12701.695 ns
random order multiply 24459.697 ns
random order lookup 31595.523 ns
So array lookup is not faster than multiplication unless the access pattern is (nearly) sequential and the lookup array is small.
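One possible reason the larger table hurts the random-order lookup is its memory footprint. The sketch below only does the arithmetic; the 32 KB L1 data cache size is an assumption for illustration (check the value for your own CPU), and TableFootprint, ASSUMED_L1_BYTES and the helper methods are hypothetical names. I did not measure cache misses directly.

```java
public class TableFootprint {

    // Assumption: a 32 KB L1 data cache. This is NOT taken from the
    // measurements above; substitute the value for your own CPU.
    static final int ASSUMED_L1_BYTES = 32 * 1024;

    // Bytes of element data in an int[] of the given length
    // (a Java int is 4 bytes; the array's object header is ignored).
    static long intArrayBytes(int length) {
        return (long) length * 4;
    }

    static boolean fitsInAssumedL1(long bytes) {
        return bytes <= ASSUMED_L1_BYTES;
    }

    public static void main(String[] args) {
        for (int n : new int[] {1000, 10000}) {
            // The random order lookup reads both table and permutation,
            // so its working set is two int arrays of length n.
            long bytes = 2 * intArrayBytes(n);
            System.out.println(n + " elements: " + bytes
                    + " bytes, fits in assumed L1: " + fitsInAssumedL1(bytes));
        }
    }
}
```

With 1000 elements the two arrays take 8000 bytes; with 10000 elements they take 80000 bytes, which no longer fits under the assumed 32 KB.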
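In any case, my measurements indicate that a multiplication (and addition) takes only about 4 processor cycles (2.3 ns per loop iteration on a 2 GHz CPU). You're unlikely to get much faster than that. Also, unless you do half a billion multiplications per second, the multiplications are not your bottleneck, and optimizing other parts of the code will be more fruitful.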
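For reference, the conversion from measured time to cycles can be sketched as follows. The inputs are the 1000-element figures printed above and the 2 GHz clock; CycleEstimate and cyclesPerIteration are hypothetical names for this illustration.

```java
import java.util.Locale;

public class CycleEstimate {

    // Converts the time for one pass over the table into CPU cycles per
    // inner-loop iteration: (ns per pass / elements) * (cycles per ns).
    static double cyclesPerIteration(double nsPerPass, int elementsPerPass, double clockGHz) {
        double nsPerIteration = nsPerPass / elementsPerPass;
        return nsPerIteration * clockGHz; // 1 GHz equals 1 cycle per ns
    }

    public static void main(String[] args) {
        // Measured values for the 1000-element table, 2 GHz clock.
        double multiply = cyclesPerIteration(2218.666, 1000, 2.0);
        double lookup = cyclesPerIteration(1081.220, 1000, 2.0);

        System.out.println(String.format(Locale.ROOT,
                "sequential multiply: %.2f cycles per iteration", multiply)); // 4.44
        System.out.println(String.format(Locale.ROOT,
                "sequential lookup:   %.2f cycles per iteration", lookup));   // 2.16
    }
}
```

Each figure covers the whole loop iteration (the multiply or load, the add, and the loop overhead), so it is an upper bound on the cost of the multiplication itself.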