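I have been running some tests to see how inlining function code (explicitly writing the function's algorithm in the calling code itself) affects performance. I wrote a simple byte-array-to-int conversion, then wrapped it in a function, called it statically from another class, and called it statically from the class itself. The code is as follows: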
public class FunctionCallSpeed {
    public static final int numIter = 50000000;

    public static void main (String [] args) {
        byte [] n = new byte[4];
        long start;

        System.out.println("Function from Static Class =================");
        start = System.nanoTime();
        for (int i = 0; i < numIter; i++) {
            StaticClass.toInt(n);
        }
        System.out.println("Elapsed time: " + (double)(System.nanoTime() - start) / 1000000000 + "s");

        System.out.println("Function from Class ========================");
        start = System.nanoTime();
        for (int i = 0; i < numIter; i++) {
            toInt(n);
        }
        System.out.println("Elapsed time: " + (double)(System.nanoTime() - start) / 1000000000 + "s");

        int actual = 0;
        int len = n.length;
        System.out.println("Inline Function ============================");
        start = System.nanoTime();
        for (int i = 0; i < numIter; i++) {
            for (int j = 0; j < len; j++) {
                actual += n[len - 1 - j] << 8 * j;
            }
        }
        System.out.println("Elapsed time: " + (double)(System.nanoTime() - start) / 1000000000 + "s");
    }

    public static int toInt(byte [] num) {
        int actual = 0;
        int len = num.length;
        for (int i = 0; i < len; i++) {
            actual += num[len - 1 - i] << 8 * i;
        }
        return actual;
    }
}
The results are as follows:
Function from Static Class =================
Elapsed time: 0.096559931s
Function from Class ========================
Elapsed time: 0.015741711s
Inline Function ============================
Elapsed time: 0.837626286s
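Is there something odd going on in the bytecode? I looked at the bytecode myself, but I am not very familiar with it and could not make heads or tails of it.
Edit
I added assert statements that read the output, and then randomized the bytes being read; the benchmark now behaves the way I expected it to. Thanks to Tomasz Nurkiewicz, who pointed me to the microbenchmark article. The resulting code is: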
public class FunctionCallSpeed {
    public static final int numIter = 50000000;

    public static void main (String [] args) {
        byte [] n;
        long start, end;
        int checker, calc;

        end = 0;
        System.out.println("Function from Object =================");
        for (int i = 0; i < numIter; i++) {
            checker = (int)(Math.random() * 65535);
            n = toByte(checker);
            start = System.nanoTime();
            calc = StaticClass.toInt(n);
            end += System.nanoTime() - start;
            assert calc == checker;
        }
        System.out.println("Elapsed time: " + (double)end / 1000000000 + "s");

        end = 0;
        System.out.println("Function from Class ==================");
        start = System.nanoTime();
        for (int i = 0; i < numIter; i++) {
            checker = (int)(Math.random() * 65535);
            n = toByte(checker);
            start = System.nanoTime();
            calc = toInt(n);
            end += System.nanoTime() - start;
            assert calc == checker;
        }
        System.out.println("Elapsed time: " + (double)end / 1000000000 + "s");

        int len = 4;
        end = 0;
        System.out.println("Inline Function ======================");
        start = System.nanoTime();
        for (int i = 0; i < numIter; i++) {
            calc = 0;
            checker = (int)(Math.random() * 65535);
            n = toByte(checker);
            start = System.nanoTime();
            for (int j = 0; j < len; j++) {
                calc += n[len - 1 - j] << 8 * j;
            }
            end += System.nanoTime() - start;
            assert calc == checker;
        }
        System.out.println("Elapsed time: " + (double)(System.nanoTime() - start) / 1000000000 + "s");
    }

    public static byte [] toByte(int val) {
        byte [] n = new byte[4];
        for (int i = 0; i < 4; i++) {
            n[i] = (byte)((val >> 8 * i) & 0xFF);
        }
        return n;
    }

    public static int toInt(byte [] num) {
        int actual = 0;
        int len = num.length;
        for (int i = 0; i < len; i++) {
            actual += num[len - 1 - i] << 8 * i;
        }
        return actual;
    }
}
Results:
Function from Static Class =================
Elapsed time: 9.276437031s
Function from Class ========================
Elapsed time: 9.225660708s
Inline Function ============================
Elapsed time: 5.9512E-5s
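A few things in the edited code are worth double-checking before trusting these numbers. The last block prints System.nanoTime() - start (the time since the final iteration began) instead of the accumulated end, which likely explains the 5.9512E-5s figure. Java also ignores assert unless the JVM is started with -ea, so the checks probably never ran. And as written, toByte stores the low byte in n[0] while toInt treats the last element as the low byte, and bytes above 127 sign-extend when shifted. Below is a minimal sketch of a round-trip-consistent pair, assuming big-endian order; the class name ByteOrderCheck is hypothetical.

```java
public class ByteOrderCheck {
    // Assumed convention: big-endian, so n[0] holds the most significant byte.
    public static byte[] toByte(int val) {
        byte[] n = new byte[4];
        for (int i = 0; i < 4; i++) {
            n[3 - i] = (byte) ((val >> 8 * i) & 0xFF);
        }
        return n;
    }

    // Reads the bytes back in the same order. The & 0xFF mask keeps a
    // negative byte (values above 127) from sign-extending into the high bits.
    public static int toInt(byte[] num) {
        int actual = 0;
        int len = num.length;
        for (int i = 0; i < len; i++) {
            actual += (num[len - 1 - i] & 0xFF) << 8 * i;
        }
        return actual;
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, 200, 65534, -1, Integer.MIN_VALUE};
        for (int v : samples) {
            // Explicit check instead of assert, so it runs without -ea.
            if (toInt(toByte(v)) != v) {
                throw new AssertionError("round trip failed for " + v);
            }
        }
        System.out.println("round trip ok");
    }
}
```

The explicit if/throw is a deliberate choice here: it behaves the same whether or not assertions are enabled.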
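Answer 0 (score: 5)
It is always hard to say for certain what the JIT is doing, but if I had to guess, it noticed that the function's return value is never used and optimized much of the work away.
If you actually use the function's return value, I bet the timings will change.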
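As a rough illustration of "using the return value", the sketch below accumulates every result into a sum and prints it, which makes it harder for the JIT to treat the calls as dead code. The class name UseResult, the run helper, and the choice to vary n[3] on each iteration are assumptions for this example, not part of the original benchmark.

```java
public class UseResult {
    // Same conversion as in the question.
    public static int toInt(byte[] num) {
        int actual = 0;
        int len = num.length;
        for (int i = 0; i < len; i++) {
            actual += num[len - 1 - i] << 8 * i;
        }
        return actual;
    }

    // Returns the accumulated sum so the result of every call is observable.
    public static long run(byte[] n, int iterations) {
        long sink = 0;
        for (int i = 0; i < iterations; i++) {
            // Vary the input (kept in 0..127 to stay non-negative) so the
            // result is not the same value on every iteration.
            n[3] = (byte) (i & 0x7F);
            sink += toInt(n);
        }
        return sink;
    }

    public static void main(String[] args) {
        byte[] n = new byte[4];
        long start = System.nanoTime();
        long sink = run(n, 50_000_000);
        long elapsed = System.nanoTime() - start;
        // Printing the sink keeps the whole computation live.
        System.out.println("sink=" + sink + " elapsed=" + elapsed / 1e9 + "s");
    }
}
```

This does not guarantee the JIT leaves the loop alone, but it removes the most obvious reason for it to discard the work.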
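Answer 1 (score: 3)
You have several problems, but the main one is that you are testing a single run of code that is being optimised as it runs. That is bound to give you mixed results. I suggest running each test for 2 seconds and ignoring the first 10,000 iterations.
If you don't keep the result of the loop, the whole loop can be discarded after some random interval.
Break each test out into its own method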
public class FunctionCallSpeed {
    public static final int numIter = 50000000;
    private static int dontOptimiseAway;

    public static void main(String[] args) {
        byte[] n = new byte[4];
        for (int i = 0; i < 10; i++) {
            test1(n);
            test2(n);
            test3(n);
            System.out.println();
        }
    }

    private static void test1(byte[] n) {
        System.out.print("from Static Class: ");
        long start = System.nanoTime();
        for (int i = 0; i < numIter; i++) {
            dontOptimiseAway = FunctionCallSpeed.toInt(n);
        }
        System.out.print((System.nanoTime() - start) / numIter + "ns ");
    }

    private static void test2(byte[] n) {
        long start;
        System.out.print("from Class: ");
        start = System.nanoTime();
        for (int i = 0; i < numIter; i++) {
            dontOptimiseAway = toInt(n);
        }
        System.out.print((System.nanoTime() - start) / numIter + "ns ");
    }

    private static void test3(byte[] n) {
        long start;
        int actual = 0;
        int len = n.length;
        System.out.print("Inlined: ");
        start = System.nanoTime();
        for (int i = 0; i < numIter; i++) {
            for (int j = 0; j < len; j++) {
                actual += n[len - 1 - j] << 8 * j;
            }
            dontOptimiseAway = actual;
        }
        System.out.print((System.nanoTime() - start) / numIter + "ns ");
    }

    public static int toInt(byte[] num) {
        int actual = 0;
        int len = num.length;
        for (int i = 0; i < len; i++) {
            actual += num[len - 1 - i] << 8 * i;
        }
        return actual;
    }
}
prints
from Class: 7ns Inlined: 11ns from Static Class: 9ns
from Class: 6ns Inlined: 8ns from Static Class: 8ns
from Class: 6ns Inlined: 9ns from Static Class: 6ns
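This suggests the code is slightly more efficient when the inner loop is optimised separately.
However, if I use an optimised byte-to-int conversion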
public static int toInt(byte[] num) {
    return num[0] + (num[1] << 8) + (num[2] << 16) + (num[3] << 24);
}
all the tests report
from Static Class: 0ns from Class: 0ns Inlined: 0ns
from Static Class: 0ns from Class: 0ns Inlined: 0ns
from Static Class: 0ns from Class: 0ns Inlined: 0ns
because it realises the test isn't doing anything useful. ;)
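The advice at the top of this answer (skip the first iterations, then measure for a fixed wall-clock window while keeping the result) could be sketched roughly as below. The class name TimedRun, the measure helper, and the batch size of 1000 are assumptions made for this example; the 10,000 warm-up calls and the 2-second window come from the suggestion above.

```java
public class TimedRun {
    // Sink field, in the same spirit as dontOptimiseAway above.
    private static int dontOptimiseAway;

    public static int toInt(byte[] num) {
        int actual = 0;
        int len = num.length;
        for (int i = 0; i < len; i++) {
            actual += num[len - 1 - i] << 8 * i;
        }
        return actual;
    }

    // Makes warmupIters untimed calls, then times calls for at least
    // durationNanos and returns the average nanoseconds per call.
    public static double measure(byte[] n, int warmupIters, long durationNanos) {
        for (int i = 0; i < warmupIters; i++) {
            dontOptimiseAway += toInt(n);
        }
        long iterations = 0;
        long start = System.nanoTime();
        long elapsed;
        do {
            // Read the clock once per batch so timer overhead stays small.
            for (int i = 0; i < 1000; i++) {
                dontOptimiseAway += toInt(n);
            }
            iterations += 1000;
            elapsed = System.nanoTime() - start;
        } while (elapsed < durationNanos);
        return (double) elapsed / iterations;
    }

    public static void main(String[] args) {
        byte[] n = new byte[4];
        double nsPerCall = measure(n, 10_000, 2_000_000_000L);
        System.out.println(nsPerCall + " ns per call (sink=" + dontOptimiseAway + ")");
    }
}
```

Returning a double instead of dividing two longs avoids the truncation that makes fast code show up as 0ns.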
Answer 2 (score: 3)
I ported your test case to caliper:
import com.google.caliper.SimpleBenchmark;

public class ToInt extends SimpleBenchmark {
    private byte[] n;
    private int total;

    @Override
    protected void setUp() throws Exception {
        n = new byte[4];
    }

    public int timeStaticClass(int reps) {
        for (int i = 0; i < reps; i++) {
            total += StaticClass.toInt(n);
        }
        return total;
    }

    public int timeFromClass(int reps) {
        for (int i = 0; i < reps; i++) {
            total += toInt(n);
        }
        return total;
    }

    public int timeInline(int reps) {
        for (int i = 0; i < reps; i++) {
            int actual = 0;
            int len = n.length;
            for (int i1 = 0; i1 < len; i1++) {
                actual += n[len - 1 - i1] << 8 * i1;
            }
            total += actual;
        }
        return total;
    }

    public static int toInt(byte[] num) {
        int actual = 0;
        int len = num.length;
        for (int i = 0; i < len; i++) {
            actual += num[len - 1 - i] << 8 * i;
        }
        return actual;
    }
}

class StaticClass {
    public static int toInt(byte[] num) {
        int actual = 0;
        int len = num.length;
        for (int i = 0; i < len; i++) {
            actual += num[len - 1 - i] << 8 * i;
        }
        return actual;
    }
}
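It does indeed look like the inlined version is the slowest, while the two static versions are nearly identical (as expected).
The reasons are hard to pin down. I can think of two factors:
The JVM is better at micro-optimisations when the blocks of code are as small and as easy to reason about as possible. When the function is inlined by hand, the whole code becomes more complex and the JVM gives up. With the smaller toInt() function, the JIT does a smarter job.
Cache locality: somehow the JVM performs better with two small chunks of code (the loop and the method) than with one bigger chunk.
Answer 3 (score: 0)
Your test is flawed. The second test benefits from the first test having already run. You need to run each test case in its own JVM invocation.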
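One way to get a separate JVM per test case is to pick the case from a command-line argument and launch the program once per case. The sketch below assumes a hypothetical class SingleCase with two case names, "call" and "inline". It only shows the isolation idea and does no warm-up of its own.

```java
public class SingleCase {
    public static int toInt(byte[] num) {
        int actual = 0;
        int len = num.length;
        for (int i = 0; i < len; i++) {
            actual += num[len - 1 - i] << 8 * i;
        }
        return actual;
    }

    // Runs exactly one named case and returns a checksum so the work stays observable.
    public static long runCase(String name, int iterations) {
        byte[] n = {0, 0, 0, 1};
        long sum = 0;
        switch (name) {
            case "call":
                for (int i = 0; i < iterations; i++) {
                    sum += toInt(n);
                }
                break;
            case "inline":
                int len = n.length;
                for (int i = 0; i < iterations; i++) {
                    int actual = 0;
                    for (int j = 0; j < len; j++) {
                        actual += n[len - 1 - j] << 8 * j;
                    }
                    sum += actual;
                }
                break;
            default:
                throw new IllegalArgumentException("unknown case: " + name);
        }
        return sum;
    }

    public static void main(String[] args) {
        // The case name comes from the command line; "call" is an arbitrary default.
        String name = args.length > 0 ? args[0] : "call";
        long start = System.nanoTime();
        long sum = runCase(name, 50_000_000);
        long elapsed = System.nanoTime() - start;
        System.out.println(name + ": " + elapsed / 1e9 + "s (checksum " + sum + ")");
    }
}
```

Running java SingleCase call and then java SingleCase inline gives each case a fresh JVM, so neither benefits from code the other has already compiled.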