我应该尝试为我的团队优化这种方法,使用Java编写视频解码器,尽管我没有看到任何好的方法。下面的函数看起来似乎没有任何显着的加速,因为它主要包含简单的加法/减法/等。
void inverseTransform(int macroBlockIndex, int dataBlockIndex) {
int[] workSpace = new int[64];
short[] data = new short[64];
int z1, z2, z3, z4, z5;
int tmp0, tmp1, tmp2, tmp3;
int tmp10, tmp11, tmp12, tmp13;
int pointer = 0;
for (int index = 8; index > 0; index--) {
if (dataBlockBuffer[pointer + 8] == 0 && dataBlockBuffer[pointer + 16] == 0 && dataBlockBuffer[pointer + 24] == 0 && dataBlockBuffer[pointer + 32] == 0 && dataBlockBuffer[pointer + 40] == 0 && dataBlockBuffer[pointer + 48] == 0 && dataBlockBuffer[pointer + 56] == 0) {
int dcValue = dataBlockBuffer[pointer] << PASS1_BITS;
workSpace[pointer + 0] = dcValue;
workSpace[pointer + 8] = dcValue;
workSpace[pointer + 16] = dcValue;
workSpace[pointer + 24] = dcValue;
workSpace[pointer + 32] = dcValue;
workSpace[pointer + 40] = dcValue;
workSpace[pointer + 48] = dcValue;
workSpace[pointer + 56] = dcValue;
pointer++;
continue;
}
z2 = dataBlockBuffer[pointer + 16];
z3 = dataBlockBuffer[pointer + 48];
z1 = (z2 + z3) * FIX_0_541196100;
tmp2 = z1 + z3 * -FIX_1_847759065;
tmp3 = z1 + z2 * FIX_0_765366865;
z2 = dataBlockBuffer[pointer];
z3 = dataBlockBuffer[pointer + 32];
tmp0 = (z2 + z3) << BITS;
tmp1 = (z2 - z3) << BITS;
tmp10 = tmp0 + tmp3;
tmp13 = tmp0 - tmp3;
tmp11 = tmp1 + tmp2;
tmp12 = tmp1 - tmp2;
tmp0 = dataBlockBuffer[pointer + 56];
tmp1 = dataBlockBuffer[pointer + 40];
tmp2 = dataBlockBuffer[pointer + 24];
tmp3 = dataBlockBuffer[pointer + 8];
z1 = tmp0 + tmp3;
z2 = tmp1 + tmp2;
z3 = tmp0 + tmp2;
z4 = tmp1 + tmp3;
z5 = (z3 + z4) * FIX_1_175875602;
tmp0 = tmp0 * FIX_0_298631336;
tmp1 = tmp1 * FIX_2_053119869;
tmp2 = tmp2 * FIX_3_072711026;
tmp3 = tmp3 * FIX_1_501321110;
z1 = z1 * -FIX_0_899976223;
z2 = z2 * -FIX_2_562915447;
z3 = z3 * -FIX_1_961570560;
z4 = z4 * -FIX_0_390180644;
z3 += z5;
z4 += z5;
tmp0 += z1 + z3;
tmp1 += z2 + z4;
tmp2 += z2 + z3;
tmp3 += z1 + z4;
workSpace[pointer + 0] = ((tmp10 + tmp3 + (1 << F1)) >> F2);
workSpace[pointer + 56] = ((tmp10 - tmp3 + (1 << F1)) >> F2);
workSpace[pointer + 8] = ((tmp11 + tmp2 + (1 << F1)) >> F2);
workSpace[pointer + 48] = ((tmp11 - tmp2 + (1 << F1)) >> F2);
workSpace[pointer + 16] = ((tmp12 + tmp1 + (1 << F1)) >> F2);
workSpace[pointer + 40] = ((tmp12 - tmp1 + (1 << F1)) >> F2);
workSpace[pointer + 24] = ((tmp13 + tmp0 + (1 << F1)) >> F2);
workSpace[pointer + 32] = ((tmp13 - tmp0 + (1 << F1)) >> F2);
pointer++;
}
pointer = 0;
for (int index = 0; index < 8; index++) {
z2 = workSpace[pointer + 2];
z3 = workSpace[pointer + 6];
z1 = (z2 + z3) * FIX_0_541196100;
tmp2 = z1 + z3 * -FIX_1_847759065;
tmp3 = z1 + z2 * FIX_0_765366865;
tmp0 = (workSpace[pointer + 0] + workSpace[pointer + 4]) << BITS;
tmp1 = (workSpace[pointer + 0] - workSpace[pointer + 4]) << BITS;
tmp10 = tmp0 + tmp3;
tmp13 = tmp0 - tmp3;
tmp11 = tmp1 + tmp2;
tmp12 = tmp1 - tmp2;
tmp0 = workSpace[pointer + 7];
tmp1 = workSpace[pointer + 5];
tmp2 = workSpace[pointer + 3];
tmp3 = workSpace[pointer + 1];
z1 = tmp0 + tmp3;
z2 = tmp1 + tmp2;
z3 = tmp0 + tmp2;
z4 = tmp1 + tmp3;
z5 = (z3 + z4) * FIX_1_175875602;
tmp0 = tmp0 * FIX_0_298631336;
tmp1 = tmp1 * FIX_2_053119869;
tmp2 = tmp2 * FIX_3_072711026;
tmp3 = tmp3 * FIX_1_501321110;
z1 = z1 * -FIX_0_899976223;
z2 = z2 * -FIX_2_562915447;
z3 = z3 * -FIX_1_961570560;
z4 = z4 * -FIX_0_390180644;
z3 += z5;
z4 += z5;
tmp0 += z1 + z3;
tmp1 += z2 + z4;
tmp2 += z2 + z3;
tmp3 += z1 + z4;
data[pointer + 0] = (short) ((tmp10 + tmp3) >> F3);
data[pointer + 7] = (short) ((tmp10 - tmp3) >> F3);
data[pointer + 1] = (short) ((tmp11 + tmp2) >> F3);
data[pointer + 6] = (short) ((tmp11 - tmp2) >> F3);
data[pointer + 2] = (short) ((tmp12 + tmp1) >> F3);
data[pointer + 5] = (short) ((tmp12 - tmp1) >> F3);
data[pointer + 3] = (short) ((tmp13 + tmp0) >> F3);
data[pointer + 4] = (short) ((tmp13 - tmp0) >> F3);
pointer += 8;
}
short[] temp = imageSlice.MacroBlocks[macroBlockIndex].DataBlocks[dataBlockIndex];
for (int i = 0; i < data.length; i++)
temp[i] = data[i]; //imageSlice.MacroBlocks[macroBlockIndex].DataBlocks[dataBlockIndex][i] = data[i];
}
如果可以,我应该将基本数学结合起来,或者你建议什么?
答案 0 :(得分:1)
如果没有性能问题的具体证据,我可以看到几项可能的改进(虽然不确定你会从中获得多少改善)。
pointer+7
重复几次。你可以计算一次。System.arraycopy()
复制数组。 答案 1 :(得分:1)
我看不出任何明显的东西。除了亚历克斯所说的,还有两个小建议可能会有所帮助:
1)第一个循环中的长if
语句有许多失败条件。你有没有订购它,所以最有可能失败的是第一个?通过短路评估,您可以越早找到false
评估整个表达式的工作量越少。
2)你在两个for循环之外声明了很多变量,我可以看出为什么你这样做了。如果你在两个循环中移动声明,那么JVM可能更能够优化事物,因此变量被声明为尽可能在本地。
对于这两种情况,你需要做一些计时运行,看看它们是否真的有所不同。您可能还想使用分析器来查看代码花费大部分时间的位置。
我有另外一条评论。在以下行中:
data[pointer + 7] = (short) ((tmp10 - tmp3) >> F3);
您正在使用>>
而不是>>>
来对可能的负数进行位移。如果tmp3&gt;你确定这就是你想要做的吗? tmp10?
答案 2 :(得分:1)
与其他海报一样,我认为你可以做的很少就是优化它。
出于解压,我可能尝试在条件之前将所有值(dataBlockBuffer [x])读入局部变量,然后将条件更改为:
if((data0 | data1 | data2 | ...)== 0)...
这样,你可能有更少的分支和条件失败的地方(大概是大部分时间?)数据已经准备就绪了。
但是那说,我认为你不会刮胡子。
要考虑的另一个非常微小的事情是,如果您可以看到一种方法,那么在数组周围包装ByteBuffer可以让您一次读/写几个值。 (至少在多次写'dcValue'时你可以写一个long而不是JIT编译器没有进行这种优化)。
在桌面上,我可能已经考虑过:
无论如何,我会认为最后两种选择是“极端”的选择 - 有可能投入大量时间以获取微不足道的收益。