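Java - Optimizing a function that consists mostly of simple math

Time: 2011-07-11 16:52:18

Tags: java android optimization mathematical-optimization

I'm supposed to try to optimize this method for my team, which is writing a video decoder in Java, though I don't see any good way of doing so. The function below doesn't look like it can be sped up significantly, since it consists mainly of simple addition, subtraction, and so on.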

void inverseTransform(int macroBlockIndex, int dataBlockIndex) {
    int[] workSpace = new int[64];
    short[] data = new short[64];

    int z1, z2, z3, z4, z5;
    int tmp0, tmp1, tmp2, tmp3;
    int tmp10, tmp11, tmp12, tmp13;

    int pointer = 0;

    for (int index = 8; index > 0; index--) {
        if (dataBlockBuffer[pointer + 8] == 0 && dataBlockBuffer[pointer + 16] == 0 && dataBlockBuffer[pointer + 24] == 0 && dataBlockBuffer[pointer + 32] == 0 && dataBlockBuffer[pointer + 40] == 0 && dataBlockBuffer[pointer + 48] == 0 && dataBlockBuffer[pointer + 56] == 0) {
            int dcValue = dataBlockBuffer[pointer] << PASS1_BITS;

            workSpace[pointer + 0] = dcValue;
            workSpace[pointer + 8] = dcValue;
            workSpace[pointer + 16] = dcValue;
            workSpace[pointer + 24] = dcValue;
            workSpace[pointer + 32] = dcValue;
            workSpace[pointer + 40] = dcValue;
            workSpace[pointer + 48] = dcValue;
            workSpace[pointer + 56] = dcValue;

            pointer++;
            continue;
        }

        z2 = dataBlockBuffer[pointer + 16];
        z3 = dataBlockBuffer[pointer + 48];

        z1 = (z2 + z3) * FIX_0_541196100;
        tmp2 = z1 + z3 * -FIX_1_847759065;
        tmp3 = z1 + z2 * FIX_0_765366865;

        z2 = dataBlockBuffer[pointer];
        z3 = dataBlockBuffer[pointer + 32];

        tmp0 = (z2 + z3) << BITS;
        tmp1 = (z2 - z3) << BITS;

        tmp10 = tmp0 + tmp3;
        tmp13 = tmp0 - tmp3;
        tmp11 = tmp1 + tmp2;
        tmp12 = tmp1 - tmp2;

        tmp0 = dataBlockBuffer[pointer + 56];
        tmp1 = dataBlockBuffer[pointer + 40];
        tmp2 = dataBlockBuffer[pointer + 24];
        tmp3 = dataBlockBuffer[pointer + 8];

        z1 = tmp0 + tmp3;
        z2 = tmp1 + tmp2;
        z3 = tmp0 + tmp2;
        z4 = tmp1 + tmp3;
        z5 = (z3 + z4) * FIX_1_175875602;

        tmp0 = tmp0 * FIX_0_298631336;
        tmp1 = tmp1 * FIX_2_053119869;
        tmp2 = tmp2 * FIX_3_072711026;
        tmp3 = tmp3 * FIX_1_501321110;
        z1 = z1 * -FIX_0_899976223;
        z2 = z2 * -FIX_2_562915447;
        z3 = z3 * -FIX_1_961570560;
        z4 = z4 * -FIX_0_390180644;

        z3 += z5;
        z4 += z5;

        tmp0 += z1 + z3;
        tmp1 += z2 + z4;
        tmp2 += z2 + z3;
        tmp3 += z1 + z4;

        workSpace[pointer + 0] = ((tmp10 + tmp3 + (1 << F1)) >> F2);
        workSpace[pointer + 56] = ((tmp10 - tmp3 + (1 << F1)) >> F2);
        workSpace[pointer + 8] = ((tmp11 + tmp2 + (1 << F1)) >> F2);
        workSpace[pointer + 48] = ((tmp11 - tmp2 + (1 << F1)) >> F2);
        workSpace[pointer + 16] = ((tmp12 + tmp1 + (1 << F1)) >> F2);
        workSpace[pointer + 40] = ((tmp12 - tmp1 + (1 << F1)) >> F2);
        workSpace[pointer + 24] = ((tmp13 + tmp0 + (1 << F1)) >> F2);
        workSpace[pointer + 32] = ((tmp13 - tmp0 + (1 << F1)) >> F2);

        pointer++;
    }

    pointer = 0;

    for (int index = 0; index < 8; index++) {
        z2 = workSpace[pointer + 2];
        z3 = workSpace[pointer + 6];

        z1 = (z2 + z3) * FIX_0_541196100;
        tmp2 = z1 + z3 * -FIX_1_847759065;
        tmp3 = z1 + z2 * FIX_0_765366865;

        tmp0 = (workSpace[pointer + 0] + workSpace[pointer + 4]) << BITS;
        tmp1 = (workSpace[pointer + 0] - workSpace[pointer + 4]) << BITS;

        tmp10 = tmp0 + tmp3;
        tmp13 = tmp0 - tmp3;
        tmp11 = tmp1 + tmp2;
        tmp12 = tmp1 - tmp2;

        tmp0 = workSpace[pointer + 7];
        tmp1 = workSpace[pointer + 5];
        tmp2 = workSpace[pointer + 3];
        tmp3 = workSpace[pointer + 1];

        z1 = tmp0 + tmp3;
        z2 = tmp1 + tmp2;
        z3 = tmp0 + tmp2;
        z4 = tmp1 + tmp3;

        z5 = (z3 + z4) * FIX_1_175875602;

        tmp0 = tmp0 * FIX_0_298631336;
        tmp1 = tmp1 * FIX_2_053119869;
        tmp2 = tmp2 * FIX_3_072711026;
        tmp3 = tmp3 * FIX_1_501321110;
        z1 = z1 * -FIX_0_899976223;
        z2 = z2 * -FIX_2_562915447;
        z3 = z3 * -FIX_1_961570560;
        z4 = z4 * -FIX_0_390180644;

        z3 += z5;
        z4 += z5;

        tmp0 += z1 + z3;
        tmp1 += z2 + z4;
        tmp2 += z2 + z3;
        tmp3 += z1 + z4;

        data[pointer + 0] = (short) ((tmp10 + tmp3) >> F3);
        data[pointer + 7] = (short) ((tmp10 - tmp3) >> F3);
        data[pointer + 1] = (short) ((tmp11 + tmp2) >> F3);
        data[pointer + 6] = (short) ((tmp11 - tmp2) >> F3);
        data[pointer + 2] = (short) ((tmp12 + tmp1) >> F3);
        data[pointer + 5] = (short) ((tmp12 - tmp1) >> F3);
        data[pointer + 3] = (short) ((tmp13 + tmp0) >> F3);
        data[pointer + 4] = (short) ((tmp13 - tmp0) >> F3);

        pointer += 8;
    }
    short[] temp = imageSlice.MacroBlocks[macroBlockIndex].DataBlocks[dataBlockIndex];
    for (int i = 0; i < data.length; i++)
        temp[i] = data[i]; //imageSlice.MacroBlocks[macroBlockIndex].DataBlocks[dataBlockIndex][i] = data[i];
}

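Should I combine the basic math where possible, or what would you suggest?

3 answers:

Answer 0: (score: 1)

Without concrete evidence of a performance problem, I can see a few possible improvements (though I'm not sure how much you would gain from them).

  1. Replace repeated calculations with variables. For example, pointer+7 is repeated several times. You can compute it once.
  2. Use System.arraycopy() to copy the array.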
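
The second suggestion can be sketched as follows. This is a minimal illustration, not the decoder's actual code: the class name BlockCopy and the helper copyBlock are hypothetical, and target stands in for imageSlice.MacroBlocks[macroBlockIndex].DataBlocks[dataBlockIndex] from the question.

```java
import java.util.Arrays;

final class BlockCopy {
    /** Copies a whole block with one bulk call instead of an element-by-element loop. */
    static void copyBlock(short[] data, short[] target) {
        System.arraycopy(data, 0, target, 0, data.length);
    }

    public static void main(String[] args) {
        short[] data = new short[64];
        for (int i = 0; i < data.length; i++) {
            data[i] = (short) (i - 32); // arbitrary sample values
        }
        short[] target = new short[64]; // hypothetical stand-in for the destination block
        copyBlock(data, target);
        System.out.println(Arrays.equals(data, target)); // prints "true"
    }
}
```

Whether this is measurably faster than the loop for only 64 elements depends on the runtime, so it is worth timing before and after.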

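Answer 1: (score: 1)

I can't see anything obvious. Apart from what Alex said, here are two minor suggestions that might help:

1) The long if statement in the first loop has many conditions that can fail. Have you ordered them so that the one most likely to fail comes first? With short-circuit evaluation, the sooner you hit a false, the less work it takes to evaluate the whole expression.

2) You declare a lot of variables outside the two for loops, and I can see why you did that. If you move the declarations inside the two loops, the JVM may be better able to optimize things, because the variables are then declared as locally as possible.

For both of these, you will need to do some timing runs to see whether they actually make a difference. You may also want to use a profiler to see where the code spends most of its time.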
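
A rough timing run could look like the sketch below. The class TimingSketch and the method averageNanos are hypothetical names, and the loop in main is only a placeholder for the variant being measured. Hand-rolled timings like this are easily distorted by JIT warm-up and dead-code elimination, so treat the numbers as a coarse indication rather than a precise measurement.

```java
final class TimingSketch {
    /** Runs the task warmupRuns times untimed, then returns the average nanoseconds per timed run. */
    static long averageNanos(Runnable task, int warmupRuns, int timedRuns) {
        for (int i = 0; i < warmupRuns; i++) {
            task.run(); // give the JIT a chance to compile the hot path
        }
        long start = System.nanoTime();
        for (int i = 0; i < timedRuns; i++) {
            task.run();
        }
        return (System.nanoTime() - start) / timedRuns;
    }

    public static void main(String[] args) {
        final int[] sink = new int[1]; // keeps the result observable so the work is not optimized away
        Runnable candidate = () -> {
            int sum = 0;
            for (int i = 0; i < 64; i++) {
                sum += i * 3; // placeholder for the code variant under test
            }
            sink[0] += sum;
        };
        System.out.println(averageNanos(candidate, 10_000, 100_000) + " ns per run");
    }
}
```

Timing each variant of the method the same way, on the target device, gives a like-for-like comparison.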

I have one other comment. In the following line:

data[pointer + 7] = (short) ((tmp10 - tmp3) >> F3);

You are using >> rather than >>> to shift a number that may be negative. Are you sure that is what you want when tmp3 > tmp10?
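
The difference between the two operators can be seen in a small sketch. The class ShiftDemo and the shift amount 3 are illustrative assumptions; the real value of F3 is not shown in the question. For signed sample values, the sign-preserving >> is usually the one that is wanted, but that is for the author to confirm.

```java
final class ShiftDemo {
    /** Arithmetic shift: copies the sign bit into the vacated high bits. */
    static int arithmeticShift(int value, int bits) {
        return value >> bits;
    }

    /** Logical shift: fills the vacated high bits with zeros. */
    static int logicalShift(int value, int bits) {
        return value >>> bits;
    }

    public static void main(String[] args) {
        int difference = -40; // e.g. tmp10 - tmp3 when tmp3 > tmp10
        int shift = 3;        // hypothetical stand-in for F3
        System.out.println(arithmeticShift(difference, shift)); // prints -5
        System.out.println(logicalShift(difference, shift));    // prints 536870907
    }
}
```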

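Answer 2: (score: 1)

Like the other posters, I think there is very little you can do to optimize this.

As a long shot, I might try reading all the values (dataBlockBuffer[x]) into local variables before the condition, and then changing the condition to: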

  

if ((data0 | data1 | data2 | ...) == 0) ...

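That way you may have fewer branches, and when the condition fails (presumably most of the time?) the data is already loaded and ready to use.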
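
A minimal sketch of that combined test follows. It assumes dataBlockBuffer is a short[] (its type is not shown in the question); the class ZeroColumnCheck and the method acTermsAreZero are hypothetical names.

```java
final class ZeroColumnCheck {
    /**
     * Returns true when the seven coefficients below the first row of the
     * column starting at pointer are all zero, using a single comparison.
     */
    static boolean acTermsAreZero(short[] buffer, int pointer) {
        // Load everything into locals first so the values can be reused
        // by the full transform when the test fails.
        int d1 = buffer[pointer + 8];
        int d2 = buffer[pointer + 16];
        int d3 = buffer[pointer + 24];
        int d4 = buffer[pointer + 32];
        int d5 = buffer[pointer + 40];
        int d6 = buffer[pointer + 48];
        int d7 = buffer[pointer + 56];
        // The bitwise OR is zero only if every operand is zero.
        return (d1 | d2 | d3 | d4 | d5 | d6 | d7) == 0;
    }

    public static void main(String[] args) {
        short[] block = new short[64];
        block[0] = 100; // only the DC term of column 0 is set
        System.out.println(acTermsAreZero(block, 0)); // prints "true"
        block[40] = -1; // a non-zero coefficient further down column 0
        System.out.println(acTermsAreZero(block, 0)); // prints "false"
    }
}
```

Unlike the chained && test, this always reads all seven values, so it only pays off if the loads would have been needed anyway.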

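But that said, I don't think you'll shave much off.

Another very minor thing to consider: if you can see a way to do it, wrapping a ByteBuffer around the array may let you read or write several values at once. (At the very least, when writing 'dcValue' several times, you could write a long instead, in case the JIT compiler isn't already making that optimization.)

On a desktop, I might also have considered:

  • multithreading to process several blocks at once (most people have at least a dual-core machine these days, though I'd guess possibly not on Android);
  • looking at the output of the JIT compiler to make sure it isn't doing anything overly stupid, then reworking the code accordingly (you can do this with the debug JDK, but I don't know whether there is an equivalent for Android);
  • whether it is possible to go native, and then use the GPU or whatever sensible pragmas your C compiler may need in order to compile to specific SSE/equivalent instructions (I'm no expert on this and don't know how feasible it is in your case).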
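
The first option could be sketched roughly as below. Everything here is hypothetical: ParallelBlocks, transformAll and transformBlock are illustrative names, and transformBlock merely doubles each value as a placeholder for the real inverse transform. The sketch also assumes each task gets its own input and output arrays; the method in the question reads a shared dataBlockBuffer field, so it would need restructuring before it could run concurrently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ParallelBlocks {
    /** Placeholder per-block work; a stand-in for the real inverse transform. */
    static short[] transformBlock(short[] block) {
        short[] out = new short[block.length];
        for (int i = 0; i < block.length; i++) {
            out[i] = (short) (block[i] * 2);
        }
        return out;
    }

    /** Transforms every block on a fixed-size thread pool and returns the results in input order. */
    static List<short[]> transformAll(List<short[]> blocks, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<short[]>> tasks = new ArrayList<>();
            for (short[] block : blocks) {
                tasks.add(() -> transformBlock(block));
            }
            List<short[]> results = new ArrayList<>();
            // invokeAll returns the futures in the same order as the tasks.
            for (Future<short[]> future : pool.invokeAll(tasks)) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while transforming blocks", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("block transform failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

In a real decoder the pool would be created once and reused, since starting threads per call would likely cost more than transforming a handful of 8x8 blocks.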

In any case, I would regard the last two options as "extreme" choices: potentially a large investment of time for a negligible gain.