Question

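I'm trying to add gamma correction to my rendering engine. I have two problems:

1) Math.pow is really slow (relative to being called thousands of times per second). So I would need to create a precomputed gamma table that can be accessed instead of calculating on the fly. (This is extra information, not the actual problem.)

2) Currently, I can only apply the gamma by unpacking the integer pixel, replacing the RGBA channels with their corresponding gamma-modified values, then repacking the pixel and sending it back to the image buffer. The performance hit isn't terrible... but it drops a solid 60fps fixed time step down to around 40fps (with several images rendered).

I tried implementing the integer unpacking/packing in native code, only to see no performance improvement and to crash the VM (probably a memory checking error, but I don't really care to fix it right now).

Is there a way to apply gamma without unpacking/packing the pixel? If not, what method would you suggest for doing this?

N.B. Do not say use a BufferedImageOp. It's slow and can only operate on an entire image (I need pixel-specific).

Additional info:

Pixel packing: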

public static int[] unpackInt(int argb, int type) {
    int[] vals = null;
    int p1 = 0;
    int p2 = 1;
    int p3 = 2;
    int p4 = 3;
    switch (type) {
    case TYPE_RGB:
        vals = new int[3];
        vals[p1] = argb >> 16 & 0xFF;
        vals[p2] = argb >> 8 & 0xFF;
        vals[p3] = argb & 0xFF;
        break;
    case TYPE_RGBA:
    case TYPE_ARGB:
        vals = new int[4];
        vals[p4] = argb & 0xFF;
        vals[p3] = argb >> 8 & 0xFF;
        vals[p2] = argb >> 16 & 0xFF;
        vals[p1] = argb >> 24 & 0xFF;
        break;
    default:
        throw (new IllegalArgumentException(
                "type must be a valid field defined by ColorUtils class"));
    }
    return vals;
}

public static int packInt(int... rgbs) {

    if (rgbs.length != 3 && rgbs.length != 4) {
        throw (new IllegalArgumentException(
                "args must be valid RGB, ARGB or RGBA value."));
    }
    int color = rgbs[0];
    for (int i = 1; i < rgbs.length; i++) {
        color = (color << 8) + rgbs[i];
    }
    return color;
}

I scrapped the code previously, but I was using this algorithm for gamma correction:

protected int correctGamma(int pixel, float gamma) {
    float ginv = 1 / gamma;
    int[] rgbVals = ColorUtils.unpackInt(pixel, ColorUtils.TYPE_ARGB);
    for(int i = 0; i < rgbVals.length; i++) {
        // Scale the normalized, gamma-adjusted value back to 0..255 (multiply, not subtract).
        rgbVals[i] = (int) Math.round(255 * Math.pow(rgbVals[i] / 255.0, ginv));
    }
    return ColorUtils.packInt(rgbVals);
}

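Solution

I ended up combining a lot of the ideas proposed by GargantuChet into a system that seems to work quite well (no drop in performance).

A class called GammaTable is instantiated with a gamma value modifier (0.0-1.0 is darker, >1.0 is brighter). The constructor calls an internal method to build the gamma table for that value. This method is also used to reset the gamma later on: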

/**
 * Called when a new gamma value is set to rebuild the gamma table.
 */
private synchronized void buildGammaTable() {
    table = new int[TABLE_SIZE];
    float ginv = 1 / gamma;
    double colors = COLORS;
    for(int i=0;i<table.length;i++) {
        table[i] = (int) Math.round(colors * Math.pow(i / colors, ginv)); 
    }
}

To apply the gamma, GammaTable takes an integer pixel, unpacks it, looks up the modified gamma values, and returns the repacked integer:*

/**
 * Applies the current gamma table to the given integer pixel.
 * @param color the integer pixel to which gamma will be applied
 * @param type a pixel type defined by ColorUtils
 * @param rgbArr optional pre-instantiated array to use when unpacking.  May be null.
 * @return the modified pixel value
 */
public int applyGamma(int color, int type, int[] rgbArr) {
    int[] argb = (rgbArr != null) ? ColorUtils.unpackInt(rgbArr, color):ColorUtils.unpackInt(color, type);
    for(int i = 0; i < argb.length; i++) {
        int col = argb[i];
        argb[i] = table[col];
    }
    int newColor = ColorUtils.packInt(argb);
    return newColor;
}

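The applyGamma method is called for each pixel on the screen.

*It turns out that unpacking and repacking the pixels wasn't slowing anything down. For some reason, nesting the calls (i.e. ColorUtils.packInt(ColorUtils.unpackInt(...))) was causing the method to take much longer. Interestingly, I also had to stop using a pre-instantiated array with ColorUtils.unpackInt, because it seemed to cause a huge performance hit. Allowing the unpack method to create a new array on every call doesn't seem to affect performance in the current context.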

Answer 1

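I wonder whether the math operations are really what is causing the overhead. Each time you call unpackInt you create a new array, which the JVM has to allocate and initialize to zeros. This may be causing a lot of heap activity that really isn't needed.

You might consider an approach in which unpackInt takes the target array as a parameter. As a first pass, an example of use would look like: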

int[] rgbVals = new int[4];

protected int correctGamma(int pixel, float gamma) {
    float ginv = 1 / gamma;
    ColorUtils.unpackInt(pixel, ColorUtils.TYPE_ARGB, rgbVals);
    for(int i = 0; i < rgbVals.length; i++) {
        // Scale the normalized, gamma-adjusted value back to 0..255 (multiply, not subtract).
        rgbVals[i] = (int) Math.round(255 * Math.pow(rgbVals[i] / 255.0, ginv));
    }
    return ColorUtils.packInt(rgbVals);
}

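This would really reduce the object creation overhead, since you'd only create the new array once, as opposed to once per call to unpackInt (via correctGamma). The only caveat is that you can no longer use the array length when you repack the int. This could be solved easily enough by passing the type as a parameter to it, or by setting the unused element to 0 in the TYPE_RGB case in unpackInt: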

case TYPE_RGB:
    vals[p1] = 0;
    vals[p2] = argb >> 16 & 0xFF;
    vals[p3] = argb >> 8 & 0xFF;
    vals[p4] = argb & 0xFF;
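
To make the second option concrete, here is a minimal sketch of an unpackInt overload that writes into a caller-supplied array and always uses a fixed [A, R, G, B] layout, so that packInt no longer depends on the array length. The class name ColorUtilsSketch, the constant values, and the exact signatures are assumptions made for illustration; they are not taken from the asker's ColorUtils class.

```java
// Hypothetical stand-in for the asker's ColorUtils; names and constants are assumed.
final class ColorUtilsSketch {
    static final int TYPE_RGB = 0;
    static final int TYPE_ARGB = 1;

    /** Unpacks a pixel into dest as [A, R, G, B]; dest must hold at least 4 ints. */
    static void unpackInt(int argb, int type, int[] dest) {
        if (dest.length < 4) {
            throw new IllegalArgumentException("dest must have room for 4 channels");
        }
        switch (type) {
        case TYPE_RGB:
            dest[0] = 0;              // unused alpha slot is zeroed
            break;
        case TYPE_ARGB:
            dest[0] = argb >>> 24;    // unsigned shift keeps alpha in 0..255
            break;
        default:
            throw new IllegalArgumentException("unknown pixel type: " + type);
        }
        dest[1] = argb >> 16 & 0xFF;
        dest[2] = argb >> 8 & 0xFF;
        dest[3] = argb & 0xFF;
    }

    /** Repacks a fixed-length [A, R, G, B] array; no length check is needed. */
    static int packInt(int[] vals) {
        return vals[0] << 24 | vals[1] << 16 | vals[2] << 8 | vals[3];
    }
}
```

Because the array is reused between calls, a single instance like this should not be shared across threads without giving each thread its own buffer.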

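This might also be a great opportunity to create a more specialized class for gamma correction that encapsulates all of these behaviors: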

class ScreenContent {

    // ...

    GammaCorrector gammaCorrector = new GammaCorrector();

    // ...

    int[][] image;

    void correctGamma() {
        for (int[] row : image) {
            for (int i = 0; i < row.length; i++) {
                row[i] = gammaCorrector.correct(row[i], gamma);
            }
        }
    }
}

class GammaCorrector {
    private int[] unpacked = new int[4];

    public int correct(int pixel, float gamma) {
        float ginv = 1 / gamma;
        ColorUtils.unpackInt(pixel, ColorUtils.TYPE_ARGB, unpacked);
        for(int i = 0; i < unpacked.length; i++) {
            // Scale the normalized, gamma-adjusted value back to 0..255 (multiply, not subtract).
            unpacked[i] = (int) Math.round(255 * Math.pow(unpacked[i] / 255.0, ginv));
        }
        return ColorUtils.packInt(unpacked);
    }
}

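You could eliminate the array and the loop by creating a struct-like class to hold the unpacked values. The innermost for() loop is being executed hundreds of thousands of times per second, yet each time the loop executes it only runs for a few iterations. A modern CPU should handle this case very well, but it might still be worth trying out.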
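
A variation on this idea, sketched below, keeps the unpacked channels in local variables instead of a holder class and replaces Math.pow with a 256-entry lookup table, so the per-pixel path has no array allocation and no loop. The class name GammaLut is hypothetical, and leaving the alpha channel untouched is an assumption of this sketch (the code in the question applies gamma to every channel).

```java
// Hypothetical helper; not part of the asker's code base.
final class GammaLut {
    private final int[] table = new int[256];

    GammaLut(double gamma) {
        double ginv = 1.0 / gamma;
        for (int i = 0; i < table.length; i++) {
            // Normalize to 0..1, apply the exponent, scale back to 0..255.
            table[i] = (int) Math.round(255.0 * Math.pow(i / 255.0, ginv));
        }
    }

    /** Applies gamma to R, G and B of an ARGB pixel; alpha is passed through. */
    int apply(int argb) {
        int r = table[argb >> 16 & 0xFF];
        int g = table[argb >> 8 & 0xFF];
        int b = table[argb & 0xFF];
        return (argb & 0xFF000000) | r << 16 | g << 8 | b;
    }
}
```

A renderer would build one GammaLut when the gamma setting changes and call apply for each pixel. Whether this beats the array-based version on a given JVM is something to measure rather than assume.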

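You could also use a bounded thread pool to process the image rows in parallel. A bound of one thread per CPU core might make sense. Graphics hardware design focuses on the fact that the operations on each pixel are generally similar but independent, and such hardware relies on massive parallelism to achieve good performance.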
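
As a rough sketch of the thread-pool idea, the following submits one task per image row to an ExecutorService and waits for all of them to finish. The class name ParallelGamma, the int[][] row layout, and the precomputed 256-entry table parameter are assumptions for illustration; alpha is left unchanged here as well.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;

// Hypothetical helper; assumes ARGB pixels and a 256-entry gamma lookup table.
final class ParallelGamma {
    /** Gamma-corrects every row of the image in place, one pool task per row. */
    static void correct(int[][] image, int[] table, ExecutorService pool) {
        List<Callable<Void>> tasks = new ArrayList<>(image.length);
        for (int[] row : image) {
            tasks.add(() -> {
                for (int i = 0; i < row.length; i++) {
                    int p = row[i];
                    row[i] = (p & 0xFF000000)
                            | table[p >> 16 & 0xFF] << 16
                            | table[p >> 8 & 0xFF] << 8
                            | table[p & 0xFF];
                }
                return null;
            });
        }
        try {
            pool.invokeAll(tasks); // blocks until every row task has completed
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            throw new IllegalStateException("gamma pass interrupted", e);
        }
    }
}
```

The pool would typically be created once, for example with Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors()), and reused for every frame. For small images the scheduling overhead may outweigh the gain, so batching several rows per task could be worth trying.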

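Also consider using a debug build of the JVM to look at the generated instructions for better insight. Ideally you'd modify your code as little as possible, making changes only in places where the JVM is missing optimization opportunities.

If you end up moving to native code, you might consider using some of the SSE instructions where appropriate. I believe there are operations that work on packed integers, basically applying the same operation to each byte in the packed integer without having to unpack, compute, and repack. This could save a lot of time, but might involve changes to how you calculate gamma. The upside is that it's fast: a single SSE register lets you operate on 16 bytes in one instruction, and this kind of parallelism can be worth some effort to take advantage of.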

Answer 2

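Another way is to use OpenGL. (I think LWJGL would allow it in Java.) You could upload a 1D texture containing the linear-to-gamma-corrected table, then write a GLSL shader that applies the gamma table to your pixels. I'm not sure how this would fit into your current processing model, but I use it to process 1920x1080 HD RGBA frames in real time.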

Applying gamma correction to packed integer pixels
