Question

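I have been working with a Java program (developed by someone else) for text-to-speech synthesis. The synthesis is done by concatenating diphones. In the original version there was no signal processing: the diphones were simply collected and concatenated to produce the output. To improve the output, I tried to perform "phase matching" of the concatenated speech signals. The modifications I made are summarized below: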

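Audio data is collected from the AudioInputStream into a byte array. Since the audio data is 16-bit, I convert the byte array to a short array.
The "signal processing" is done on the short array.
To output the audio data, the short array is converted back to a byte array.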
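
The round trip described in the three steps above can be sketched with java.nio.ByteBuffer. This is a minimal sketch, not the code from the program: the class name PcmConvert and its two methods are hypothetical, and it assumes 16-bit signed PCM whose byte order is passed in explicitly (in practice it could be read from audioInputStream.getFormat().isBigEndian()).

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper, not part of the original program.
class PcmConvert {
    // Interprets raw 16-bit signed PCM bytes as samples in the given byte order.
    static short[] toShorts(byte[] bytes, boolean bigEndian) {
        short[] samples = new short[bytes.length / 2];
        ByteBuffer.wrap(bytes)
                  .order(bigEndian ? ByteOrder.BIG_ENDIAN : ByteOrder.LITTLE_ENDIAN)
                  .asShortBuffer()
                  .get(samples);
        return samples;
    }

    // Serializes samples back to raw bytes in the same byte order.
    static byte[] toBytes(short[] samples, boolean bigEndian) {
        ByteBuffer buf = ByteBuffer.allocate(samples.length * 2)
                  .order(bigEndian ? ByteOrder.BIG_ENDIAN : ByteOrder.LITTLE_ENDIAN);
        buf.asShortBuffer().put(samples);
        return buf.array();
    }
}
```

Letting ByteBuffer do the combining avoids hand-written shifts and masks, and makes the byte order an explicit choice instead of an implicit assumption.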

Here are the parts of the code that I changed in the existing program:

Audio input
This segment is called for every diphone.

Original version

audioInputStream = AudioSystem.getAudioInputStream(sound);
while ((cnt = audioInputStream.read(byteBuffer, 0, byteBuffer.length)) != -1) {
    if (cnt > 0) {
        byteArrayPlayStream.write(byteBuffer, 0, cnt);
    }
}

My version

// public variable declarations
byte    byteSoundFile[];                             // byteSoundFile will contain a whole word or the diphones of a whole word
short   shortSoundFile[]    = new short[5000000];    // sound contents are taken in a short[] array for signal processing
short   shortBuffer[];
int     pos                 = 0;
int     previousPM          = 0;
boolean isWord              = false;
public static HashMap<String, Integer> peakMap1 = new HashMap<String, Integer>(); 
public static HashMap<String, Integer> peakMap2 = new HashMap<String, Integer>();

// code for receiving and processing audio data
if(pos == 0) {
    // a new word is going to be processed.
    // so reset the shortSoundFile array
    Arrays.fill(shortSoundFile, (short)0);
}

audioInputStream = AudioSystem.getAudioInputStream(sound);
while ((cnt = audioInputStream.read(byteBuffer, 0, byteBuffer.length)) != -1) {
    if (cnt > 0) {
        byteArrayPlayStream.write(byteBuffer, 0, cnt);
    }
}

byteSoundFile = byteArrayPlayStream.toByteArray();
int nSamples = byteSoundFile.length;
byteArrayPlayStream.reset();

if(nSamples > 80000) {   // it is a word
    pos     = nSamples;
    isWord  = true;
}
else {              // it is a diphone
    // audio data is converted from byte to short, so nSamples is halved
    nSamples /= 2;

    // transfer byteSoundFile contents to shortBuffer using byte-to-short conversion
    shortBuffer = new short[nSamples];
    for(int i=0; i<nSamples; i++) {
        shortBuffer[i] = (short)((short)(byteSoundFile[i<<1]) << 8 | (short)byteSoundFile[(i<<1)+1]);
    }

    /************************************/
    /**** phase-matching starts here ****/
    /************************************/
    int pm1 = 0;
    int pm2 = 0;
    String soundStr = sound.toString();
    if(soundStr.contains("\\") && soundStr.contains(".")) {
        soundStr = soundStr.substring(soundStr.indexOf("\\")+1, soundStr.indexOf("."));
    }                    
    if(peakMap1.containsKey(soundStr)) {
        // perform overlap and add
        System.out.println("we are here");
        pm1 = peakMap1.get(soundStr);
        pm2 = peakMap2.get(soundStr);

        /*
        Idea:
        If pm1 is located after more than one third of the samples,
        then there will be too much overlapping.
        If pm2 is located before two thirds of the samples,
        then there will also be extra overlapping for the next diphone.
        In both cases, we will not perform the peak-matching operation.
        */
        int idx1 = (previousPM == 0) ? pos : previousPM - pm1;
        if((idx1 < 0) || (pm1 > (nSamples/3))) {
            idx1 = pos;
        }
        int idx2 = idx1 + nSamples - 1;
        for(int i=idx1, j=0; i<=idx2; i++, j++) {
            if(i < pos) {
                shortSoundFile[i] = (short) ((shortSoundFile[i] >> 1) + (shortBuffer[j] >> 1));
            }
            else {
                shortSoundFile[i] = shortBuffer[j];
            }
        }
        previousPM = (pm2 < (nSamples/3)*2) ? 0 : idx1 + pm2;
        pos = idx2 + 1;
    }
    else {
        // no peak found. simply concatenate the audio data
        for(int i=0; i<nSamples; i++) {
            shortSoundFile[pos++] = shortBuffer[i];
        }
        previousPM = 0;
    }
}
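
The overlap loop above mixes the two signals with equal weights over the whole overlap, so the output level can change abruptly where the overlapped region begins and ends. One common alternative is a linear cross-fade. The sketch below illustrates that different technique; it is not the method used in the program. The class name CrossFade, the method overlapAdd and its parameters are hypothetical, and it assumes 0 <= start and that dst is large enough to hold the result.

```java
// Hypothetical helper illustrating a linear cross-fade, not the original code.
class CrossFade {
    /**
     * Writes src into dst beginning at index start. Positions below pos already
     * hold earlier audio and are cross-faded with src; the rest are copied.
     * Returns the new write position (start + src.length), as in the loop above.
     */
    static int overlapAdd(short[] dst, int pos, short[] src, int start) {
        // number of samples of src that land on already written audio
        int overlap = Math.max(0, Math.min(pos - start, src.length));
        for (int j = 0; j < src.length; j++) {
            int i = start + j;
            if (j < overlap) {
                // weight of the new signal ramps from near 0 to near 1
                double w = (double) (j + 1) / (overlap + 1);
                dst[i] = (short) Math.round((1.0 - w) * dst[i] + w * src[j]);
            } else {
                dst[i] = src[j];
            }
        }
        return start + src.length;
    }
}
```

In terms of the variables above, a call would look like pos = CrossFade.overlapAdd(shortSoundFile, pos, shortBuffer, idx1). Because each output sample is a weighted average of two 16-bit values, the result always fits in a short.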

Audio output
After all the diphones of a word have been collected, this segment is called to play the audio output.

Original version

byte audioData[] = byteArrayPlayStream.toByteArray();
... code for writing audioData to output stream

My version

byte audioData[];
if(isWord) {
    audioData = Arrays.copyOf(byteSoundFile, pos);
    isWord = false;
}
else {
    audioData = new byte[pos*2];
    for(int i=0; i<pos; i++) {
        audioData[(i<<1)]   = (byte) (shortSoundFile[i] >>> 8);
        audioData[(i<<1)+1] = (byte) (shortSoundFile[i]);
    }
}
pos = 0;
... code for writing audioData to output stream

But after the modification, the output became worse. There is a lot of noise in the output.

Here is a sample audio produced with the modification: modified output

Here is a sample audio from the original version: original output

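I would appreciate it if someone could point out what is causing the noise and how to remove it. Am I doing something wrong in the code? I have tested my algorithm in MATLAB and it worked fine there.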

Answer 1

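The problem is solved for the time being. It turned out that the conversion between the byte array and the short array was not necessary: the required signal processing operations can be performed directly on the byte array. I would like to keep this question open in case someone finds the bug in the given code.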
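
One candidate for the bug, offered as a sketch and not as a confirmed diagnosis: in the question's byte-to-short loop the low byte is cast to short without masking. Java bytes are signed, so a low byte of 0x80 or above is sign-extended, and the OR then overwrites the high byte with ones. The class name SignExtension and both method names are hypothetical; the byte order is the high-byte-first order that the question's code assumes.

```java
// Hypothetical demonstration class, not part of the original program.
class SignExtension {
    // Same expression shape as in the question: the low byte is sign-extended.
    static short unmasked(byte hi, byte lo) {
        return (short) ((short) hi << 8 | (short) lo);
    }

    // Masking the low byte keeps only its 8 bits before combining.
    static short masked(byte hi, byte lo) {
        return (short) ((hi << 8) | (lo & 0xFF));
    }
}
```

With hi = 0x12 and lo = (byte) 0x80, unmasked returns 0xFF80 while masked returns 0x1280. Roughly half of all samples have a low byte in the affected range, which would be consistent with broadband noise even on the plain concatenation path. It is also worth checking that the assumed byte order matches audioInputStream.getFormat().isBigEndian().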
