我一直在使用java程序(由其他人开发)进行文本到语音合成。合成是通过“di-phones”的连接完成的。在oroginal版本中,没有信号处理。刚刚收集了双音素并将它们连接在一起以产生输出。为了改善输出,我尝试执行级联语音信号的“相位匹配”。我所做的修改总结如下:
以下是我在现有程序中更改的代码部分:
音频输入
每个双音素都会调用此段。
原始版本
audioInputStream = AudioSystem.getAudioInputStream(sound);
while ((cnt = audioInputStream.read(byteBuffer, 0, byteBuffer.length)) != -1) {
if (cnt > 0) {
byteArrayPlayStream.write(byteBuffer, 0, cnt);
}
}
我的版本
// public varialbe declarations
byte byteSoundFile[]; // byteSoundFile will contain a whole word or the diphones of a whole word
short shortSoundFile[] = new short[5000000]; // sound contents are taken in a short[] array for signal processing
short shortBuffer[];
int pos = 0;
int previousPM = 0;
boolean isWord = false;
public static HashMap<String, Integer> peakMap1 = new HashMap<String, Integer>();
public static HashMap<String, Integer> peakMap2 = new HashMap<String, Integer>();
// code for receiving and processing audio data
if(pos == 0) {
// a new word is going to be processed.
// so reset the shortSoundFile array
Arrays.fill(shortSoundFile, (short)0);
}
audioInputStream = AudioSystem.getAudioInputStream(sound);
while ((cnt = audioInputStream.read(byteBuffer, 0, byteBuffer.length)) != -1) {
if (cnt > 0) {
byteArrayPlayStream.write(byteBuffer, 0, cnt);
}
}
byteSoundFile = byteArrayPlayStream.toByteArray();
int nSamples = byteSoundFile.length;
byteArrayPlayStream.reset();
if(nSamples > 80000) { // it is a word
pos = nSamples;
isWord = true;
}
else { // it is a diphone
// audio data is converted from byte to short, so nSamples is halved
nSamples /= 2;
// transfer byteSoundFile contents to shortBuffer using byte-to-short conversion
shortBuffer = new short[nSamples];
for(int i=0; i<nSamples; i++) {
shortBuffer[i] = (short)((short)(byteSoundFile[i<<1]) << 8 | (short)byteSoundFile[(i<<1)+1]);
}
/************************************/
/**** phase-matching starts here ****/
/************************************/
int pm1 = 0;
int pm2 = 0;
String soundStr = sound.toString();
if(soundStr.contains("\\") && soundStr.contains(".")) {
soundStr = soundStr.substring(soundStr.indexOf("\\")+1, soundStr.indexOf("."));
}
if(peakMap1.containsKey(soundStr)) {
// perform overlap and add
System.out.println("we are here");
pm1 = peakMap1.get(soundStr);
pm2 = peakMap2.get(soundStr);
/*
Idea:
If pm1 is located after more than one third of the samples,
then threre will be too much overlapping.
If pm2 is located before the two third of the samples,
then where will also be extra overlapping for the next diphone.
In both of the cases, we will not perform the peak-matching operation.
*/
int idx1 = (previousPM == 0) ? pos : previousPM - pm1;
if((idx1 < 0) || (pm1 > (nSamples/3))) {
idx1 = pos;
}
int idx2 = idx1 + nSamples - 1;
for(int i=idx1, j=0; i<=idx2; i++, j++) {
if(i < pos) {
shortSoundFile[i] = (short) ((shortSoundFile[i] >> 1) + (shortBuffer[j] >> 1));
}
else {
shortSoundFile[i] = shortBuffer[j];
}
}
previousPM = (pm2 < (nSamples/3)*2) ? 0 : idx1 + pm2;
pos = idx2 + 1;
}
else {
// no peak found. simply concatenate the audio data
for(int i=0; i<nSamples; i++) {
shortSoundFile[pos++] = shortBuffer[i];
}
previousPM = 0;
}
音频输出
收集一个单词的所有双音素后,调用该段播放音频输出
原始版
byte audioData[] = byteArrayPlayStream.toByteArray();
... code for writing audioData to output steam
我的版本
byte audioData[];
if(isWord) {
audioData = Arrays.copyOf(byteSoundFile, pos);
isWord = false;
}
else {
audioData = new byte[pos*2];
for(int i=0; i<pos; i++) {
audioData[(i<<1)] = (byte) (shortSoundFile[i] >>> 8);
audioData[(i<<1)+1] = (byte) (shortSoundFile[i]);
}
}
pos = 0;
... code for writing audioData to output steam
但是在修改完成后,输出变得更糟。输出中有很多噪音。
以下是经过修改的示例音频:modified output
以下是原始版本的示例音频:original output
现在我很感激,如果有人能指出产生噪音的原因以及如何将其删除。我在代码中做错了吗?我已经在 Mablab 中测试了我的算法,它运行良好。
答案 0 :(得分:0)
问题已暂时解决。事实证明,byte
数组和short
数组之间的转换不是必需的。可以在byte
阵列上直接执行所需的信号处理操作
我想保持这个问题,以防有人发现给定代码中的错误。