Question

I have developed an application in C++ for streaming speech recognition, once using another API and once using the IBM Watson Speech to Text service API.

In both programs I am using the same file, which contains this audio:

Several tornadoes touch down as a line of severe thunderstorms swept through Colorado on Sunday

The file is 641,680 bytes in size, and I send it to the speech-to-text server in chunks of at most 100,000 bytes at a time.

Now, with the other API I am able to get everything recognized as a whole. With the IBM Watson API I cannot. Here is what I do:

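Connect to the IBM Watson web server (Speech to Text API)
Send the start frame {"action":"start","content-type":"audio/mulaw;rate=8000"}
Send 100,000 bytes of binary data
Send the stop frame {"action":"stop"}
...Repeat binary and stop until the last byte.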

The IBM Watson Speech API only recognizes the chunks individually,
for example:

several tornadoes touch down
a line of thunder
swept through Colorado
Sunday

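These appear to be the outputs of the individual chunks. Words that fall on a chunk boundary (here, for example, "thunderstorms" sits partly at the end of one chunk and partly at the start of the next) are therefore recognized incorrectly or dropped.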

What am I doing wrong?

EDIT (I am using C++ with the Boost library for the websocket interface)

//Do the websocket handshake 
void IbmWebsocketSession::on_ssl_handshake(beast::error_code ec) {

    auto mToken = mSttServiceObject->GetToken(); // Get the authentication token

    //Complete the websocket handshake and call back the "send_start" function
    mWebSocket.async_handshake_ex(mHost, mUrlEndpoint, [mToken](request_type& reqHead) {reqHead.insert(http::field::authorization,mToken);},
            bind(&IbmWebsocketSession::send_start, shared_from_this(), placeholders::_1));
}

//Send the start frame
void IbmWebsocketSession::send_start(beast::error_code ec) {

    //Send the START_FRAME and call back the "read_resp" function to receive the "state: listening" message
    mWebSocket.async_write(net::buffer(START_FRAME),
            bind(&IbmWebsocketSession::read_resp, shared_from_this(), placeholders::_1, placeholders::_2));
}

//Send the binary data
void IbmWebsocketSession::send_binary(beast::error_code ec) {

    streamsize bytes_read = mFilestream.rdbuf()->sgetn(&chunk[0], chunk.size()); //gets the binary data chunks from a file (which is being written at run time)

    // Send binary data
    if (bytes_read > mcMinsize) {  //Minimum size defined by IBM  is 100 bytes.
                                   // If chunk size is greater than 100 bytes, then send the data and then callback "send_stop" function
        mWebSocket.binary(true);

        /**********************************************************************
         *  Wait a second before writing the next chunk.
         **********************************************************************/
        this_thread::sleep_for(chrono::seconds(1));

        mWebSocket.async_write(net::buffer(&chunk[0], bytes_read),
                bind(&IbmWebsocketSession::send_stop, shared_from_this(), placeholders::_1));
    } else {                     //If chunk size is less than 100 bytes, then DO NOT send the data only call "send_stop" function
        shared_from_this()->send_stop(ec);
    }

}

void IbmWebsocketSession::send_stop(beast::error_code ec) {

    mWebSocket.binary(false);
    /*****************************************************************
     * Send the Stop message
     *****************************************************************/
    mWebSocket.async_write(net::buffer(mTextStop),
            bind(&IbmWebsocketSession::read_resp, shared_from_this(), placeholders::_1, placeholders::_2));
}

void IbmWebsocketSession::read_resp(beast::error_code ec, size_t bytes_transferred) {
    boost::ignore_unused(bytes_transferred);
        if(mWebSocket.is_open())
        {
            // Read the websocket response and call back the "display_buffer" function
            mWebSocket.async_read(mBuffer, bind(&IbmWebsocketSession::display_buffer, shared_from_this(),placeholders::_1));
        }
        else
            cerr << "Error: " << ec.message() << endl; // report the error_code passed to this handler

}

void IbmWebsocketSession::display_buffer(beast::error_code ec) {

    /*****************************************************************
     * Get the buffer into stringstream
     *****************************************************************/
    msWebsocketResponse << beast::buffers(mBuffer.data());

    mResponseTranscriptIBM = ParseTranscript(); //Parse the response transcript

    mBuffer.consume(mBuffer.size()); //Clear the websocket buffer

    if ("Listening" == mResponseTranscriptIBM && true != mSttServiceObject->IsGstFileWriteDone()) { // IsGstFileWriteDone -> checks if the user has stopped speaking
        shared_from_this()->send_binary(ec);
    } else {
        shared_from_this()->close_websocket(ec, 0);
    }
}

Answer 1

IBM Watson Speech to Text has several APIs to send audio and receive the transcribed text. Based on your description, you seem to be using the WebSocket interface.

For the WebSocket interface, you would open the connection (start), then send individual chunks of data, and, once everything has been transmitted, stop the recognition request.

You did not share code, but it seems that you are starting and stopping the request for each chunk. Stop only after the last chunk.

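I recommend taking a look at the API documentation, which contains samples in different languages. The Node.js sample shows how to register for events. There are also examples on GitHub, such as this WebSocket API with Python. Here is another one that shows the chunking.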

Answer 2

@data_henrik is correct, the flow is wrong. It should be: ... START FRAME >> binary data >> binary data >> binary data >> ... >> STOP FRAME

You only send the {"action":"stop"} message when there are no more audio chunks to send.

IBM Watson STT: How to use the Websocket interface with multiple chunks?
