Real-time transcription with Google Speech-to-Text

Date: 2021-05-13 13:50:40

Tags: node.js stream blob google-speech-api socket.io-stream

I want to build a live transcription app with Node.js and the Google Speech-to-Text API.

I am using RecordRTC and socket.io to send audio chunks to the backend server. I am currently recording 1-second chunks, and transcription works, but the API does not treat the input as a stream: it returns a response after processing each chunk. That means I get half sentences, and Google cannot use context to help itself recognize the speech.

My question is: how do I get Google to treat my chunks as a continuous stream? Or is there another solution that achieves the same result? (Namely, live transcription of microphone audio, or very close to live.)
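For context, Node treats everything written into a single long-lived stream as one continuous sequence of bytes; the per-chunk responses below happen because a new recognize stream is created (and ended) for every blob. A minimal sketch of the "one stream, many chunks" idea, using a `PassThrough` purely as a stand-in (no Google API involved):

```typescript
import { PassThrough } from "stream";

// One long-lived stream: every chunk written into it is seen by the
// consumer as part of a single continuous stream, so context is preserved.
const continuous = new PassThrough();
continuous.write(Buffer.from("first chunk "));
continuous.write(Buffer.from("second chunk"));

// read() with no size argument returns everything buffered so far, concatenated.
const combined = (continuous.read() as Buffer).toString();
// combined === "first chunk second chunk"
```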

Google has a demo on their website that does exactly what I want, so it should be possible.

My code (mostly from the selfservicekiosk-audio-streaming repo):

ss is socket.io-stream

Server side:

io.on("connect", (socket) => {
    socket.on("create-room", (data, cb) => createRoom(socket, data, cb))
    socket.on("disconnecting", () => exitFromRoom(socket))

    // receives the stream; gets called every 1 s with a new blob
    ss(socket).on("stream-speech", async function (stream: any, data: any) {
        const filename = path.basename("stream.wav")
        const writeStream = fs.createWriteStream(filename)

        stream.pipe(writeStream)
        speech.speechStreamToText(
            stream,
            async function (transcribeObj: any) {
                socket.emit("transcript", transcribeObj.transcript)
            }
        )
    })
})

async speechStreamToText(stream: any, cb: Function) {
        const sttRequest = {
            config: {
                languageCode: "en-US",
                sampleRateHertz: 16000,
                encoding: "WEBM_OPUS",
                enableAutomaticPunctuation: true,
            },
            singleUtterance: false,
        }

        // SpeechClient must be constructed with `new`
        const stt = new speechToText.SpeechClient()
        // set up the STT stream
        const recognizeStream = stt
            .streamingRecognize(sttRequest)
            .on("data", function (data: any) {
                // this gets called every second, and the transcription chunks usually make close to no sense
                console.log(data.results[0].alternatives)
            })
            .on("error", (e: any) => {
                console.log(e)
            })
            .on("end", () => {
                // this also gets called every second
                console.log("on end")
            })

        stream.pipe(recognizeStream)
        stream.on("end", function () {
            console.log("socket.io stream ended")
        })
    }
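The structural problem above is that `speechStreamToText` opens a fresh `streamingRecognize` request for every 1-second blob, so each request ends (and returns a result) as soon as its blob does. One way out would be to open the recognize stream once per socket connection and write every incoming chunk into the same stream. A hypothetical sketch of that session shape — `TranscriptionSession` is an invented name, and a `PassThrough` stands in for `stt.streamingRecognize(sttRequest)` so the sketch runs without credentials:

```typescript
import { PassThrough, Writable } from "stream";

// Hypothetical per-connection session: the recognizer stream is opened
// once, then reused for every audio chunk arriving over socket.io-stream.
class TranscriptionSession {
  private recognizeStream: Writable;

  constructor(recognizeStream: Writable) {
    // In the real server this would be stt.streamingRecognize(sttRequest).
    this.recognizeStream = recognizeStream;
  }

  // Called for every incoming blob, instead of starting a new request.
  addChunk(chunk: Buffer): void {
    this.recognizeStream.write(chunk);
  }

  // Called once, when the socket disconnects.
  end(): void {
    this.recognizeStream.end();
  }
}

const fakeRecognizer = new PassThrough(); // stand-in for the Google stream
const session = new TranscriptionSession(fakeRecognizer);
session.addChunk(Buffer.from("chunk-1 "));
session.addChunk(Buffer.from("chunk-2"));
session.end();
```

With this shape, the recognizer sees one uninterrupted stream for the whole connection rather than a series of one-second requests. (Note that Google's streaming API limits a single stream's duration, so a production version would also have to reopen the stream periodically.)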

Client side:

const sendBinaryStream = (blob: Blob) => {
    const stream = ss.createStream()
    ss(socket).emit("stream-speech", stream, {
        name: "_temp/stream.wav",
        size: blob.size,
    })
    ss.createBlobReadStream(blob).pipe(stream)
}
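The same per-chunk pattern exists on the client: `ss.createStream()` is called for every blob, so the server sees a new short stream each second. A hypothetical variant would create one outgoing stream when recording starts and write each blob's bytes into it; sketched here with a `PassThrough` in place of the socket.io-stream (the reuse pattern is an assumption, not tested against socket.io-stream itself):

```typescript
import { PassThrough } from "stream";

// Hypothetical: one outgoing stream for the whole recording session,
// reused by every ondataavailable callback instead of a fresh stream per blob.
function makeChunkSender(outgoing: PassThrough) {
  // `outgoing` stands in for a single ss.createStream() piped to the socket.
  return (chunk: Buffer): void => {
    outgoing.write(chunk);
  };
}

const wire = new PassThrough();
const sendChunk = makeChunkSender(wire);
sendChunk(Buffer.from("blob-a "));
sendChunk(Buffer.from("blob-b"));
```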

useEffect(() => {
        let recorder: any
        if (activeChat) {
            navigator.mediaDevices.getUserMedia({ audio: true, video: false }).then((stream) => {
                streamRef.current = stream
                recorder = new RecordRTC(stream, {
                    type: "audio",
                    mimeType: "audio/webm",
                    sampleRate: 44100,
                    desiredSampleRate: 16000,
                    timeSlice: 1000,
                    numberOfAudioChannels: 1,
                    recorderType: StereoAudioRecorder,
                    ondataavailable(blob: Blob) {
                        sendBinaryStream(blob)
                    },
                })
                recorder.startRecording()
            })
        }
        return () => {
            recorder?.stopRecording()
            streamRef.current?.getTracks().forEach((track) => track.stop())
        }
    }, [])

Any help is appreciated!

1 Answer:

Answer 0: (score: 0)

I have the same problem!

Maybe the official Google demo is using node-record-lpcm16 with SoX: https://cloud.google.com/speech-to-text/docs/streaming-recognize?hl=en