Unified Speech-Text Pretraining for Spoken Dialog Modeling

Authors

Abstract

While recent work shows promising results in expanding the capabilities of large language models (LLMs) to directly understand and synthesize speech, an LLM-based strategy for modeling spoken dialogs remains elusive and calls for further investigation. This work proposes an extensive speech-text LLM framework, named the Unified Spoken Dialog Model (USDM), to generate coherent spoken responses with organic prosodic features relevant to the given input speech without relying on automatic speech recognition (ASR) or text-to-speech (TTS) solutions. Our approach employs a multi-step speech-text inference scheme that leverages chain-of-reasoning capabilities exhibited by the underlying LLM. We also propose a generalized speech-text pretraining scheme that helps with capturing cross-modal semantics. Automatic and human evaluations show that the proposed approach is effective in generating natural-sounding spoken responses, outperforming both prior and cascaded baselines. Detailed comparative studies reveal that, although the cascaded approach is stronger in its individual components, joint speech-text modeling improves both robustness against recognition errors and speech quality. A demo is available at https://unifiedsdm.github.io.

Model Comparison (DailyTalk)

All DailyTalk samples are resampled to 16 kHz.
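
For readers who want to bring their own audio to the same rate before comparing it with these samples, the following is a minimal sketch of resampling to 16 kHz. It uses plain linear interpolation with no anti-aliasing filter, which is an assumption made for brevity and not necessarily how the samples on this page were produced; a band-limited resampler from an audio library would be preferable in practice. The function name `resample_linear` and the 48 kHz source rate are hypothetical.

```python
def resample_linear(samples, src_rate, dst_rate):
    """Resample a mono signal by linear interpolation.

    Illustrative only: no low-pass filtering is applied, so
    downsampling real audio this way can introduce aliasing.
    """
    if src_rate <= 0 or dst_rate <= 0:
        raise ValueError("sample rates must be positive")
    if not samples:
        return []

    n_out = max(1, round(len(samples) * dst_rate / src_rate))
    step = src_rate / dst_rate  # input samples advanced per output sample
    last = len(samples) - 1

    out = []
    for i in range(n_out):
        pos = i * step                 # fractional position in the input
        lo = min(int(pos), last)       # left neighbour
        hi = min(lo + 1, last)         # right neighbour (clamped at the end)
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out


# Hypothetical input: one second of a ramp signal at 48 kHz.
one_second = [float(i) for i in range(48000)]
resampled = resample_linear(one_second, 48000, 16000)
print(len(resampled))
```

Keeping every system at one sampling rate means that differences heard between the samples come from the models and not from the playback format.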

Model-generated responses

Sample 1

Input User Audio

User (Ground Truth): Not everyone. But a lot of people do, especially the young. It’s a fine place to spend an evening with friends or to make some new friends.

Generated Spoken Response

Ground Truth | USDM | From Scratch | Cascaded | SpeechGPT


Sample 2

Input User Audio

User (Ground Truth): I’m very well, Thank you. And you?

Generated Spoken Response

Ground Truth | USDM | From Scratch | Cascaded | SpeechGPT


Sample 3

Input User Audio

User (Ground Truth): Yes, do you like it?

Generated Spoken Response

Ground Truth | USDM | From Scratch | Cascaded | SpeechGPT


Ground-truth text responses

Sample 1

Input User Audio

User (Ground Truth): Linda? Is that you? I haven’t seen you in ages!

Text to generate: Hi George! It’s good to see you!

Generated Spoken Response

Ground Truth | USDM | From Scratch | Cascaded | SpeechGPT


Sample 2

Input User Audio

User (Ground Truth): We are all very proud of you.

Text to generate: I am very happy, too. It was a big game and I won.

Generated Spoken Response

Ground Truth | USDM | From Scratch | Cascaded | SpeechGPT


Sample 3

Input User Audio

User (Ground Truth): Ah, that’s all part of the fun. What do you think of these shorts?

Text to generate: They look really good on you. They look comfortable too.

Generated Spoken Response

Ground Truth | USDM | From Scratch | Cascaded | SpeechGPT


Multi-Turn Scenarios (Fisher)

None of the speakers below were seen during training.

Sample 1

Input Multi-turn Spoken Dialogues

A: Oh.
B: Yeah. So when we were in Florida, Orlando Magic, basketball was the big thing because of Shaquille O’Neil.
A: I see.
B: But then he got mad at everybody and left us and went to the Lakers. [LAUGH].
A: [LAUGH]. So you were kinda involved in basketball too, then?

Generated Spoken Response


Sample 2

Input Multi-turn Spoken Dialogues

B: Oh no [LAUGH] Did you switch over or try to press another thing?
A: No, it we were talking, and boom, she was gone and I was gone off her line.
B: Oh go- maybe she did something.
A: Oh.
B: Maybe she hung up or did something happened with her phone so they probably disconnected the both of you.
A: So I gotta start all over again. [LAUGH]
B: Yeah. [LAUGH] Oh gosh. Um, so what did you talk about last time was it just

Generated Spoken Response


Sample 3

Input Multi-turn Spoken Dialogues

B: How did you get into it?
A: Ah, I was in a forum and, ah, some guys were talking about it, so I checked it out and I just signed up for it. Thought, “What the heck”. Um, man this thing is, ah I can barely hear you. Can you hear me all right?
B: Yeah. [COUGH] But I’m on a cell phone, so maybe that’s why.

Generated Spoken Response


Sample 4

Input Multi-turn Spoken Dialogues

A: Oh.
B: So that’s been from Chapel Hill so it’s been a it’s been a big move for me and, you know, I guess the conversations kind of, ah, you know, it the friendship thing is it kind of connects here because I left a lot of friends behind. But, you know, I still try to keep in touch with them and I’ve made some new friends in LA and I’ve got a lot of friends up in San- San Francisco.
A: Okay. Are you mo- are you moving there for a job?

Generated Spoken Response



(Bonus) Single-Turn Scenarios with Expressive Dataset (Expresso)

Because no transcripts were available for this data, we obtained them with an automatic speech recognition API and used the resulting transcripts for training.

Sample 1

Input User Audio

User (Ground Truth): It’s a very good way to put it. Yeah. He was he was part of the family. No other there’s no other cat like like him. I don’t. don’t. Really imagine who’s ever. been replacing him.

Generated Spoken Response


Sample 2

Input User Audio

User (Ground Truth): Oh my gosh. Could he? Tell me about it. Remember when you used to make cookies and he would intentionally bat it around with his paws? He’d he’d get in the dough, and we’d have to throw everything out again.

Generated Spoken Response


Sample 3

Input User Audio

User (Ground Truth): Does anybody want to hear about this? This is not a good thing to tell people about.

Generated Spoken Response


Unit-to-Speech Reconstruction Analysis

We extracted XLS-R-based units from the original audio and then used only those units to reconstruct the audio three times.
These samples indicate what information is contained in the units.
We used speech from Expresso (Nguyen et al., 2023), Fisher (Cieri et al., 2004), and GigaSpeech (Chen et al., 2021) datasets.
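
As background, discrete speech units of this kind are commonly obtained by assigning each frame of encoder features to the nearest entry of a learned codebook, such as k-means centroids. The sketch below illustrates only that assignment step. The two-dimensional toy features, the three-entry codebook, and the function names are hypothetical stand-ins for real XLS-R features and are not the pipeline used for the samples on this page.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def frames_to_units(frames, centroids):
    """Map each feature frame to the index of its nearest centroid."""
    if not centroids:
        raise ValueError("codebook must contain at least one centroid")
    units = []
    for frame in frames:
        nearest = min(
            range(len(centroids)),
            key=lambda k: squared_distance(frame, centroids[k]),
        )
        units.append(nearest)
    return units


# Toy data (assumed for illustration): four frames, three centroids.
toy_frames = [[0.1, 0.0], [0.9, 1.1], [2.2, 1.9], [1.0, 0.8]]
toy_codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
print(frames_to_units(toy_frames, toy_codebook))  # [0, 1, 2, 1]
```

Because many different frames map to the same index, whatever the index does not determine has to be filled in again at reconstruction time. Listening to several reconstructions of the same unit sequence is one way to hear which properties stay fixed and which vary.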

Sample 1

Ground Truth | Reconstructed Audio 1 | Reconstructed Audio 2 | Reconstructed Audio 3

Sample 2

Ground Truth | Reconstructed Audio 1 | Reconstructed Audio 2 | Reconstructed Audio 3

Sample 3

Ground Truth | Reconstructed Audio 1 | Reconstructed Audio 2 | Reconstructed Audio 3