Paralinguistics-Aware Speech-Empowered Large Language Models for Natural Conversation
Authors
- Heeseung Kim gmltmd789@snu.ac.kr
- Soonshin Seo sunshin.seo@navercorp.com
- Kyeongseok Jeong ks.jeong@navercorp.com
- Ohsung Kwon osung.kwon@navercorp.com
- Soyoon Kim soyoon.kim@navercorp.com
- Jungwhan Kim jungwhan.kim@navercorp.com
- Jaehong Lee jaehong.l@navercorp.com
- Eunwoo Song eunwoo.song@navercorp.com
- Myungwoo Oh myungwoo.oh@navercorp.com
- Jung-Woo Ha jungwoo.ha@navercorp.com
- Sungroh Yoon (Corresponding author) sryoon@snu.ac.kr
- Kang Min Yoo (Corresponding author) kangmin.yoo@navercorp.com
Abstract
Recent work shows promising results in expanding the capabilities of large language models (LLMs) to directly understand and synthesize speech. However, an LLM-based strategy for modeling spoken dialogs remains elusive, calling for further investigation. This paper introduces an extensive speech-text LLM framework, the Unified Spoken Dialog Model (USDM), designed to generate coherent spoken responses with naturally occurring prosodic features relevant to the given input speech without relying on explicit automatic speech recognition (ASR) or text-to-speech (TTS) systems. We have verified the inclusion of prosody in speech tokens that predominantly contain semantic information and have used this foundation to construct a prosody-infused speech-text model. Additionally, we propose a generalized speech-text pretraining scheme that enhances the capture of cross-modal semantics. To construct USDM, we fine-tune our speech-text model on spoken dialog data using a multi-step spoken dialog template that stimulates the chain-of-reasoning capabilities exhibited by the underlying LLM. Automatic and human evaluations on the DailyTalk dataset demonstrate that our approach effectively generates natural-sounding spoken responses, surpassing previous and cascaded baselines. Our code and checkpoints are available at https://github.com/naver-ai/usdm.
Additional Samples For Various Scenarios (Appendix A.1)
Expressive Spoken Dialog Dataset (Expresso)
Since Expresso provides no transcripts, we used whisper-large-v3 to transcribe the dataset and used the resulting transcripts for training.
Sample 1
Input User Audio
User (Ground Truth): It’s a very good way to put it. Yeah. He was he was part of the family. No other there’s no other cat like like him. I don’t. don’t. Really imagine who’s ever. been replacing him.
Generated Text Response: Yeah. I mean, I know he was a bit of a pain in the butt, but, I mean, I don’t think anybody could ever replace him. He was just that one of a kind of
Generated Spoken Response
Sample 2
Input User Audio
User (Ground Truth): Oh my gosh. Could he? Tell me about it. Remember when you used to make cookies and he would intentionally bat it around with his paws? He’d he’d get in the dough, and we’d have to throw everything out again.
Generated Text Response: Yeah. That was a lot of wasted dough and a lot of waste time and energy. That was not a good stretch of our resources here on Earth. Yeah.
Generated Spoken Response
Sample 3
Input User Audio
User (Ground Truth): Does anybody want to hear about this? This is not a good thing to tell people about.
Generated Text Response: I don’t think it does sound like a good thing to tell people about.
Generated Spoken Response
Multi-Turn Scenarios (Fisher)
None of the speakers below appear in the training data.
Sample 1
Input Multi-turn Spoken Dialogues
A: Oh.
B: Yeah. So when we were in Florida, Orlando Magic, basketball was the big thing because of Shaquille O’Neil.
A: I see.
B: But then he got mad at everybody and left us and went to the Lakers. [LAUGH].
A: [LAUGH]. So you were kinda involved in basketball too, then?
Generated Text Response: Yeah, yeah. And um, but then when he went to the Lakers, that was kinda sad. It’s like, oh, no. [LAUGH].
Generated Spoken Response
Sample 2
Input Multi-turn Spoken Dialogues
B: Oh no [LAUGH] Did you switch over or try to press another thing?
A: No, it we were talking, and boom, she was gone and I was gone off her line.
B: Oh go- maybe she did something.
A: Oh.
B: Maybe she hung up or did something happened with her phone so they probably disconnected the both of you.
A: So I gotta start all over again. [LAUGH]
B: Yeah. [LAUGH] Oh gosh. Um, so what did you talk about last time was it just
Generated Text Response: Oh, it was about sports.
Generated Spoken Response
Sample 3
Input Multi-turn Spoken Dialogues
B: How did you get into it?
A: Ah, I was in a forum and, ah, some guys were talking about it, so I checked it out and I just signed up for it. Thought, “What the heck”. Um, man this thing is, ah I can barely hear you. Can you hear me all right?
B: Yeah. [COUGH] But I’m on a cell phone, so maybe that’s why.
Generated Text Response: Oh, okay. Yeah, maybe because I keep getting disconnected.
Generated Spoken Response
Sample 4
Input Multi-turn Spoken Dialogues
A: Oh.
B: So that’s been from Chapel Hill so it’s been a it’s been a big move for me and, you know, I guess the conversations kind of, ah, you know, it the friendship thing is it kind of connects here because I left a lot of friends behind. But, you know, I still try to keep in touch with them and I’ve made some new friends in LA and I’ve got a lot of friends up in San- San Francisco.
A: Okay. Are you mo- are you moving there for a job?
Generated Text Response: No. I just kinda I just kinda wanted to go to college up there. I wanted to go to college in San Jose. So.
Generated Spoken Response
Model Comparison (DailyTalk, Section 4.1.2)
All DailyTalk samples are resampled to 16 kHz.
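Resampling to 16 kHz can be done in several ways; the page does not specify the tooling used, so the following is only an illustrative sketch using `scipy.signal.resample_poly`, assuming a 22.05 kHz source rate for DailyTalk audio:

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly

def resample_to_16k(audio: np.ndarray, orig_sr: int, target_sr: int = 16000) -> np.ndarray:
    """Polyphase resampling from orig_sr to target_sr."""
    g = gcd(target_sr, orig_sr)
    return resample_poly(audio, target_sr // g, orig_sr // g)

# Example: one second of a 440 Hz tone at an assumed 22.05 kHz source rate.
sr_in = 22050
t = np.linspace(0, 1, sr_in, endpoint=False)
tone = np.sin(2 * np.pi * 440 * t)
resampled = resample_to_16k(tone, sr_in)
print(len(resampled))  # 16000 samples, i.e., one second at 16 kHz
```

Polyphase resampling applies an anti-aliasing filter during rate conversion, which matters here because frequencies above 8 kHz in the source would otherwise alias into the 16 kHz output.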
Spoken responses (with model-generated text responses)
Sample 1
Input User Audio
User: thank you. so what do you think? would we have a good time?
Audio: Ground Truth | USDM | From Scratch | Cascaded | SpeechGPT
Sample 2
Input User Audio
User: the stone is a prehistoric log. it was turned to stone millions of years ago.
Audio: Ground Truth | USDM | From Scratch | Cascaded | SpeechGPT
Sample 3
Input User Audio
User: great. shall i meet you there around seven forty-five tomorrow night?
Audio: Ground Truth | USDM | From Scratch | Cascaded | SpeechGPT
Spoken responses (with ground-truth text responses)
Sample 1
Input User Audio
User: are you even glad that we get to buy new clothes too?
Text to generate: ah, that’s all part of the fun. what do you think of these shorts?
Audio: Ground Truth | USDM | From Scratch | Cascaded | SpeechGPT
Sample 2
Input User Audio
User: the checkroom is just behind the front door.
Text to generate: ok. i will deposit my bag right now.
Audio: Ground Truth | USDM | From Scratch | Cascaded | SpeechGPT
Sample 3
Input User Audio
User: a garage sale… that’s where people check out cars, right?
Text to generate: what? cars? umm are you serious?
Audio: Ground Truth | USDM | From Scratch | Cascaded | SpeechGPT
Unit-to-Speech Reconstruction Analysis (Section 3.1)
We extracted XLS-R-based units from the original audio and then used only those units to reconstruct the audio three times.
These samples illustrate what information the units contain.
We used speech from Expresso (Nguyen et al., 2023), Fisher (Cieri et al., 2004), and GigaSpeech (Chen et al., 2021) datasets.
Sample 1
Audio: Ground Truth | Reconstructed Audio 1 | Reconstructed Audio 2 | Reconstructed Audio 3
Sample 2
Audio: Ground Truth | Reconstructed Audio 1 | Reconstructed Audio 2 | Reconstructed Audio 3
Sample 3
Audio: Ground Truth | Reconstructed Audio 1 | Reconstructed Audio 2 | Reconstructed Audio 3
BibTeX
@inproceedings{
kim2024paralinguisticsaware,
title={Paralinguistics-Aware Speech-Empowered Large Language Models for Natural Conversation},
author={Heeseung Kim and Soonshin Seo and Kyeongseok Jeong and Ohsung Kwon and Soyoon Kim and Jungwhan Kim and Jaehong Lee and Eunwoo Song and Myungwoo Oh and Jung-Woo Ha and Sungroh Yoon and Kang Min Yoo},
booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
year={2024},
url={https://openreview.net/forum?id=NjewXJUDYq}
}