Paralinguistics-Aware Speech-Empowered Large Language Models for Natural Conversation

Overview of USDM

Emotion-Rich Single-Turn Conversations from Expresso

Sample 1

User Audio:

It's a very good way to put it. Yeah. He was he was part of the family. No other there's no other cat like like him. I don't. don't. Really imagine who's ever. been replacing him.

USDM Response:

Yeah. I mean, I know he was a bit of a pain in the butt, but, I mean, I don't think anybody could ever replace him. He was just that one of a kind.

Sample 2

User Audio:

Oh my gosh. Could he? Tell me about it. Remember when you used to make cookies and he would intentionally bat it around with his paws? He'd he'd get in the dough, and we'd have to throw everything out again.

USDM Response:

Yeah. That was a lot of wasted dough and a lot of wasted time and energy. That was not a good stretch of our resources here on Earth. Yeah.

Sample 3

User Audio:

Does anybody want to hear about this? This is not a good thing to tell people about.

USDM Response:

I don't think it does sound like a good thing to tell people about.

Multi-Turn Telephony Conversations from Fisher

Sample 1

Conversation History of Two UNSEEN Speakers:

Oh.

Yeah. So when we were in Florida, Orlando Magic, basketball was the big thing because of Shaquille O'Neil.

I see.

But then he got mad at everybody and left us and went to the Lakers. [LAUGH].

[LAUGH]. So you were kinda involved in basketball too, then?

USDM Response:

Yeah, yeah. And um, but then when he went to the Lakers, that was kinda sad. It's like, oh, no. [LAUGH].

Sample 2

Conversation History of Two UNSEEN Speakers:

Oh no [LAUGH] Did you switch over or try to press another thing?

No, it we were talking, and boom, she was gone and I was gone off her line.

Oh go- maybe she did something.

Oh.

Maybe she hung up or did something happened with her phone so they probably disconnected the both of you.

So I gotta start all over again. [LAUGH]

Yeah. [LAUGH] Oh gosh. Um, so what did you talk about last time was it just

USDM Response:

Oh, it was about sports.

Sample 3

Conversation History of Two UNSEEN Speakers:

How did you get into it?

Ah, I was in a forum and, ah, some guys were talking about it, so I checked it out and I just signed up for it. Thought, 'What the heck'. Um, man this thing is, ah I can barely hear you. Can you hear me all right?

Yeah. [COUGH] But I'm on a cell phone, so maybe that's why.

USDM Response:

Oh, okay. Yeah, maybe because I keep getting disconnected.

Sample 4

Conversation History of Two UNSEEN Speakers:

Oh.

So that's been from Chapel Hill so it's been a it's been a big move for me and, you know, I guess the conversations kind of, ah, you know, it the friendship thing is it kind of connects here because I left a lot of friends behind. But, you know, I still try to keep in touch with them and I've made some new friends in LA and I've got a lot of friends up in San- San Francisco.

Okay. Are you mo- are you moving there for a job?

USDM Response:

No. I just kinda I just kinda wanted to go to college up there. I wanted to go to college in San Jose. So.

Model Comparison for DailyTalk

Sample 1

User Audio:

Thank you. So what do you think? Would we have a good time?

Model Responses Comparison:

Ground Truth	USDM	From Scratch	Cascaded	SpeechGPT

Sample 2

User Audio:

The stone is a prehistoric log. It was turned to stone millions of years ago.

Model Responses Comparison:

Ground Truth	USDM	From Scratch	Cascaded	SpeechGPT

Sample 3

User Audio:

Great. Shall I meet you there around seven forty-five tomorrow night?

Model Responses Comparison:

Ground Truth	USDM	From Scratch	Cascaded	SpeechGPT

Model Comparison with GT Text Response for DailyTalk

Sample 1

User Audio:

Are you even glad that we get to buy new clothes too?

Model Responses Comparison:

Ah, that’s all part of the fun. What do you think of these shorts?

Ground Truth	USDM	From Scratch	Cascaded	SpeechGPT

Sample 2

User Audio:

The checkroom is just behind the front door.

Model Responses Comparison:

Ok. I will deposit my bag right now.

Ground Truth	USDM	From Scratch	Cascaded	SpeechGPT

Sample 3

User Audio:

A garage sale… That’s where people check out cars, right?

Model Responses Comparison:

What? Cars? Umm are you serious?

Ground Truth	USDM	From Scratch	Cascaded	SpeechGPT

Unit-to-Speech Reconstruction Analysis (Section 3.1)

We extracted XLS-R-based units from the original audio and then used only those units to reconstruct the audio three times. These samples show what information the units contain. We used speech from the Expresso (Nguyen et al., 2023), Fisher (Cieri et al., 2004), and GigaSpeech (Chen et al., 2021) datasets.
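To make "units" concrete, here is a minimal sketch of one common way to turn continuous speech features into discrete units: assigning each feature frame to its nearest centroid in a codebook. This page does not state how the units are derived, so the nearest-centroid step, the function name `quantize_to_units`, and the toy two-dimensional frames and codebook are all assumptions for illustration. Real frames would come from a speech encoder such as XLS-R.

```python
from typing import List, Sequence


def quantize_to_units(
    frames: Sequence[Sequence[float]],
    codebook: Sequence[Sequence[float]],
) -> List[int]:
    """Map each feature frame to the index of its nearest codebook centroid.

    Hypothetical helper: assumes units are nearest-centroid assignments,
    which is an illustrative choice and not confirmed by this page.
    """
    units = []
    for frame in frames:
        # Squared Euclidean distance from this frame to every centroid.
        dists = [
            sum((f - c) ** 2 for f, c in zip(frame, centroid))
            for centroid in codebook
        ]
        # The unit ID is the index of the closest centroid.
        units.append(dists.index(min(dists)))
    return units


# Toy stand-ins for encoder features and a learned codebook (assumed values).
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
frames = [[0.1, -0.1], [0.9, 1.2], [0.2, 0.8], [0.95, 0.9]]

units = quantize_to_units(frames, codebook)
print(units)  # [0, 1, 2, 1]
```

Under this assumption, each frame is reduced to a single integer, so anything the codebook does not distinguish is lost in the unit sequence. Comparing the three reconstructions against the ground truth in the samples below is a way to hear which properties of the original speech the units retain.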

Sample 1

Ground Truth	Reconstructed Audio 1	Reconstructed Audio 2	Reconstructed Audio 3

Sample 2

Ground Truth	Reconstructed Audio 1	Reconstructed Audio 2	Reconstructed Audio 3

Sample 3

Ground Truth	Reconstructed Audio 1	Reconstructed Audio 2	Reconstructed Audio 3

BibTeX

@inproceedings{
	kim2024paralinguisticsaware,
	title={Paralinguistics-Aware Speech-Empowered Large Language Models for Natural Conversation},
	author={Heeseung Kim and Soonshin Seo and Kyeongseok Jeong and Ohsung Kwon and Soyoon Kim and Jungwhan Kim and Jaehong Lee and Eunwoo Song and Myungwoo Oh and Jung-Woo Ha and Sungroh Yoon and Kang Min Yoo},
	booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
	year={2024},
	url={https://openreview.net/forum?id=NjewXJUDYq}
}
