Improving Emotional TTS with an Emotion Intensity Input from Unsupervised Extraction

This is the demo page for the paper "Improving Emotional TTS with an Emotion Intensity Input from Unsupervised Extraction", submitted to SSW'21. It is currently for review purposes only.

Examples from the listening test

System Samples
angry sad happy fearful surprised happy neutral
baseline
attention
transformer
rank
copy synth

UI of the listening test. 25 samples were randomly selected. Each sample had to be rated on a 5-point MOS scale and, at the same time, labeled with the perceived emotion.

listening_test
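
As a rough illustration of how 5-point ratings like these are commonly summarized, the sketch below computes a mean opinion score with a normal-approximation 95% confidence interval. The function name `mos_with_ci` and the example ratings are hypothetical and made up for illustration; they are not results from the listening test, and the paper may aggregate its ratings differently.

```python
import math
import statistics


def mos_with_ci(ratings, z=1.96):
    """Return (mean, half_width) for a list of 1-5 ratings.

    half_width is z * sample_std / sqrt(n), i.e. a normal-approximation
    confidence interval (z=1.96 corresponds to roughly 95%).
    """
    n = len(ratings)
    if n < 2:
        raise ValueError("need at least two ratings")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie on the 1-5 scale")
    mean = statistics.fmean(ratings)
    half_width = z * statistics.stdev(ratings) / math.sqrt(n)
    return mean, half_width


# Hypothetical ratings for one system, invented for this example only.
example_ratings = [4, 5, 3, 4, 4]
mos, ci = mos_with_ci(example_ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

With only a handful of ratings the normal approximation is crude; a t-distribution or bootstrap interval would be a reasonable alternative.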

Effect of different scaling of the attention LSTM intensity input

Scaling Samples
angry sad happy fearful surprised happy neutral
0
1
4
7
10

Intensity inputs extracted with different methods for JK_a02

Emotion intensity extracted by input gradients

JK_a02_input_gradients

Emotion intensity extracted by SmoothGrad

JK_a02_smoothgrad

Emotion intensity extracted by input × gradient

JK_a02_inputxgradient

Emotion intensity extracted by integrated gradients

JK_a02_integrated_gradients
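
For readers unfamiliar with the four attribution methods named above, the following is a minimal sketch of each on a toy model. Everything in it is an assumption made for illustration: the scalar "classifier" `score`, its weights `W`, and the hand-written gradient stand in for a trained emotion classifier with automatic differentiation. How the paper turns such attributions into an intensity input is not shown on this page, and the sketch does not attempt it.

```python
import math
import random

# Hypothetical weights of a toy scoring function f(x) = tanh(w . x).
W = [0.5, -1.0, 2.0]


def score(x):
    return math.tanh(sum(w * xi for w, xi in zip(W, x)))


def grad(x):
    # Analytic gradient: d/dx_i tanh(w . x) = (1 - tanh(w . x)^2) * w_i
    s = score(x)
    return [(1.0 - s * s) * w for w in W]


def input_gradients(x):
    """Plain gradient of the score with respect to the input."""
    return grad(x)


def input_x_gradient(x):
    """Elementwise product of the input and its gradient."""
    return [xi * gi for xi, gi in zip(x, grad(x))]


def smoothgrad(x, n_samples=50, sigma=0.1, seed=0):
    """Average the gradient over Gaussian-perturbed copies of the input."""
    rng = random.Random(seed)
    total = [0.0] * len(x)
    for _ in range(n_samples):
        noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
        for i, gi in enumerate(grad(noisy)):
            total[i] += gi
    return [t / n_samples for t in total]


def integrated_gradients(x, baseline=None, steps=200):
    """Midpoint-rule approximation of the path integral of the gradient
    along the straight line from the baseline to the input."""
    if baseline is None:
        baseline = [0.0] * len(x)
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, gi in enumerate(grad(point)):
            total[i] += gi
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]


x = [0.2, 0.1, 0.3]
for name, fn in [("input gradients", input_gradients),
                 ("smoothgrad", smoothgrad),
                 ("input x gradient", input_x_gradient),
                 ("integrated gradients", integrated_gradients)]:
    print(name, [round(v, 4) for v in fn(x)])
```

One useful sanity check for integrated gradients is the completeness property: the attributions should sum to approximately `score(x) - score(baseline)`.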