ShanghaiTech University Knowledge Management System
Cross-Utterance Conditioned VAE for Speech Generation | |
2024 | |
Journal | IEEE/ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING (IF:4.1[JCR-2023],4.2[5-Year]) |
ISSN | 2329-9290 |
EISSN | 2329-9304 |
Volume | 32; Pages: 4263-4276 |
Publication status | Published |
DOI | 10.1109/TASLP.2024.3453598 |
Abstract | Speech synthesis systems powered by neural networks hold promise for multimedia production, but frequently face issues with producing expressive speech and seamless editing. In response, we present the Cross-Utterance Conditioned Variational Autoencoder speech synthesis (CUC-VAE S2) framework to enhance prosody and ensure natural speech generation. This framework leverages the powerful representational capabilities of pre-trained language models and the re-expression abilities of variational autoencoders (VAEs). The core component of the CUC-VAE S2 framework is the cross-utterance CVAE, which extracts acoustic, speaker, and textual features from surrounding sentences to generate context-sensitive prosodic features, more accurately emulating human prosody generation. We further propose two practical algorithms tailored for distinct speech synthesis applications: CUC-VAE TTS for text-to-speech and CUC-VAE SE for speech editing. The CUC-VAE TTS is a direct application of the framework, designed to generate audio with contextual prosody derived from surrounding texts. On the other hand, the CUC-VAE SE algorithm leverages real mel spectrogram sampling conditioned on contextual information, producing audio that closely mirrors real sound and thereby facilitating flexible speech editing based on text such as deletion, insertion, and replacement. Experimental results on the LibriTTS datasets demonstrate that our proposed models significantly enhance speech synthesis and editing, producing more natural and expressive speech. © 2014 IEEE. |
Keywords | Context sensitive languages; Neural networks; Signal encoding; Spectrographs; Variational techniques; Auto encoders; Expressive-speech; Language model; Natural speech; Pre-trained language model; Speech editing; Speech generation; Speech synthesis system; TTS; Variational autoencoder |
URL | View original |
Indexed by | EI |
Language | English |
Publisher | Institute of Electrical and Electronics Engineers Inc. |
EI accession number | 20244217217870 |
EI main heading | Speech enhancement |
EI classification codes | 101.1 ; 1101 ; 1106.1.1 ; 1201.2 ; 1301.1.3.1 ; 716.1 Information Theory and Signal Processing ; 741.3 Optical Devices and Systems ; 751.5 Speech |
Original document type | Journal article (JA) |
Source database | IEEE |
Document type | Journal article |
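Item identifier | https://kms.shanghaitech.edu.cn/handle/2MSLDSTB/436537 |
Collection | School of Information Science and Technology_Master's Students; School of Creativity and Art_PI Research Groups (P)_Zheng Tian Group |
Corresponding author | Sun, Fanglei |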
Author affiliations | 1.The University of Manchester, Department of Computer Science, Manchester; M13 9PL, United Kingdom; 2.ShanghaiTech University, School of Creativity and Art, Shanghai; 201210, China; 3.University of Cambridge, Machine Intelligence Lab, Cambridge; CB2 1TN, United Kingdom; 4.Shanghai Jiao Tong University, School of Electronic, Information and Electrical Engineering (SEIEE), Shanghai; 200240, China; 5.Tsinghua University, Department of Electronic Engineering, Beijing; 100190, China; 6.University College London, Department of Speech Hearing and Phonetic Sciences, London; WC1E 6BT, United Kingdom; 7.University College London, Department of Computer Science, London; WC1E 6BT, United Kingdom; 8.The Hong Kong University of Science and Technology (Guangzhou), Thrust of Internet of Things, Guangzhou; 511453, China; 9.University of Shanghai for Science and Technology, Department of Computer Science and Engineering, Shanghai; 200093, China |
Recommended citation (GB/T 7714) | Li, Yang,Yu, Cheng,Sun, Guangzhi,et al. Cross-Utterance Conditioned VAE for Speech Generation[J]. IEEE/ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING,2024,32:4263-4276. |
APA | Li, Yang.,Yu, Cheng.,Sun, Guangzhi.,Zu, Weiqin.,Tian, Zheng.,...&Sun, Fanglei.(2024).Cross-Utterance Conditioned VAE for Speech Generation.IEEE/ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING,32,4263-4276. |
MLA | Li, Yang,et al."Cross-Utterance Conditioned VAE for Speech Generation".IEEE/ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING 32(2024):4263-4276. |