Many-to-many voice conversion with sentence embedding based on VAACGAN
Citation: LI Yanping, CAO Pan, SHI Yang, ZHANG Yan. Many-to-many voice conversion with sentence embedding based on VAACGAN [J]. Journal of Beijing University of Aeronautics and Astronautics, 2021, 47(3): 500-508.
Authors: LI Yanping, CAO Pan, SHI Yang, ZHANG Yan
Institution: 1. College of Telecommunications and Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing 210003
Funding: Special Project for the Construction of the Intelligent Human-Computer Interaction Science and Technology Innovation Team of Jinling Institute of Technology; National Natural Science Foundation of China
Abstract: To address the problems of unsatisfactory conversion quality and low speaker similarity under non-parallel text conditions, a voice conversion method based on a Variational Autoencoding Auxiliary Classifier Generative Adversarial Network (VAACGAN) with sentence embedding is proposed, which effectively achieves high-quality many-to-many voice conversion under non-parallel text conditions. The discriminator of the auxiliary classifier GAN contains an auxiliary decoder network that predicts whether spectral features are real or fake while also outputting the speaker category of the training data, making GAN training more stable and accelerating its convergence. A sentence embedding is obtained by training a text encoder and fused into the model as a semantic content constraint; the semantic information it carries strengthens the latent variable's ability to represent speech content, alleviates the over-regularization effect of the latent variable, and effectively improves synthesis quality. Experimental results show that, compared with the baseline model, the average MCD of the converted speech decreases by 6.67%, the average MOS increases by 8.33%, and the average ABX increases by 11.56%, demonstrating that the method achieves significant improvements in both speech quality and speaker similarity and realizes high-quality voice conversion.

Keywords: voice conversion; sentence embedding; text encoder; auxiliary classifier generative adversarial network (ACGAN); variational autoencoder; non-parallel text; many-to-many
Received: 2020-08-31

Many-to-many voice conversion with sentence embedding based on VAACGAN
LI Yanping, CAO Pan, SHI Yang, ZHANG Yan. Many-to-many voice conversion with sentence embedding based on VAACGAN [J]. Journal of Beijing University of Aeronautics and Astronautics, 2021, 47(3): 500-508.
Authors: LI Yanping, CAO Pan, SHI Yang, ZHANG Yan
Institution: 1. College of Telecommunications and Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing 210003, China; 2. School of Software Engineering, Jinling Institute of Technology, Nanjing 211169, China
Abstract: To solve the problems of poor speech quality and unsatisfactory speaker similarity of converted speech in existing non-parallel voice conversion (VC) methods, this paper presents a novel voice conversion model based on a Variational Autoencoding Auxiliary Classifier Generative Adversarial Network (VAACGAN) with sentence embedding, which achieves high-quality many-to-many voice conversion for non-parallel corpora. First, the discriminator of the ACGAN contains an auxiliary decoder network that predicts whether a spectral feature is real or fake while simultaneously outputting the speaker category of the training data, yielding a more stable training process and faster convergence. Furthermore, a sentence embedding obtained by training a text encoder is introduced into the model as a semantic content constraint; it enhances the ability of the latent variable to characterize speech content, effectively alleviates the over-regularization effect of the latent variable, and significantly improves the quality of the converted speech. Experimental results show that, compared with the baseline method, the average MCD of the converted speech is decreased by 6.67%, the average MOS is increased by 8.33%, and the average ABX is increased by 11.56%, which demonstrates that the proposed method significantly outperforms the baseline in both speech quality and speaker similarity and achieves high-quality voice conversion.
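The abstract describes a discriminator with two outputs: an adversarial real/fake prediction and an auxiliary speaker-category prediction, trained jointly. The sketch below illustrates only how such a two-headed ACGAN discriminator loss is combined; it is not the paper's implementation, and the function name and the hand-picked probability values are hypothetical, chosen for illustration.

```python
import math

def acgan_discriminator_loss(adv_score, class_probs, is_real, speaker_id):
    """Combined ACGAN discriminator loss for one training example.

    adv_score   -- discriminator's real/fake probability in (0, 1)
    class_probs -- predicted speaker-class distribution (sums to 1)
    is_real     -- True if the spectral feature came from real data
    speaker_id  -- index of the true speaker category
    """
    eps = 1e-12  # numerical guard against log(0)
    # Adversarial term: binary cross-entropy on the real/fake prediction.
    if is_real:
        adv_loss = -math.log(adv_score + eps)
    else:
        adv_loss = -math.log(1.0 - adv_score + eps)
    # Auxiliary-classifier term: cross-entropy on the speaker category.
    cls_loss = -math.log(class_probs[speaker_id] + eps)
    # The discriminator minimizes both terms jointly, so the classifier
    # head stabilizes training by supervising speaker identity.
    return adv_loss + cls_loss

# Example: a real sample from speaker 2, with a fairly confident discriminator.
loss = acgan_discriminator_loss(
    adv_score=0.9,
    class_probs=[0.05, 0.05, 0.8, 0.1],
    is_real=True,
    speaker_id=2,
)
```

In a full VAACGAN this loss would be computed over mini-batches of spectral features, with the generator trained against the adversarial head while also trying to satisfy the speaker classifier.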
Keywords: voice conversion; sentence embedding; text encoder; auxiliary classifier generative adversarial network (ACGAN); variational autoencoder; non-parallel text; many-to-many
This document is indexed by CNKI, Wanfang Data, and other databases.