Many-to-many voice conversion with sentence embedding based on VAACGAN
Citation: LI Yanping, CAO Pan, SHI Yang, ZHANG Yan. Many-to-many voice conversion with sentence embedding based on VAACGAN [J]. Journal of Beijing University of Aeronautics and Astronautics, 2021, 47(3): 500-508.
Authors: LI Yanping, CAO Pan, SHI Yang, ZHANG Yan
Institution: 1. College of Telecommunications and Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing 210003
Funding: Special Project for the Construction of the Intelligent Human-Computer Interaction Science and Technology Innovation Team of Jinling Institute of Technology; National Natural Science Foundation of China
Abstract: To address the problems of unsatisfactory conversion quality and low speaker similarity under non-parallel text conditions, a voice conversion method based on a Variational Autoencoding Auxiliary Classifier Generative Adversarial Network (VAACGAN) with sentence embedding is proposed, which effectively achieves high-quality many-to-many voice conversion under non-parallel text conditions. The discriminator of the auxiliary classifier GAN contains an auxiliary decoder network that predicts whether spectral features are real or fake while also outputting the speaker category of the training data, making GAN training more stable and accelerating its convergence. A sentence embedding is obtained by training a text encoder and fused into the model as a semantic content constraint; the semantic information it carries strengthens the latent variable's ability to represent speech content, alleviates the over-regularization effect of the latent variable, and effectively improves synthesis quality. Experimental results show that, compared with the baseline model, the average MCD of the converted speech decreases by 6.67%, the average MOS increases by 8.33%, and the average ABX increases by 11.56%, demonstrating that the method achieves significant improvements in both speech quality and speaker similarity and realizes high-quality voice conversion.

Keywords: voice conversion; sentence embedding; text encoder; auxiliary classifier generative adversarial network (ACGAN); variational autoencoder; non-parallel text; many-to-many
Received: 2020-08-31

Many-to-many voice conversion with sentence embedding based on VAACGAN
LI Yanping, CAO Pan, SHI Yang, ZHANG Yan. Many-to-many voice conversion with sentence embedding based on VAACGAN [J]. Journal of Beijing University of Aeronautics and Astronautics, 2021, 47(3): 500-508.
Authors: LI Yanping, CAO Pan, SHI Yang, ZHANG Yan
Institution: 1. College of Telecommunications and Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing 210003, China; 2. School of Software Engineering, Jinling Institute of Technology, Nanjing 211169, China
Abstract: To solve the problems of poor speech quality and unsatisfactory speaker similarity of converted speech in existing non-parallel voice conversion (VC) methods, this paper presents a novel voice conversion model based on a Variational Autoencoding Auxiliary Classifier Generative Adversarial Network (VAACGAN) with sentence embedding, which achieves high-quality many-to-many voice conversion for non-parallel corpora. First, the discriminator of the ACGAN contains an auxiliary decoder network that predicts whether a spectral feature is real or fake while simultaneously outputting the speaker category of the training data, yielding a more stable training process and faster convergence. Furthermore, a sentence embedding obtained by training a text encoder is introduced into the model as a semantic content constraint; it enhances the ability of the latent variable to characterize speech content, effectively alleviates the over-regularization effect of the latent variable, and significantly improves the quality of the converted speech. Experimental results show that, compared with the baseline method, the average MCD of the converted speech is decreased by 6.67%, the average MOS is increased by 8.33%, and the average ABX is increased by 11.56%, which demonstrates that the proposed method significantly outperforms the baseline in both speech quality and speaker similarity and achieves high-quality voice conversion.
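The abstract describes a discriminator with two outputs: an adversarial real/fake prediction and an auxiliary speaker-category prediction, trained jointly. The sketch below illustrates only how such a two-headed ACGAN discriminator loss is combined; it is not the paper's implementation, and the function name and the hand-picked probability values are hypothetical, chosen for illustration.

```python
import math

def acgan_discriminator_loss(adv_score, class_probs, is_real, speaker_id):
    """Combined ACGAN discriminator loss for one training example.

    adv_score   -- discriminator's real/fake probability in (0, 1)
    class_probs -- predicted speaker-class distribution (sums to 1)
    is_real     -- True if the spectral feature came from real data
    speaker_id  -- index of the true speaker category
    """
    eps = 1e-12  # numerical guard against log(0)
    # Adversarial term: binary cross-entropy on the real/fake prediction.
    if is_real:
        adv_loss = -math.log(adv_score + eps)
    else:
        adv_loss = -math.log(1.0 - adv_score + eps)
    # Auxiliary-classifier term: cross-entropy on the speaker category.
    cls_loss = -math.log(class_probs[speaker_id] + eps)
    # The discriminator minimizes both terms jointly, so the classifier
    # head stabilizes training by supervising speaker identity.
    return adv_loss + cls_loss

# Example: a real sample from speaker 2, with a fairly confident discriminator.
loss = acgan_discriminator_loss(
    adv_score=0.9,
    class_probs=[0.05, 0.05, 0.8, 0.1],
    is_real=True,
    speaker_id=2,
)
```

In a full VAACGAN this loss would be computed over mini-batches of spectral features, with the generator trained against the adversarial head while also trying to satisfy the speaker classifier.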
Keywords: voice conversion; sentence embedding; text encoder; auxiliary classifier generative adversarial network (ACGAN); variational autoencoder; non-parallel text; many-to-many
This document is indexed by CNKI, Wanfang Data, and other databases.