AI在线

MOSS-TTSD Makes a Stunning Open Source Debut: A Million Hours of Training Creates a New King in AI Podcasts

MOSS-TTSD (Text to Spoken Dialogue), developed by the Tsinghua University Speech and Language Laboratory (Tencent AI Lab) in collaboration with Shanghai Chuangzhi College, Fudan University, and Musi Intelligent, has been officially open-sourced. This marks a major breakthrough in AI speech synthesis technology for dialogue scenarios.

This spoken dialogue generation model is built on the Qwen3-1.7B-base model and was further trained on approximately 1 million hours of single-speaker speech data and 400,000 hours of dialogue speech data. Using discrete speech-sequence modeling, it achieves highly expressive spoken dialogue generation in both Chinese and English, making it particularly well suited to long-form content creation such as AI podcasts, audiobooks, and film and television dubbing.

The core innovation of MOSS-TTSD is its XY-Tokenizer, which adopts a two-stage multi-task learning approach. Using eight RVQ (residual vector quantization) codebooks, it compresses the speech signal to a bitrate of 1 kbps while preserving both semantic and acoustic information, ensuring the naturalness and fluency of the generated speech. The model supports ultra-long speech generation of up to 960 seconds, avoiding the unnatural transitions that segment stitching causes in traditional TTS pipelines. In addition, MOSS-TTSD offers zero-shot voice cloning: uploading a complete dialogue or single-speaker audio samples is enough to clone both speakers' voices. It also supports control of vocal events, such as laughter, adding further expressiveness to the speech.
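The eight-codebook RVQ scheme can be illustrated with a toy sketch. This is not MOSS-TTSD's actual XY-Tokenizer (the real codebooks are learned, and the frame rate used in the bitrate arithmetic is an assumed value, not confirmed by the article); it only shows the general idea of residual vector quantization, in which each stage quantizes the residual left over by the previous stage, and how eight codebooks of 1,024 entries each would yield 1 kbps at 12.5 frames per second:

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Residual vector quantization: each codebook quantizes the
    residual left over by the previous stage."""
    residual = x.copy()
    indices = []
    for cb in codebooks:
        dists = np.linalg.norm(cb - residual, axis=1)  # distance to every codeword
        i = int(np.argmin(dists))
        indices.append(i)
        residual = residual - cb[i]                    # pass the residual to the next stage
    return indices, residual

rng = np.random.default_rng(0)
dim, n_codewords, n_stages = 16, 1024, 8   # toy sizes; 8 stages mirrors the 8 codebooks
codebooks = [rng.normal(size=(n_codewords, dim)) for _ in range(n_stages)]
x = rng.normal(size=dim)
indices, residual = rvq_encode(x, codebooks)

# Reconstruction is the sum of the chosen codewords across all stages.
recon = sum(cb[i] for cb, i in zip(codebooks, indices))

# Bitrate arithmetic under assumed parameters: log2(1024) = 10 bits per stage,
# 8 stages = 80 bits per frame; at an assumed 12.5 frames/s that is 1000 bps = 1 kbps.
bits_per_frame = n_stages * int(np.log2(n_codewords))
print(bits_per_frame, bits_per_frame * 12.5)  # 80 1000.0
```

In a trained tokenizer the codebooks are fit to data (here they are random), so later stages capture progressively finer detail and the residual shrinks at each step.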

Compared with other voice models on the market, MOSS-TTSD significantly outperforms the open-source model MoonCast on objective Chinese metrics, with excellent prosody and naturalness. Against ByteDance's Doubao voice model it still lags slightly in tone and rhythm, but with the advantages of being open source and free for commercial use, MOSS-TTSD nevertheless shows strong application potential. Model weights, inference code, and API interfaces are fully open-sourced via GitHub (https://github.com/OpenMOSS/MOSS-TTSD) and HuggingFace (https://huggingface.co/fnlp/MOSS-TTSD-v0.5). Official documentation and an online demo are also available, giving developers convenient access.

The release of MOSS-TTSD brings new vitality to the field of AI speech interaction, especially in scenarios such as long-form interviews, podcast production, and film and television dubbing, where its stability and expressiveness will help drive intelligent content creation. Going forward, the team plans to further optimize the model, improving the accuracy of speaker switching and emotional expression in multi-speaker scenarios.

Address: https://github.com/OpenMOSS/MOSS-TTSD
