Irodori-TTS-600M-v3-VoiceDesign
Irodori-TTS-600M-v3-VoiceDesign is a Japanese Text-to-Speech model based on a Rectified Flow Diffusion Transformer (RF-DiT) architecture. It combines the architectural enhancements of the v3 series with the caption-driven control introduced in v2, adding a flexible Multi-modal Voice Design system.
You can now generate and control speech using any combination of three core elements: Text (Input) + Reference Speech + Caption Text. This allows you to retain a specific speaker's vocal identity (via reference audio) while fully directing their emotion, speaking style, and delivery using a descriptive caption and emoji annotations.
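The combinations described above can be pictured with a small sketch. `VoiceDesignRequest` and its field names are hypothetical and are not the repository's API; the mode names follow the sample groups in this card, and whether text-only input is supported is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VoiceDesignRequest:
    """Hypothetical container for the three inputs; not the repository's API."""

    text: str                              # Japanese input text (may contain emoji)
    reference_audio: Optional[str] = None  # path to a reference clip, if cloning
    caption: Optional[str] = None          # style description, if any

    def __post_init__(self) -> None:
        # The input text is the one element every combination needs.
        if not self.text.strip():
            raise ValueError("text is required in every mode")

    def mode(self) -> str:
        """Name the conditioning combination this request represents."""
        has_ref = self.reference_audio is not None
        has_caption = self.caption is not None
        if has_ref and has_caption:
            return "style-controlled voice cloning"
        if has_caption:
            return "pure voice design"
        if has_ref:
            return "voice cloning"
        return "text only"  # assumption: unconditioned generation
```

For example, a request with a caption but no reference audio maps to the "pure voice design" mode shown in the first sample group.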
Key Features
- Multi-modal Voice Design: Simultaneously condition the generation on a reference audio clip (for voice cloning) and a text caption (for style/emotion control).
- Flow Matching TTS: Rectified Flow Diffusion Transformer over continuous DACVAE latents for high-quality Japanese speech synthesis.
- Emoji-based Style Control: Embed emojis directly in the input text for granular control over the delivery and sound effects (e.g., laughter, coughing, sighs). See EMOJI_ANNOTATIONS.md for details.
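As a rough illustration of the flow-matching idea in the list above, the sketch below integrates a velocity field from noise (t=0) to data (t=1) with plain Euler steps. It is a toy under stated assumptions: `make_toy_velocity` stands in for the trained network, the vectors are four plain floats rather than DACVAE latents, and the model's actual sampler, step count, and conditioning are not described in this card.

```python
import random
from typing import Callable, List

Vector = List[float]


def euler_sample(velocity: Callable[[Vector, float], Vector],
                 noise: Vector, num_steps: int) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t=0 (noise) to t=1 (data)."""
    x = list(noise)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(target: Vector) -> Callable[[Vector, float], Vector]:
    """Toy stand-in for the network: exact velocity of a straight path to `target`."""
    def velocity(x: Vector, t: float) -> Vector:
        # On the path x_t = (1 - t) * x0 + t * x1, this equals x1 - x0.
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(4)]
target = [0.5, -1.0, 2.0, 0.0]
sample = euler_sample(make_toy_velocity(target), noise, num_steps=8)
```

Because the toy path is perfectly straight, a handful of Euler steps is enough to land on the target; a learned velocity field is only approximately straight, so real samplers trade step count against quality.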
What's New in v3 VoiceDesign
This version integrates the architectural improvements of v3 with an evolved Voice Design capability:
- 3-Factor Control (Text + Ref Voice + Caption): Previously, Voice Design completely replaced the reference audio with a caption. Now you can use both: clone a voice and dictate how it speaks via text captions.
- Variable-length Training & Duration Predictor: Uses a Duration Predictor to improve training efficiency and to lower the Real-Time Factor (RTF) during inference, i.e. faster synthesis relative to the length of the generated audio.
- Expanded Training Data: Trained on a larger dataset, resulting in more natural speech synthesis and improved robustness across complex styling combinations.
- Integrated Watermarking: Uses SilentCipher to apply robust, inaudible audio watermarks directly to the generated outputs, promoting responsible AI usage.
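For readers unfamiliar with the Real-Time Factor mentioned in the list above: it is the synthesis time divided by the duration of the audio produced, so values below 1 mean faster than real time. The helper below is a generic sketch; `synthesize` is a hypothetical callable that returns the generated audio's duration in seconds and is not part of this project's code.

```python
import time
from typing import Callable


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def measure_rtf(synthesize: Callable[[], float]) -> float:
    """Time one call to `synthesize` (hypothetical) and return its RTF."""
    start = time.perf_counter()
    audio_seconds = synthesize()  # assumed to return audio duration in seconds
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, audio_seconds)
```

For instance, generating 10 seconds of audio in 2 seconds gives an RTF of 0.2.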
Architecture
The model (approximately 600M parameters) consists of five main components:
- Text Encoder: Token embeddings initialized from llm-jp/llm-jp-3-150m, followed by self-attention + SwiGLU transformer layers with RoPE.
- Reference Latent Encoder: Encodes patched reference audio latents for speaker identity conditioning.
- Caption Encoder: Encodes the style-control text (captions) to define the emotion, tone, and acoustic environment.
- Diffusion Transformer: Joint-attention DiT blocks combining text, reference, and caption conditioning with Low-Rank AdaLN, half-RoPE, and SwiGLU MLPs.
- Duration Predictor: Predicts audio duration from encoded text and conditioning vectors using stacked SwiGLU MLP blocks.
Audio is represented as continuous 32-dimensional latent sequences via the Aratako/Semantic-DACVAE-Japanese-32dim codec, enabling high-quality 48 kHz waveform reconstruction.
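Several of the components above use SwiGLU feed-forward blocks. The following is a minimal sketch of the generic SwiGLU formulation, down(SiLU(gate(x)) * up(x)), written with plain lists for readability. The weight names, the absence of biases and normalization, and the tiny sizes are assumptions for illustration; the model's actual layer configuration is not given in this card.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(w: Matrix, x: Vector) -> Vector:
    """Multiply matrix `w` (rows x cols) by vector `x`."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def silu(z: float) -> float:
    """SiLU (swish) activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))


def swiglu_mlp(x: Vector, w_gate: Matrix, w_up: Matrix, w_down: Matrix) -> Vector:
    """Generic SwiGLU feed-forward: down(silu(gate(x)) * up(x))."""
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]  # element-wise gating
    return matvec(w_down, hidden)
```

The gating branch lets the block scale each hidden unit by a learned, input-dependent factor, which is the main difference from a plain two-layer MLP.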
Audio Samples
Note: To clearly demonstrate the effect of captions, the samples within each group below were generated using the exact same random seed. The variations in delivery are purely the result of the changed prompts.
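The effect of fixing the seed can be sketched as follows: with the same seed, the initial noise is identical, so any difference in the output comes from the conditioning inputs alone. `initial_noise` is a hypothetical helper built on Python's standard library; how the repository's inference code actually exposes a seed option is not shown in this card.

```python
import random
from typing import List


def initial_noise(seed: int, num_values: int) -> List[float]:
    """Draw Gaussian noise from a generator seeded with `seed`."""
    rng = random.Random(seed)  # local generator, leaves global state untouched
    return [rng.gauss(0.0, 1.0) for _ in range(num_values)]


# Same seed, same starting noise; only the caption would differ between runs.
noise_for_caption_a = initial_noise(seed=42, num_values=8)
noise_for_caption_b = initial_noise(seed=42, num_values=8)
```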
1. Pure Voice Design (Text + Caption)
Generate diverse voices and styles purely through descriptive text captions without any reference audio.
| Text (Input) | Caption (Voice Design) | Generated Audio |
|---|---|---|
| ๆฌๆฅใฏใ่ถใใใใ ใใ่ช ใซใใใใจใใใใใพใใใฉใใใใใฃใใใ้ใใใใ ใใใ | ่ฝใก็ใใๅคงไบบใฎ็ทๆงใใใฉใผใใซใชๅ ดใงใๆทฑใ้ฟใๅฃฐใงไธๅฏงใใคๆญ่ฟใฎๆใ่พผใใฆ่ฉฑใใฆใใใ | |
| ๆฌๆฅใฏใ่ถใใใใ ใใ่ช ใซใใใใจใใใใใพใใใฉใใใใใฃใใใ้ใใใใ ใใใ | ่ฅใๅ ๆฐใชๅฅณๆงใฎๅฃฐใใซใใงใฎๅบๅกใฎใใใซใๆใใใใญใใญใจใใๅฐใ้ซใใฎใใผใณใง่ฉฑใใฆใใใ | |
| ใใฟใพใใ๏ผใใฎ่ฟใใซใณใณใใใฃใฆใใใพใใ๏ผใกใใฃใจๆฅใใงใฆใ้ใซ่ฟทใฃใกใใฃใใฟใใใง | ไฝใใฎๅฃฐใฎ็ทๆงใใไธๅฏงใซ้ใๅฐใญใฆใใใ็ฉใใใง็คผๅๆญฃใใใไฝ่ฃใฎใใๅฃ่ชฟใ | |
| ใใฟใพใใ๏ผใใฎ่ฟใใซใณใณใใใฃใฆใใใพใใ๏ผใกใใฃใจๆฅใใงใฆใ้ใซ่ฟทใฃใกใใฃใใฟใใใง | ่ฅใๅฅณๆงใใๆ ใฆใๆงๅญใงๆฉๅฃใซ่ฉฑใใฆใใใ็ฆใใจไธๅฎใๅฃฐใซใซใใใงใใใ |
2. Style-Controlled Voice Cloning (Text + Caption + Ref Speech)
Clone a voice using reference audio, and dictate the specific emotion or delivery style using a caption.
| Text (Input) | Ref Audio | Caption (Voice Design) | Generated Audio |
|---|---|---|---|
| ใฉใใใฆใใฃใจๆฉใๆใใฆใใใชใใฃใใฎ๏ผ็งใใใฃใจๅพ ใฃใฆใใฎใซใ | ๆทฑใๅทใคใใไปใซใๆณฃใๅบใใใใชๆงๅญใๅฃฐใ้ใใฆใใใๆฒ็ใชใใผใณใงๅผฑใ ใใ่ฉฑใใ | ||
| ใฉใใใฆใใฃใจๆฉใๆใใฆใใใชใใฃใใฎ๏ผ็งใใใฃใจๅพ ใฃใฆใใฎใซใ | ๆฟใใๆใใๆใใฆใใใๅฃฐใ่ใใใฆใใใ็ธๆใ่ฒฌใ็ซใฆใใใใชๅผทใๅฃ่ชฟใงใๆๆ ็ใชใใผใณใ | ||
| ใฉใใใฆใใฃใจๆฉใๆใใฆใใใชใใฃใใฎ๏ผ็งใใใฃใจๅพ ใฃใฆใใฎใซใ | ๅฎๅ จใซๅใ่ฟใฃใฆใใๆงๅญใๆๆ ใฎ่ตทไผใไนใใใๅทใใใใผใณใง้ใใซ็ชใๆพใใใใซ่ฉฑใใ |
3. Fully Controlled Generation (Text + Caption + Ref Speech + Emoji)
Combine all control inputs for maximum expressiveness, adding specific physiological sounds (sighs, coughs) or distinct nuances via emojis on top of the cloned and styled voice.
| Text (with Emoji) | Ref Audio | Caption (Voice Design) | Generated Audio |
|---|---|---|---|
| ใใฏใฏใฃ๐คญใใใๆฌๅฝใซ่จใฃใฆใใฎ๏ผโฆ๐ฎโ๐จใพใใๅใใใใใฉใญใ | ไฝ่ฃใฎใใๅคงไบบใฎ็ทๆงใ่ฆชใใ็ธๆใซๅฏพใใฆใใใ ใใ้ฐๅฒๆฐใงๅใใชใใใๆฅฝใใใใซ่ฉฑใใฆใใใ | ||
| ใฒใใใใฒใใ๐คงโฆใใใใๅฐใไผใพใใฆใ๐ญไปๆฅใฏใใ็ก็ใฟใใใ | ไฝ่ชฟใๆชใใ้ๅธธใซ่ฆใใใใช่ฅใๅฅณๆงใๆฏใ็ตถใ็ตถใใซใ็ณใ่จณใชใใใใซๅผฑใ ใใๅฃฐใง่ฉฑใใฆใใใ |
Usage
For inference code, installation instructions, and training scripts, please refer to the GitHub repository:
GitHub: Aratako/Irodori-TTS
Training Data & Annotation
The model was trained on an expanded, high-quality Japanese speech dataset. To enable the multi-modal Voice Design functionality, the training data was enriched with comprehensive text captions describing the audio characteristics.
The emoji annotations and initial text captions were generated and labeled using a fine-tuned model based on Qwen/Qwen3-Omni-30B-A3B-Instruct. Subsequently, the text captions were rephrased and refined using Qwen/Qwen3.5-35B-A3B.
Limitations
- Japanese Only: This model currently supports Japanese text input only.
- Conditioning Conflicts: When using both Reference Audio and a Text Caption, providing contradictory instructions (e.g., providing a deep male reference voice but captioning "a high-pitched young girl") may result in unstable audio quality, unnatural artifacts, or one condition overriding the other. For optimal results, use the caption to guide the emotion, style, or environment, while keeping the base voice characteristics aligned with the reference audio.
- Prompt Adherence: While the model generally follows the caption's instructions, highly complex or contradictory descriptions might result in inconsistent voice generation.
- Emoji Control: While emoji-based style control adds expressiveness, the effect may vary depending on context and is not always perfectly consistent.
- Kanji Reading Accuracy: The model's ability to accurately read Kanji is relatively weak compared to other TTS models of a similar size. You may need to convert complex Kanji into Hiragana or Katakana beforehand.
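As a small aid for the Kanji limitation above, the sketch below flags which Kanji appear in an input string so they can be reviewed or rewritten in kana by hand. It only detects characters by Unicode range (CJK Unified Ideographs and Extension A); it does not produce readings, and the function names are illustrative rather than part of this project.

```python
from typing import List

# CJK Unified Ideographs Extension A and the main CJK Unified Ideographs block.
_KANJI_RANGES = ((0x3400, 0x4DBF), (0x4E00, 0x9FFF))


def is_kanji(ch: str) -> bool:
    """Return True if the single character `ch` falls in a Kanji range."""
    code = ord(ch)
    return any(low <= code <= high for low, high in _KANJI_RANGES)


def find_kanji(text: str) -> List[str]:
    """List the distinct Kanji in `text`, in order of first appearance."""
    seen: List[str] = []
    for ch in text:
        if is_kanji(ch) and ch not in seen:
            seen.append(ch)
    return seen
```

Calling `find_kanji` on a sentence before synthesis gives a quick checklist of characters whose reading may be worth spelling out in Hiragana or Katakana.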
License & Ethical Restrictions
License
This model is released under MIT.
Ethical Restrictions
In addition to the license terms, the following ethical restrictions apply:
- No Impersonation: Do not use this model to clone or impersonate the voice of any individual (e.g., voice actors, celebrities, public figures) without their explicit consent.
- No Misinformation: Do not use this model to generate deepfakes or synthetic speech intended to mislead others or spread misinformation.
- Voice Generation Disclaimer: When generating speech purely from text or captions without using a reference audio, it is possible that the generated voice may coincidentally resemble that of a real person. This is strictly a probabilistic artifact within the latent space. The model was not trained with the intent of reproducing specific individuals.
- Liability Disclaimer: The developers assume no liability for any misuse of this model. Users are solely responsible for ensuring their use of the generated content complies with applicable laws and regulations in their jurisdiction.
Acknowledgments
This project builds upon the following works:
- Echo-TTS: Architecture and training design reference
- DACVAE: Audio VAE
- llm-jp/llm-jp-3-150m: Tokenizer and embedding weight initialization
- SilentCipher: Audio watermarking integration
We would also like to extend our special thanks to Respair for the inspiration behind the emoji annotation feature, and to gabrielclark3330 for supporting this project.
Citation
If you use Irodori-TTS in your research or project, please cite it as follows:
@misc{irodori-tts-v3-voicedesign,
author = {Chihiro Arata},
title = {Irodori-TTS: A Flow Matching-based Text-to-Speech Model with Emoji-driven Style Control},
year = {2026},
publisher = {Hugging Face},
journal = {Hugging Face repository},
howpublished = {\url{https://huggingface.co/Aratako/Irodori-TTS-600M-v3-VoiceDesign}}
}
Model tree for Aratako/Irodori-TTS-600M-v3-VoiceDesign
Base model
Aratako/Irodori-TTS-500M-v2