Irodori-TTS-600M-v3-VoiceDesign


Irodori-TTS-600M-v3-VoiceDesign is an advanced Japanese Text-to-Speech model based on a Rectified Flow Diffusion Transformer (RF-DiT) architecture. Uniting the architectural enhancements of the v3 series with the caption-driven control concept from v2, this newly developed model introduces a highly flexible Multi-modal Voice Design system.

You can now generate and control speech using any combination of three core elements: Text (Input) + Reference Speech + Caption Text. This allows you to retain a specific speaker's vocal identity (via reference audio) while fully directing their emotion, speaking style, and delivery using a descriptive caption and emoji annotations.

🌟 Key Features

  • Multi-modal Voice Design: Simultaneously condition the generation on a reference audio clip (for voice cloning) and a text caption (for style/emotion control).
  • Flow Matching TTS: Rectified Flow Diffusion Transformer over continuous DACVAE latents for high-quality Japanese speech synthesis.
  • Emoji-based Style Control: Embed emojis directly in the input text for granular control over the delivery and sound effects (e.g., laughter, coughing, sighs). See EMOJI_ANNOTATIONS.md for details.
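
The ways these inputs combine can be sketched as a small data structure. This is only an illustration: `VoiceDesignRequest`, its field names, and the mode labels are hypothetical and are not taken from the Irodori-TTS codebase; see the GitHub repository for the real inference interface.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VoiceDesignRequest:
    """Hypothetical container for one synthesis request (not the project's API)."""
    text: str                              # Japanese input text, may contain emoji
    reference_audio: Optional[str] = None  # path to a reference clip, if cloning
    caption: Optional[str] = None          # style/emotion description, if any

    def mode(self) -> str:
        """Name the conditioning combination this request uses."""
        if not self.text.strip():
            raise ValueError("text is required")
        has_ref = self.reference_audio is not None
        has_cap = bool(self.caption)
        if has_ref and has_cap:
            return "style-controlled cloning"
        if has_ref:
            return "voice cloning"
        if has_cap:
            return "pure voice design"
        return "text only"
```

Keeping the reference clip and the caption as independent optional fields mirrors the idea that either one, both, or neither can accompany the text.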

✨ What's New in v3 VoiceDesign

This version integrates the architectural improvements of v3 with an evolved Voice Design capability:

  • 3-Factor Control (Text + Ref Voice + Caption): Previously, Voice Design replaced the reference audio with a caption entirely. Now you can use both: clone a voice from reference audio and direct its delivery with a text caption.
  • Variable-length Training & Duration Predictor: Uses a Duration Predictor, which improves training efficiency and the Real-Time Factor (RTF; lower is faster) during inference.
  • Expanded Training Data: Trained on a larger dataset, resulting in more natural speech synthesis and improved robustness across complex styling combinations.
  • Integrated Watermarking: Applies robust, inaudible SilentCipher watermarks directly to the generated audio, promoting responsible AI usage.
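
For readers unfamiliar with the Real-Time Factor mentioned above, it is conventionally the time spent synthesizing divided by the duration of the audio produced, so values below 1 mean faster than real time. The helper names below are hypothetical, and `synthesize` stands in for whatever inference call you actually use.

```python
import time
from typing import Callable, Tuple


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return synthesis_seconds / audio_seconds


def measure_rtf(synthesize: Callable[[], Tuple[int, int]]) -> float:
    """Time a callable that returns (num_samples, sample_rate)."""
    start = time.perf_counter()
    num_samples, sample_rate = synthesize()
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, num_samples / sample_rate)
```

For example, producing 8 seconds of audio in 2 seconds of compute gives `real_time_factor(2.0, 8.0)`, which is 0.25.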

๐Ÿ—๏ธ Architecture

The model (approximately 600M parameters) consists of five main components:

  1. Text Encoder: Token embeddings initialized from llm-jp/llm-jp-3-150m, followed by self-attention + SwiGLU transformer layers with RoPE.
  2. Reference Latent Encoder: Encodes patched reference audio latents for speaker identity conditioning.
  3. Caption Encoder: Encodes the style-control text (captions) to define the emotion, tone, and acoustic environment.
  4. Diffusion Transformer: Joint-attention DiT blocks combining text, reference, and caption conditioning with Low-Rank AdaLN, half-RoPE, and SwiGLU MLPs.
  5. Duration Predictor: Predicts audio duration from encoded text and conditioning vectors using stacked SwiGLU MLP blocks.

Audio is represented as continuous 32-dimensional latent sequences by the Aratako/Semantic-DACVAE-Japanese-32dim codec, which enables high-quality 48 kHz waveform reconstruction.
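
The rectified-flow sampling idea behind the Diffusion Transformer can be illustrated with a toy sampler. This is a minimal sketch, not the repository's code: it assumes the convention that t = 0 is Gaussian noise and t = 1 is data, uses plain Euler steps, and replaces the network with a hypothetical closed-form velocity field (`toy_velocity`) for a single target vector. In the actual model the velocity would be predicted by the DiT, conditioned on text, reference, and caption.

```python
import random
from typing import Callable, List

Vector = List[float]


def euler_sample(velocity: Callable[[Vector, float], Vector],
                 noise: Vector, num_steps: int) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t=0 (noise) to t=1 (data)."""
    x = list(noise)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def toy_velocity(target: Vector) -> Callable[[Vector, float], Vector]:
    """Exact rectified-flow velocity when the data is a single point.

    On the straight path x_t = (1 - t) * noise + t * target, the velocity
    target - noise equals (target - x_t) / (1 - t).
    """
    def velocity(x: Vector, t: float) -> Vector:
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity


rng = random.Random(0)
target = [0.5, -1.0, 2.0]                      # stand-in for one latent frame
noise = [rng.gauss(0.0, 1.0) for _ in target]  # x at t = 0
sample = euler_sample(toy_velocity(target), noise, num_steps=8)
```

Because the toy path is a straight line, the Euler steps land on `target` up to floating-point error; a learned velocity field is only approximately straight, which is why the number of steps trades quality against speed.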


🎧 Audio Samples

Note: To clearly demonstrate the effect of captions, the samples within each group below were generated using the exact same random seed. The variations in delivery are purely the result of the changed prompts.
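
Why a fixed seed isolates the effect of the prompt can be shown in a few lines. This sketch assumes that generation starts from Gaussian noise drawn from a seeded generator; `initial_noise` is a hypothetical helper, not a function from the project.

```python
import random
from typing import List


def initial_noise(seed: int, num_values: int) -> List[float]:
    """Draw the Gaussian starting noise for one generation from a fixed seed."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(num_values)]


# Same seed: identical starting point, so only the conditioning can differ.
run_a = initial_noise(seed=42, num_values=4)
run_b = initial_noise(seed=42, num_values=4)

# Different seed: a different starting point, which would confound the comparison.
run_c = initial_noise(seed=43, num_values=4)
```

Here `run_a` and `run_b` are identical while `run_c` differs, which is the property the sample groups rely on.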

1. Pure Voice Design (Text + Caption)

Generate diverse voices and styles purely through descriptive text captions without any reference audio.

| Text (Input) | Caption (Voice Design) | Generated Audio |
|---|---|---|
| 本日はお越しいただき、誠にありがとうございます。どうぞごゆっくりお過ごしください。 (Thank you very much for coming today. Please make yourself at home.) | 落ち着いた大人の男性。フォーマルな場で、深く響く声で丁寧かつ歓迎の意を込めて話している。 (A calm adult man. In a formal setting, he speaks politely and welcomingly in a deep, resonant voice.) | (audio) |
| 本日はお越しいただき、誠にありがとうございます。どうぞごゆっくりお過ごしください。 (Thank you very much for coming today. Please make yourself at home.) | 若く元気な女性の声。カフェの店員のように、明るくハキハキとした少し高めのトーンで話している。 (A young, energetic woman's voice. Like a café clerk, she speaks brightly and crisply in a slightly high tone.) | (audio) |
| すみません!この近くにコンビニってありますか?ちょっと急いでて、道に迷っちゃったみたいで (Excuse me! Is there a convenience store near here? I'm in a bit of a hurry and I seem to have gotten lost) | 低めの声の男性が、丁寧に道を尋ねている。穏やかで礼儀正しく、余裕のある口調。 (A man with a lowish voice politely asking for directions. A calm, courteous, unhurried tone.) | (audio) |
| すみません!この近くにコンビニってありますか?ちょっと急いでて、道に迷っちゃったみたいで (Excuse me! Is there a convenience store near here? I'm in a bit of a hurry and I seem to have gotten lost) | 若い女性が、慌てた様子で早口に話している。焦りと不安が声ににじんでいる。 (A young woman speaking quickly and in a fluster. Impatience and anxiety seep into her voice.) | (audio) |

2. Style-Controlled Voice Cloning (Text + Caption + Ref Speech)

Clone a voice using reference audio, and dictate the specific emotion or delivery style using a caption.

| Text (Input) | Ref Audio | Caption (Voice Design) | Generated Audio |
|---|---|---|---|
| どうしてもっと早く教えてくれなかったの?私、ずっと待ってたのに。 (Why didn't you tell me sooner? I was waiting the whole time.) | (audio) | 深く傷つき、今にも泣き出しそうな様子。声が震えており、悲痛なトーンで弱々しく話す。 (Deeply hurt and on the verge of tears. The voice trembles, speaking weakly in a sorrowful tone.) | (audio) |
| どうしてもっと早く教えてくれなかったの?私、ずっと待ってたのに。 (Why didn't you tell me sooner? I was waiting the whole time.) | (audio) | 激しい怒りを感じており、声を荒らげている。相手を責め立てるような強い口調で、感情的なトーン。 (Feeling intense anger and raising the voice. A strong, accusatory manner of speaking with an emotional tone.) | (audio) |
| どうしてもっと早く教えてくれなかったの?私、ずっと待ってたのに。 (Why didn't you tell me sooner? I was waiting the whole time.) | (audio) | 完全に呆れ返っている様子。感情の起伏が乏しく、冷たいトーンで静かに突き放すように話す。 (Completely exasperated. With little emotional inflection, speaking quietly and dismissively in a cold tone.) | (audio) |

3. Fully Controlled Generation (Text + Caption + Ref Speech + Emoji)

Combine all control vectors for maximum expressiveness, adding specific physiological sounds (sighs, coughs) or distinct nuances via emojis on top of the cloned and styled voice.

| Text (with Emoji) | Ref Audio | Caption (Voice Design) | Generated Audio |
|---|---|---|---|
| あははっ🤭、それ本当に言ってるの?…😮‍💨まぁ、君らしいけどね。 (Ahaha, are you serious? ... Well, that's just like you.) | (audio) | 余裕のある大人の男性。親しい相手に対して、くだけた雰囲気で呆れながらも楽しそうに話している。 (A composed adult man. Speaking casually to someone close, exasperated but clearly enjoying himself.) | (audio) |
| ゲホッ、ゲホッ🤧…ごめん、少し休ませて。😭今日はもう無理みたい。 (Cough, cough... Sorry, let me rest a bit. I don't think I can manage any more today.) | (audio) | 体調が悪く、非常に苦しそうな若い女性。息も絶え絶えに、申し訳なさそうに弱々しい声で話している。 (A young woman who is unwell and in great distress. Short of breath, she speaks apologetically in a weak voice.) | (audio) |

🚀 Usage

For inference code, installation instructions, and training scripts, please refer to the GitHub repository:

👉 GitHub: Aratako/Irodori-TTS

📊 Training Data & Annotation

The model was trained on an expanded, high-quality Japanese speech dataset. To enable the multi-modal Voice Design functionality, the training data was enriched with comprehensive text captions describing the audio characteristics.

The emoji annotations and initial text captions were generated and labeled using a fine-tuned model based on Qwen/Qwen3-Omni-30B-A3B-Instruct. Subsequently, the text captions were rephrased and refined using Qwen/Qwen3.5-35B-A3B.

โš ๏ธ Limitations

  • Japanese Only: This model currently supports Japanese text input only.
  • Conditioning Conflicts: When using both Reference Audio and a Text Caption, providing contradictory instructions (e.g., providing a deep male reference voice but captioning "a high-pitched young girl") may result in unstable audio quality, unnatural artifacts, or one condition overriding the other. For optimal results, use the caption to guide the emotion, style, or environment, while keeping the base voice characteristics aligned with the reference audio.
  • Prompt Adherence: While the model generally follows the caption's instructions, highly complex or contradictory descriptions might result in inconsistent voice generation.
  • Emoji Control: While emoji-based style control adds expressiveness, the effect may vary depending on context and is not always perfectly consistent.
  • Kanji Reading Accuracy: The model's ability to accurately read Kanji is relatively weak compared to other TTS models of a similar size. You may need to convert complex Kanji into Hiragana or Katakana beforehand.
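
One way to apply the Kanji workaround is a small pre-processing pass that swaps known trouble words for their kana readings before synthesis. This is a minimal sketch: the `READINGS` table and `replace_with_kana` are hypothetical, and the entries are only examples that you would replace with words the model actually misreads in your text.

```python
from typing import Dict

# Hypothetical reading table: extend it with words the model misreads.
READINGS: Dict[str, str] = {
    "雰囲気": "ふんいき",
    "礼儀": "れいぎ",
    "呆れ": "あきれ",
}


def replace_with_kana(text: str, readings: Dict[str, str]) -> str:
    """Replace each listed word with its kana reading, longest words first."""
    for word in sorted(readings, key=len, reverse=True):
        text = text.replace(word, readings[word])
    return text
```

Replacing longer words first avoids a short entry rewriting part of a longer one. A dictionary like this does not handle context-dependent readings, so treat it as a stopgap for specific words rather than a general reading converter.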

📜 License & Ethical Restrictions

License

This model is released under the MIT License.

Ethical Restrictions

In addition to the license terms, the following ethical restrictions apply:

  1. No Impersonation: Do not use this model to clone or impersonate the voice of any individual (e.g., voice actors, celebrities, public figures) without their explicit consent.
  2. No Misinformation: Do not use this model to generate deepfakes or synthetic speech intended to mislead others or spread misinformation.
  3. Voice Generation Disclaimer: When generating speech purely from text or captions without using a reference audio, it is possible that the generated voice may coincidentally resemble that of a real person. This is strictly a probabilistic artifact within the latent space. The model was not trained with the intent of reproducing specific individuals.
  4. Liability Disclaimer: The developers assume no liability for any misuse of this model. Users are solely responsible for ensuring their use of the generated content complies with applicable laws and regulations in their jurisdiction.

๐Ÿ™ Acknowledgments

This project builds upon the following works:

We would also like to extend our special thanks to Respair for the inspiration behind the emoji annotation feature, and to gabrielclark3330 for supporting this project.

๐Ÿ–Š๏ธ Citation

If you use Irodori-TTS in your research or project, please cite it as follows:

@misc{irodori-tts-v3-voicedesign,
  author = {Chihiro Arata},
  title = {Irodori-TTS: A Flow Matching-based Text-to-Speech Model with Emoji-driven Style Control},
  year = {2026},
  publisher = {Hugging Face},
  journal = {Hugging Face repository},
  howpublished = {\url{https://huggingface.co/Aratako/Irodori-TTS-600M-v3-VoiceDesign}}
}