GitHub: https://github.com/SWivid/F5-TTS
Paper: F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching
Demonstrations: https://swivid.github.io/F5-TTS/
Model Weights: https://huggingface.co/SWivid/F5-TTS
=====
From Vaibhav (VB) Srivastav:
- Trained on 100K hours of data
- Zero-shot voice cloning
- Speed control (based on total duration)
- Emotion-based synthesis
- Long-form synthesis
- Supports code-switching
- CC-BY license (commercially permissive)
=====
- Non-Autoregressive Design: Pads the text input with filler tokens so that it matches the length of the speech input, eliminating the need for separate components such as a duration model and a text encoder.
- Flow Matching with DiT: Employs flow matching with a Diffusion Transformer (DiT) for denoising and speech generation.
- ConvNeXt for Text: Uses ConvNeXt to refine the text representation, making it easier to align with the speech.
- Sway Sampling: Introduces an inference-time Sway Sampling strategy to boost performance and efficiency, applicable without retraining.
- Fast Inference: Achieves an inference Real-Time Factor (RTF) of 0.15, faster than state-of-the-art diffusion-based TTS models.
- Multilingual Zero-Shot: Trained on a 100K-hour multilingual dataset, the model demonstrates natural, expressive zero-shot speech, seamless code-switching, and efficient speed control.
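
Three of the points above (filler-token padding, Sway Sampling, and the Real-Time Factor) can be made concrete with a small sketch. This is an illustration only, not the code from the F5-TTS repository: the function names and the `FILLER_ID` value are hypothetical, and the sway function assumes the form described in the paper, t = u + s * (cos(pi/2 * u) - 1 + u), where a negative coefficient s shifts sampling toward the early flow steps.

```python
import math

FILLER_ID = 0  # hypothetical id reserved for the filler token


def pad_text_to_frames(char_ids, num_frames, filler_id=FILLER_ID):
    """Pad a character-id sequence with filler tokens up to the speech length.

    After padding, the text and speech sequences have the same length,
    so no separate duration model is needed to align them.
    """
    if len(char_ids) > num_frames:
        raise ValueError("text is longer than the target number of frames")
    return list(char_ids) + [filler_id] * (num_frames - len(char_ids))


def sway_sample(u, s=-1.0):
    """Map a uniform flow step u in [0, 1] to a swayed flow step.

    Assumed form: u + s * (cos(pi/2 * u) - 1 + u).
    s = 0 leaves u unchanged; s < 0 moves steps toward 0.
    """
    return u + s * (math.cos(math.pi / 2 * u) - 1 + u)


def sway_schedule(num_steps, s=-1.0):
    """Return num_steps + 1 swayed flow steps running from 0 to 1."""
    return [sway_sample(i / num_steps, s) for i in range(num_steps + 1)]


def real_time_factor(processing_seconds, audio_seconds):
    """RTF = time spent generating / duration of the generated audio."""
    return processing_seconds / audio_seconds


if __name__ == "__main__":
    # Three characters padded to six (hypothetical) mel frames.
    print(pad_text_to_frames([5, 3, 8], 6))
    # With s = -1 the steps are denser near 0 than a uniform grid.
    print([round(t, 3) for t in sway_schedule(4)])
    # An RTF of 0.15 corresponds to 1.5 s of compute for 10 s of audio.
    print(real_time_factor(1.5, 10.0))
```

Because the sway function only changes which flow steps the solver visits, it can be applied at inference time to an already trained model, which is consistent with the note above that no retraining is required.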




