CLONE your own voice using AI: F5-TTS Overview & Demo

GitHub: https://github.com/SWivid/F5-TTS
Paper: F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching
Demonstrations: https://swivid.github.io/F5-TTS/

Model Weights: https://huggingface.co/SWivid/F5-TTS

=====

From Vaibhav (VB) Srivastav:

Trained on 100K hours of data
Zero-shot voice cloning
Speed control (based on total duration)
Emotion-based synthesis
Long-form synthesis
Supports code-switching
CC-BY license (commercially permissive)
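
To make "speed control (based on total duration)" concrete, here is a minimal Python sketch of one way a total duration could be chosen. It assumes a simple proportional heuristic: the generated part gets as many frames per character as the reference clip, divided by a speed factor. The function name and parameters are hypothetical and are not taken from the F5-TTS code.

```python
def estimate_total_frames(ref_frames, ref_text_len, gen_text_len, speed=1.0):
    """Estimate the total number of speech frames (reference + generated).

    Assumption: the new text is spoken at the same frames-per-character
    rate as the reference clip, scaled by `speed` (2.0 = twice as fast).
    """
    if ref_text_len <= 0 or speed <= 0:
        raise ValueError("ref_text_len and speed must be positive")
    frames_per_char = ref_frames / ref_text_len
    gen_frames = int(frames_per_char * gen_text_len / speed)
    return ref_frames + gen_frames


# A 300-frame reference clip with 30 characters of text, and 60 new characters.
normal = estimate_total_frames(300, 30, 60)             # normal speed
faster = estimate_total_frames(300, 30, 60, speed=2.0)  # half as many new frames
print(normal, faster)
```

Because the model fills whatever length it is given, asking for a shorter total duration yields faster speech and a longer one yields slower speech.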

=====

  1. Non-Autoregressive Design: Pads the text with filler tokens so that it matches the speech length, removing the need for separate components such as a duration model and a text encoder.
  2. Flow Matching with DiT: Employs flow matching with a Diffusion Transformer (DiT) for denoising and speech generation.
  3. ConvNeXt for Text: Uses ConvNeXt to refine the text representation, making it easier to align with speech.
  4. Sway Sampling: Introduces an inference-time Sway Sampling strategy that improves performance and efficiency and can be applied without retraining.
  5. Fast Inference: Achieves an inference Real-Time Factor (RTF) of 0.15, faster than state-of-the-art diffusion-based TTS models.
  6. Multilingual Zero-Shot: Trained on a 100K-hour multilingual dataset; demonstrates natural, expressive zero-shot speech, seamless code-switching, and efficient speed control.
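
Three of the points above can be illustrated with a small Python sketch: filler-token padding, a sway-sampled step schedule, and the Real-Time Factor. It rests on several assumptions. The filler symbol `<F>` and all function names are hypothetical. The sway function is written in the form u + s * (cos(pi/2 * u) - 1 + u) as given in the F5-TTS paper, with s = -1 used here as an example coefficient. RTF is taken in its usual sense of processing time divided by audio duration. This is not the project's implementation.

```python
import math

FILLER = "<F>"  # hypothetical filler symbol, for illustration only


def pad_text_with_filler(text, n_frames, filler=FILLER):
    """Pad a character sequence with filler tokens up to the speech length."""
    chars = list(text)
    if len(chars) > n_frames:
        raise ValueError("text is longer than the target number of frames")
    return chars + [filler] * (n_frames - len(chars))


def sway_sample(u, s=-1.0):
    """Map a uniform flow step u in [0, 1] to a swayed step in [0, 1].

    With a negative coefficient s, steps are pushed toward the start of
    the flow, so more of the sampling budget is spent on early steps.
    """
    return u + s * (math.cos(math.pi / 2 * u) - 1 + u)


def sway_schedule(n_steps, s=-1.0):
    """Return n_steps + 1 swayed time points from 0 to 1."""
    return [sway_sample(k / n_steps, s) for k in range(n_steps + 1)]


def real_time_factor(processing_seconds, audio_seconds):
    """RTF = time spent generating / duration of the generated audio."""
    return processing_seconds / audio_seconds


print(pad_text_with_filler("hi", 5))
print([round(t, 3) for t in sway_schedule(4)])
print(real_time_factor(1.5, 10.0))  # 10 s of audio generated in 1.5 s
```

Under these definitions, an RTF of 0.15 means that ten seconds of speech take about 1.5 seconds to generate. The swayed schedule changes only where the sampler evaluates the model, which is why it needs no retraining.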