About the Website

This website presents a collection of interactive demonstrations showcasing various speech technology applications. The demonstrations highlight the capabilities of modern speech models and provide examples of their practical use.

Speech input can be provided either through your microphone or by uploading an audio file. To use the microphone option, please grant your browser permission to access your microphone.

Once an audio sample has been provided, you can analyze either the entire recording or selected portions of it. Simply drag across the waveform to choose a specific segment for analysis.

The website is currently under active development, and you may occasionally encounter bugs or temporary service interruptions.

We invite you to explore the demonstrations and learn more about current speech technology capabilities.

This demo lets you discover celebrities and public figures whose voices are most similar to yours using three separate datasets: VoxBlink2 (containing natural, in-the-wild speech recordings from 38,000 distinct individuals curated from YouTube), VoxCeleb (recordings from 5,994 distinct celebrities collected from YouTube), and the Finnish Parliament dataset (recordings from 198 members of parliament between 2009 and 2023).

To evaluate the speaker identification system, you can upload or record speech from individuals included in any of these datasets. You can also experiment with shorter speech segments to explore how much audio is needed for reliable speaker recognition.

In this demo, you can enroll and recognize the voices of up to 8 speakers.

The analysis window size can be adjusted between 500 ms and 3000 ms, which determines the length of the speech segment used to compute each data point in the result plot. Larger analysis windows generally produce more stable recognition scores but increase system latency. Smaller windows provide faster updates but may result in greater score variability.
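The trade-off between window size and score stability can be illustrated with a small sketch. It is not the demo's actual implementation: the 100 ms spacing of raw scores, the averaging step, and the name `windowed_scores` are assumptions made for illustration only.

```python
from statistics import mean, pstdev

FRAME_MS = 100  # assumption: one raw score every 100 ms (not stated by the demo)
MIN_WINDOW_MS, MAX_WINDOW_MS = 500, 3000  # adjustable range quoted above


def windowed_scores(frame_scores, window_ms, frame_ms=FRAME_MS):
    """Average raw scores over consecutive, non-overlapping windows (hypothetical helper)."""
    if not MIN_WINDOW_MS <= window_ms <= MAX_WINDOW_MS:
        raise ValueError("window_ms must be between 500 and 3000")
    per_window = window_ms // frame_ms
    last_start = len(frame_scores) - per_window
    return [
        mean(frame_scores[i:i + per_window])
        for i in range(0, last_start + 1, per_window)
    ]


# Toy input: six seconds of raw scores that flip between 0.2 and 0.8 on every frame.
toy_scores = [0.2 if i % 2 == 0 else 0.8 for i in range(60)]

short_points = windowed_scores(toy_scores, 500)   # 12 points, the first after 0.5 s
long_points = windowed_scores(toy_scores, 3000)   # 2 points, the first after 3 s

print(len(short_points), pstdev(short_points))
print(len(long_points), pstdev(long_points))
```

With this toy input, the 3000 ms window smooths the fluctuation away entirely but delivers its first data point only after three seconds, while the 500 ms window updates six times as often and keeps part of the fluctuation.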

This demo features an interactive 3D speaker map that visualizes similarity relationships between different voices. You can populate the map with speakers from the Finnish Parliament dataset, record or upload your own custom voice, and observe its placement in the 3D space. The graph dynamically draws colored spring lines connecting your voice to the closest parliamentary matches based on speaker similarity embeddings.
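As a rough sketch of how the closest matches could be found from speaker embeddings, the example below ranks enrolled speakers by cosine similarity. The choice of cosine similarity, the three-dimensional toy vectors, and the names `cosine_similarity` and `closest_speakers` are assumptions for illustration; the demo's actual scoring method is not described here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def closest_speakers(query, enrolled, k=3):
    """Return the k (name, similarity) pairs most similar to the query embedding."""
    scored = [(name, cosine_similarity(query, emb)) for name, emb in enrolled.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy embeddings; real speaker embeddings have far more dimensions.
enrolled = {
    "speaker_a": [1.0, 0.0, 0.0],
    "speaker_b": [0.0, 1.0, 0.0],
    "speaker_c": [0.7, 0.7, 0.0],
}
query = [0.9, 0.1, 0.0]

for name, score in closest_speakers(query, enrolled, k=2):
    print(name, round(score, 3))
```

In this toy setup the query lies closest to speaker_a, followed by speaker_c, so those two would be the ones connected to the user's voice on the map.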

This demo generates speech from text. You can synthesize speech in three modes:

  • Random Voice: Generate speech with a random synthetic voice.
  • Voice Clone: Clone a voice from a reference audio sample.
  • Voice Design: Describe the desired voice characteristics with text, and the model will attempt to match them.

This tool analyzes speech to estimate whether a recording is genuine or synthetic. It provides an overall artificiality score, along with time-resolved scores computed from two-second analysis windows with a one-second overlap.
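The windowing described above can be sketched as follows. The function name `analysis_windows` is hypothetical, and dropping a trailing partial window is an assumption; the tool may handle the end of a recording differently.

```python
WINDOW_S = 2.0   # window length stated above
OVERLAP_S = 1.0  # overlap stated above


def analysis_windows(duration_s, window_s=WINDOW_S, overlap_s=OVERLAP_S):
    """Return (start, end) times in seconds for each full analysis window."""
    hop_s = window_s - overlap_s  # each window starts one second after the previous one
    if hop_s <= 0:
        raise ValueError("overlap must be shorter than the window")
    windows = []
    start = 0.0
    # Assumption: a trailing window that would run past the end is dropped.
    while start + window_s <= duration_s:
        windows.append((start, start + window_s))
        start += hop_s
    return windows


print(analysis_windows(5.0))
```

Under these assumptions, a five-second recording yields four windows starting at 0, 1, 2, and 3 seconds, and a recording shorter than two seconds yields none.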

This page allows you to clone a reference voice and immediately verify the quality of the generated audio. For both the original reference audio and the generated synthetic clone, you can see how the speech is rated by integrated speaker verification and anti-spoofing systems, providing automated feedback on both speaker similarity and voice authenticity.

We prioritize your privacy and take appropriate measures to protect your data. When you record or upload audio files, they are securely transmitted to our server for processing. Once processing is complete, the audio files are immediately and permanently deleted. We do not store personal audio data or associated information. Text inputs provided to the speech synthesizer may be temporarily stored on the server.

About Speech Models

The celebrity voice search and real-time speaker identification demos employ the ReDimNet2-B2 model.

Implementation: https://github.com/PalabraAI/redimnet2

Publication: https://arxiv.org/abs/2603.11841

The text-to-speech demo employs the VoxCPM2 model.

Implementation: https://github.com/openbmb/VoxCPM

Publication: https://arxiv.org/abs/2509.24650

The spoofing detector uses the DF-Arena 500M V1 anti-spoofing model from Speech-Arena.

Model: https://huggingface.co/Speech-Arena-2025/DF_Arena_500M_V_1

Publication: https://arxiv.org/abs/2509.02859

About Datasets

The VoxBlink2 dataset is a large-scale, audio-visual speaker recognition corpus containing millions of natural speech segments collected from YouTube videos. In its original form, it features over 110,000 distinct speakers and realistic acoustic environments.

Website: https://voxblink2.github.io/

Publication: https://arxiv.org/abs/2407.11510

The VoxCeleb datasets are large-scale speaker identification datasets containing over a million utterances from celebrity voices extracted from YouTube videos.

Website: https://www.robots.ox.ac.uk/~vgg/data/voxceleb/

Publications:
https://www.robots.ox.ac.uk/~vgg/publications/2017/Nagrani17/nagrani17.pdf
https://www.robots.ox.ac.uk/~vgg/publications/2018/Chung18a/chung18a.pdf
https://www.robots.ox.ac.uk/~vgg/publications/2019/Nagrani19/nagrani19.pdf

The Finnish Parliament dataset is a self-collected dataset built from broadcasts of Finnish Parliament plenary sessions, featuring recordings from over 500 members of parliament.

Website: https://verkkolahetys.eduskunta.fi/fi/taysistunnot

Member Photos: https://www2.eduskunta.fi/FI/kansanedustajat/Sivut/edustajakuvat.aspx

About Tech Stack and Other Resources

About Us

Generalized Voice Anti-Spoofing and Voice Biometrics (SPEECHFAKES), Academy of Finland (09/2022 – 08/2026).

Q&A and Known Problems

The maximum recording duration is 30, 60, or 120 seconds, depending on the demo. Imported audio files are clipped to a maximum duration of 3 minutes. If an imported file is longer than the recording limit of the demo in use, only a portion no longer than that limit can be selected for analysis, rather than the entire file.
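The duration limits can be summarized in a short sketch. The function names are hypothetical, and because the limit that applies to each individual demo is not listed here, the demo limit is passed in by the caller.

```python
IMPORT_LIMIT_S = 180.0                   # imported files are clipped to 3 minutes
RECORDING_LIMITS_S = (30.0, 60.0, 120.0)  # possible per-demo recording limits


def clipped_import_duration(file_duration_s):
    """Duration of an imported file after clipping to the import limit."""
    return min(file_duration_s, IMPORT_LIMIT_S)


def max_selection_s(file_duration_s, demo_limit_s):
    """Longest segment that can be selected for analysis in a given demo."""
    if demo_limit_s not in RECORDING_LIMITS_S:
        raise ValueError("demo_limit_s must be 30, 60, or 120 seconds")
    return min(clipped_import_duration(file_duration_s), demo_limit_s)


print(max_selection_s(45.0, 30.0))   # a 45 s file in a demo with a 30 s limit
print(max_selection_s(20.0, 60.0))   # a 20 s file fits entirely
```

In the first call only 30 seconds of the file can be selected; in the second the whole 20-second file can be analyzed.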

The VoxCeleb dataset contains a number of labeling errors. Where the name and the voice do not match, the application compares the input speech to the voice heard in the video, rather than to that of the speaker indicated by the name. Labeling errors are less prevalent in the parliament dataset.

Ensure your microphone is not muted. Check whether the log at the bottom of the page contains helpful information for troubleshooting and resolving the issue.

Sometimes refreshing the page fixes the problem.

The frequency spectrum visualizer becomes active when the microphone is functioning properly.

Version History

  • Upgraded the speaker identification model to ReDimNet2-B2.
  • Added the new VoxBlink2 speaker recognition dataset.
  • Introduced the new Speaker Graph and Voice Clone & Verification demos.
  • Added the new text-to-speech demo using the VoxCPM2 model.
  • Removed the old speech synthesis demo based on SpeechT5 and the old voice conversion demo based on FreeVC.
  • Updated the anti-spoofing demo model to DF-Arena 500M V1.
  • Redesigned the entire UI with a unified neon visual theme.
  • Expanded the audio visualizer with toggleable view modes. The original frequency spectrum is now joined by new waveform and spectrogram views.
  • Added speaker images for the parliament dataset, with audio playback available by clicking the images.
  • Removed the gauge.js, js-cookie, jQuery, and Rocher font dependencies.

  • Bug fix: when a new audio file was loaded while a speech segment was selected, the computation used the selection from the previous audio file.
  • Updated anti-spoofing visuals.
  • Updated anti-spoofing model.

  • Added automatic switching from http to https to prevent recording problems.
  • Added version history.

Website ready.

Development began.