This website showcases a range of cutting-edge speech technology applications through interactive demonstrations. By leveraging the latest speech models, the platform offers insight into what current state-of-the-art technology can do.
With the website, you have the flexibility to input speech in two ways: either by utilizing your microphone or by importing audio files. To use the microphone feature, grant recording permissions to your browser.
Once you've provided the speech input, you can choose to analyze the entire audio or specific portions by simply dragging the waveform. This gives you precise control over the sections you want to explore and evaluate.
The website is still under development, so occasional hiccups and minor disruptions are to be expected.
We invite you to explore the potential of speech technology on our website!
This demo allows you to discover celebrity voices that sound similar to you using two separate datasets.
The first dataset consists of around 7,000 celebrities sourced from YouTube. It includes multiple video clips for each celebrity, resulting in a total of approximately 800,000 utterances for comparison with the input speech.
The second dataset encompasses over 500 Finnish parliamentary politicians spanning the years 2009 to 2023. Each politician is represented by multiple video clips, resulting in a total of around 10,000 clips available for comparison.
To test the accuracy of the speaker identification system, import or record the speech of individuals in the datasets. You can also explore the system's capabilities by selecting shorter portions of speech to determine the minimum audio clip length required for successful speaker recognition.
In this demo, you can enroll and recognize voices of up to 8 speakers. The analysis window size can be adjusted from 500 ms to 3000 ms, influencing the length of the speech segment used to compute each data point in the result plot. A longer analysis window provides greater recognition score stability but results in higher latency.
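As a rough illustration of how recognition works inside one analysis window, here is a minimal sketch in plain Python with NumPy. The embeddings here are toy 4-dimensional vectors and all names are ours; the actual system computes ECAPA-TDNN speaker embeddings, but the scoring idea (cosine similarity against each enrolled speaker) is the same.

```python
import numpy as np

def cosine_score(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def identify(segment_embedding, enrolled):
    """Return the enrolled speaker whose embedding best matches the
    embedding computed from one analysis window."""
    scores = {name: cosine_score(segment_embedding, emb)
              for name, emb in enrolled.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]

# Toy 4-dimensional "embeddings"; the real system uses ECAPA-TDNN vectors.
enrolled = {
    "speaker_1": np.array([1.0, 0.0, 0.0, 0.0]),
    "speaker_2": np.array([0.0, 1.0, 0.0, 0.0]),
}
window_embedding = np.array([0.9, 0.1, 0.0, 0.0])
name, score = identify(window_embedding, enrolled)  # speaker_1 scores highest
```

A longer analysis window averages over more speech, so the embedding and its score fluctuate less from one data point to the next, at the cost of waiting longer before each point appears in the plot.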
In this demo, you can input two speech clips and hear their speech content transformed into each other's voices. While the resulting voices may not precisely match the targets, the outcomes can nonetheless be intriguing. For even more intriguing results, we encourage you to experiment with converting music, such as electronic beats, into speech.
This demo allows you to synthesize speech from text. The synthesized speech mimics the voice of a user-given reference audio. Choose from three available accents: English, French, and Portuguese.
Leverage our fake speech detector to assess the authenticity of speech, distinguishing between genuine and fake recordings. The detector provides not only an overall artificiality score but also a score every 250 ms, computed over a 500 ms window.
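The windowing scheme described above (a score every 250 ms, each computed over 500 ms of audio) can be sketched in plain Python. The function and parameter names are ours, chosen for illustration:

```python
def analysis_windows(n_samples, sample_rate, win_ms=500, hop_ms=250):
    """Return (start, end) sample index pairs for overlapping windows:
    each window is win_ms long and starts hop_ms after the previous one."""
    win = sample_rate * win_ms // 1000
    hop = sample_rate * hop_ms // 1000
    return [(start, start + win)
            for start in range(0, n_samples - win + 1, hop)]

# One second of 16 kHz audio yields three overlapping 500 ms windows.
windows = analysis_windows(16000, 16000)
print(windows)  # [(0, 8000), (4000, 12000), (8000, 16000)]
```

Each window would then be passed to the detector, producing one artificiality score per 250 ms of input.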
At our website, we prioritize your privacy and take the necessary steps to safeguard your information. When you record or import audio files, they are securely transmitted to our server for processing. We want to assure you that once the computation is complete, these audio files are promptly and permanently deleted. We do not store any personal information or audio material.
The text inputs provided to the speech synthesizer are temporarily stored in the server's log files and are visible to the developer.
Our celebrity voice search and real-time speaker identification demos employ the ECAPA-TDNN model.
Implementation: https://huggingface.co/speechbrain/spkrec-ecapa-voxceleb
Publication: https://www.isca-speech.org/archive_v0/Interspeech_2020/pdfs/2650.pdf
The voice conversion demo employs the FreeVC model.
Implementation: https://github.com/coqui-ai/TTS
Publication: https://arxiv.org/pdf/2210.15418.pdf
The speech synthesis demo employs the YourTTS model.
Implementation: https://github.com/coqui-ai/TTS
Publication: https://arxiv.org/pdf/2112.02418.pdf
Our spoofing detector utilizes a customized (and mostly unoptimized) Wav2Vec 2.0 - AASIST anti-spoofing model. The model is trained on one-second-long audio segments, and 80% of the training samples are augmented with RawBoost. The model is trained using the following datasets:
Original reference implementation of the model: https://github.com/TakHemlata/SSL_Anti-spoofing
Related publication: https://arxiv.org/pdf/2202.12233
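As an illustration of the training-data preparation described above, here is a hedged sketch in plain Python: it cuts a waveform into one-second segments and flags roughly 80% of them for augmentation. The RawBoost transform itself is not implemented; the boolean flag merely stands in for it, and all names and the 16 kHz sample rate are our assumptions, not details of the actual training pipeline.

```python
import random

SAMPLE_RATE = 16000  # assumed sample rate; one segment = one second of audio
AUG_PROB = 0.8       # fraction of training segments to augment

def training_segments(waveform, rng=None):
    """Cut `waveform` (a sequence of samples) into non-overlapping
    one-second segments and mark ~80% of them for RawBoost-style
    augmentation (represented here only by a boolean flag)."""
    rng = rng or random.Random()
    segments = []
    for start in range(0, len(waveform) - SAMPLE_RATE + 1, SAMPLE_RATE):
        segment = waveform[start:start + SAMPLE_RATE]
        segments.append((segment, rng.random() < AUG_PROB))
    return segments

# Three seconds of silence -> three one-second segments, each possibly flagged.
segs = training_segments([0.0] * (3 * SAMPLE_RATE))
```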
The celebrity voice search employs the VoxCeleb datasets.
Website: https://www.robots.ox.ac.uk/~vgg/data/voxceleb/
Publications:
https://www.robots.ox.ac.uk/~vgg/publications/2017/Nagrani17/nagrani17.pdf
https://www.robots.ox.ac.uk/~vgg/publications/2018/Chung18a/chung18a.pdf
https://www.robots.ox.ac.uk/~vgg/publications/2019/Nagrani19/nagrani19.pdf
The Finnish politician voice search employs a self-collected dataset from the Finnish parliament website.
Ville Vestman, PhD
Website: https://cs.uef.fi/~vvestman/
Google Scholar: https://scholar.google.com/citations?user=aPZBcWgAAAAJ
Professor Tomi H. Kinnunen, PhD
Website: http://cs.joensuu.fi/pages/tkinnu/webpage/
Google Scholar: https://scholar.google.fi/citations?user=e3SPjpoAAAAJ
Generalized Voice Anti-Spoofing and Voice Biometrics (SPEECHFAKES), Academy of Finland (09/2022 – 08/2026).
Ensure your mic is not muted. Check whether the log at the bottom of the page contains helpful information for troubleshooting and resolving the issue.
Sometimes refreshing the page fixes the problem.
The frequency spectrum visualizer becomes active when the microphone is functioning properly.
If you are using Firefox, this issue is caused by a bug in the browser, specifically related to the dynamic creation of an import map. To resolve this, try refreshing the page a few times.
Also ensure that you are using a secure connection (check that the URL starts with https://), as the site will not function without one.