About

About the Website

Website Overview

This website showcases a range of speech technology applications through interactive demonstrations. The demos run on recent speech models and give a sense of what current state-of-the-art technology can do.

You can input speech in two ways: by recording with your microphone or by importing audio files. To use the microphone, grant your browser permission to record.

Once you have provided the speech input, you can analyze the entire audio or select a specific portion by dragging across the waveform. This gives you precise control over the sections you want to explore and evaluate.

The website is still under development, so occasional hiccups and minor disruptions are to be expected.

We invite you to explore the potential of speech technology on our website!

Celebrity Voice Search

This demo lets you discover celebrity voices that sound similar to yours, using two separate datasets.

The first dataset consists of around 7,000 celebrities sourced from YouTube. It includes multiple video clips for each celebrity, resulting in a total of approximately 800,000 utterances for comparison with the input speech.

The second dataset encompasses over 500 Finnish parliamentary politicians spanning the years 2009 to 2023. Each politician is represented by multiple video clips, resulting in a total of around 10,000 clips available for comparison.

To test the accuracy of the speaker identification system, import or record the speech of individuals in the datasets. You can also explore the system's capabilities by selecting shorter portions of speech to determine the minimum audio clip length required for successful speaker recognition.
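One common way to run this kind of search is to reduce each utterance to a fixed-length speaker embedding and rank the dataset entries by cosine similarity to the embedding of the input. The sketch below assumes that approach; the scoring method actually used by this website is not described here, and the function names and toy three-dimensional embeddings are purely illustrative.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_speakers(query, enrolled, top_k=3):
    """Return the top_k (name, score) pairs, most similar first."""
    scored = [(name, cosine_similarity(query, emb)) for name, emb in enrolled.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]

# Hypothetical toy embeddings; real speaker embeddings have far more dimensions.
toy_enrolled = {
    "speaker_a": [1.0, 0.0, 0.0],
    "speaker_b": [0.0, 1.0, 0.0],
    "speaker_c": [0.7, 0.7, 0.0],
}
ranking = rank_speakers([0.9, 0.1, 0.0], toy_enrolled, top_k=2)
print([name for name, _ in ranking])
```

With a shorter selected portion of speech, the input embedding is estimated from less audio, so the top-ranked name tends to become less reliable.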

Real-Time Speaker Identifier

In this demo, you can enroll and recognize the voices of up to 8 speakers. The analysis window size can be adjusted from 500 ms to 3000 ms; it sets the length of the speech segment used to compute each data point in the result plot. A longer analysis window yields more stable recognition scores but increases latency.
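To make the window size concrete, the sketch below picks the most recent stretch of audio that would feed one data point. It is an illustration only: the 16 kHz sample rate, the clamping behavior, and the function name are assumptions, not details taken from the site's implementation.

```python
MIN_WINDOW_MS = 500
MAX_WINDOW_MS = 3000

def latest_window(samples, sample_rate, window_ms):
    """Return the most recent window_ms of audio, clamped to the 500-3000 ms range."""
    window_ms = max(MIN_WINDOW_MS, min(MAX_WINDOW_MS, window_ms))
    n_samples = int(sample_rate * window_ms / 1000)
    # If fewer samples are buffered than requested, the whole buffer is returned.
    return samples[-n_samples:]

# Four seconds of dummy audio at an assumed 16 kHz sample rate.
buffer = list(range(16000 * 4))
short = latest_window(buffer, 16000, 500)
long = latest_window(buffer, 16000, 3000)
print(len(short), len(long))
```

At 16 kHz, the 3000 ms window covers six times as many samples as the 500 ms window, so each score draws on more speech but also reacts more slowly when the speaker changes.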

Voice Converter

In this demo, you can input two speech clips and hear the speech content of each clip converted into the voice of the other. The resulting voices may not precisely resemble the target voices, but the outcomes can still be intriguing. For more intriguing results, we encourage you to experiment with converting music, such as electronic beats, into speech.

Speech Synthesizer

This demo lets you synthesize speech from text. The synthesized speech mimics the voice in the reference audio you provide. Choose from three available accents: English, French, and Portuguese.

Anti-Spoofing Tool

Use our fake speech detector to assess whether a recording is genuine or fake. The detector provides an overall artificiality score as well as a score every 250 ms, each computed over a 500 ms window.
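A 500 ms window that advances every 250 ms means consecutive windows overlap by half. The sketch below lists the window boundaries for a clip of a given length. It assumes windows start at 0 ms and that a trailing partial window is dropped; how the site treats the end of the clip is not stated, and the function name is hypothetical.

```python
def window_bounds(duration_ms, window_ms=500, hop_ms=250):
    """Return (start_ms, end_ms) pairs for all full windows that fit in the clip."""
    bounds = []
    start = 0
    while start + window_ms <= duration_ms:
        bounds.append((start, start + window_ms))
        start += hop_ms
    return bounds

one_second = window_bounds(1000)
print(one_second)
```

Under these assumptions, a one-second clip yields three windows and a two-second clip yields seven.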

Privacy Policy

On our website, we prioritize your privacy and take the necessary steps to safeguard your information. When you record or import audio files, they are securely transmitted to our server for processing. Once the computation is complete, these audio files are promptly and permanently deleted. We do not store any personal information or audio material.

Text inputs provided to the speech synthesizer are temporarily stored in the server's log files, where they are visible to the developer.

About Speech Models

Speaker Identification Model

Our celebrity voice search and real-time speaker identification demos employ the ECAPA-TDNN model.

Implementation: https://huggingface.co/speechbrain/spkrec-ecapa-voxceleb

Publication: https://www.isca-speech.org/archive_v0/Interspeech_2020/pdfs/2650.pdf

Voice Conversion Model

The voice conversion demo employs the FreeVC model.

Implementation: https://github.com/coqui-ai/TTS

Publication: https://arxiv.org/pdf/2210.15418.pdf

Speech Synthesis Model

The speech synthesis demo employs the YourTTS model.

Implementation: https://github.com/coqui-ai/TTS

Publication: https://arxiv.org/pdf/2112.02418.pdf

Anti-spoofing Model

Our spoofing detector uses a customized (and mostly unoptimized) Wav2Vec 2.0 - AASIST anti-spoofing model. The model is trained on one-second-long audio segments, and 80% of the training samples are augmented with RawBoost. The model is trained using the following datasets:

Original reference implementation of the model: https://github.com/TakHemlata/SSL_Anti-spoofing

Related publication: https://arxiv.org/pdf/2202.12233

About Datasets

YouTube Data

The celebrity voice search employs the VoxCeleb datasets.

Website: https://www.robots.ox.ac.uk/~vgg/data/voxceleb/

Publications:
https://www.robots.ox.ac.uk/~vgg/publications/2017/Nagrani17/nagrani17.pdf
https://www.robots.ox.ac.uk/~vgg/publications/2018/Chung18a/chung18a.pdf
https://www.robots.ox.ac.uk/~vgg/publications/2019/Nagrani19/nagrani19.pdf

Finnish Parliament Data

The celebrity voice search also uses a self-collected dataset from the Finnish parliament website.

Website: https://verkkolahetys.eduskunta.fi/fi/taysistunnot

About Tech Stack and Other Resources

Website

JavaScript ES10
Python web framework: Flask
Front-end framework: Bootstrap 5
jQuery
Cookies: js-cookie
Waveforms and spectrograms: WaveSurfer.js
Gauge: gauge.js
Icons: Ionicons
Font: Rocher Color
Text polishing: ChatGPT

Backend

Machine learning framework: PyTorch
Speech AI toolkit: SpeechBrain
Interprocess communication: ZeroMQ
Scientific computing: NumPy
Audio functions: Librosa

Useful Resources

Flask tutorials: YouTube link
Animated text tutorial: YouTube link

About Us

Developer

Ville Vestman, PhD

Website: https://cs.uef.fi/~vvestman/

Google Scholar: https://scholar.google.com/citations?user=aPZBcWgAAAAJ

Project Lead

Professor Tomi H. Kinnunen, PhD

Website: http://cs.joensuu.fi/pages/tkinnu/webpage/

Google Scholar: https://scholar.google.fi/citations?user=e3SPjpoAAAAJ

Project

Generalized Voice Anti-Spoofing and Voice Biometrics (SPEECHFAKES), Academy of Finland (09/2022 – 08/2026).

Research Group

Computational Speech Group at the School of Computing, University of Eastern Finland.

Q&A and Known Problems

Is there a maximum duration limit for the recordings?

The maximum recording duration is 30, 60, or 120 seconds, depending on the demo. Imported audio files are clipped to a maximum duration of 3 minutes. If an imported file is longer than the demo's recording limit, you can select only a shorter portion for analysis, not the entire file.
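The limits above can be sketched as a small validity check. This is one reading of the rules, not the site's code: `demo_limit_s` stands for the 30, 60, or 120 second limit of whichever demo is in use (which demo has which limit is not stated), and all function names are hypothetical.

```python
IMPORT_CLIP_LIMIT_S = 180  # imported files are clipped to 3 minutes

def usable_import_duration(file_duration_s):
    """Duration of an imported file after clipping to the 3-minute cap."""
    return min(file_duration_s, IMPORT_CLIP_LIMIT_S)

def must_select_portion(file_duration_s, demo_limit_s):
    """True when the clipped file is too long to analyze in its entirety."""
    return usable_import_duration(file_duration_s) > demo_limit_s

def is_valid_selection(start_s, end_s, file_duration_s, demo_limit_s):
    """True when the selection lies inside the clipped file and fits the demo limit."""
    usable = usable_import_duration(file_duration_s)
    return 0 <= start_s < end_s <= usable and (end_s - start_s) <= demo_limit_s

# A 90-second import in a demo with an assumed 60-second limit.
print(must_select_portion(90, 60), is_valid_selection(10, 50, 90, 60))
```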

There is a mismatch between the celebrity's name and the actual speaker in the celebrity demo.

The VoxCeleb dataset contains a number of labeling errors. In cases where there is a discrepancy between the name and the voice, the application compares the input speech to the voice in the video, rather than the speaker indicated by the name. Labeling errors are less prevalent in the parliament dataset.

I recorded some speech, but when playing the recorded audio, I cannot hear anything!

Ensure your microphone is not muted. Check whether the log at the bottom of the page contains information that helps with troubleshooting.

Sometimes refreshing the page fixes the problem.

The frequency spectrum visualizer becomes active when the microphone is functioning properly.

Buttons do not show up on the page (the page is not loading properly)!

If you are using Firefox, a browser bug related to the dynamic creation of an import map can cause this issue. To resolve it, try refreshing the page a few times.

Also ensure that you are using a secure connection (check that the URL starts with https://), as the site will not function without it.

Version History

August 2024

Bug fix: when a new audio file was loaded while a speech segment was selected, the computation used the selection from the previous audio file.
Updated anti-spoofing visuals.
Updated anti-spoofing model.

September 2023

Automatically change http to https to prevent recording problems.
Version history added.

July 2023

Website ready.

May 2023

Development began.

About the Website

Website Overview

Celebrity Voice Search

Real-Time Speaker Identifier

Voice Converter

Speech Synthesizer

Anti-Spoofing Tool

Privacy Policy

About Speech Models

Speaker Identification Model

Voice Conversion Model

Speech Synthesis Model

Anti-spoofing Model

About Datasets

YouTube Data

Finnish Parliament Data

About Tech Stack and Other Resources

Website

Backend

Useful Resources

About Us

Developer

Project Lead

Project

Research Group

Q&A and Known Problems

Is there a maximum duration limit for the recordings?

There is a mismatch between the celebrity's name and the actual speaker in the celebrity demo.

I recorded some speech, but when playing the recorded audio, I cannot hear anything!

Buttons do not show up on the page (the page is not loading properly)!

Version History

August 2024

September 2023

July 2023

May 2023