Publications

FastSpeechStyle: Fast, Emotion Controllable and High-Quality Speech Synthesis

Published in Journal, 2023

Non-autoregressive text-to-speech models such as FastSpeech2 can synthesize high-quality speech quickly and also allow explicit control of the pitch, energy, and speed of the speech signal. However, controlling emotion while maintaining natural, human-like speech is still a problem. In this paper, we propose an expressive speech synthesis model that can synthesize high-quality speech with a desired emotion. The proposed model includes two main components: (1) a Mel Emotion Encoder, which extracts an emotion embedding from the Mel-spectrogram of the audio, and (2) FastSpeechStyle, a non-autoregressive model modified from vanilla FastSpeech2. FastSpeechStyle replaces standard LayerNorm with Style-Adaptive LayerNorm to “shift” and “scale” hidden features according to the emotion embedding, and uses an improved Conformer block instead of the vanilla FFT block to better model local and global dependencies in the acoustic model.
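
The Style-Adaptive LayerNorm idea is simple enough to sketch. Below is a minimal, illustrative PyTorch version, not the paper's released code; the class name, `hidden_dim`, and `style_dim` are assumptions. The emotion embedding is projected to a per-feature scale and shift that replace LayerNorm's fixed affine parameters.

```python
import torch
import torch.nn as nn

class StyleAdaptiveLayerNorm(nn.Module):
    """LayerNorm whose scale/shift come from a style (emotion) embedding.

    Minimal sketch of the idea described in the abstract; dimensions and
    names are illustrative assumptions, not the released implementation.
    """

    def __init__(self, hidden_dim: int, style_dim: int):
        super().__init__()
        # Normalize without learned affine parameters ...
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        # ... and predict per-feature scale and shift from the style vector.
        self.affine = nn.Linear(style_dim, 2 * hidden_dim)

    def forward(self, x: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, hidden_dim); style: (batch, style_dim)
        scale, shift = self.affine(style).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)

# Example: condition a hidden sequence on an emotion embedding.
saln = StyleAdaptiveLayerNorm(hidden_dim=256, style_dim=128)
h = torch.randn(2, 50, 256)    # encoder hidden states
emotion = torch.randn(2, 128)  # output of the Mel Emotion Encoder
out = saln(h, emotion)         # (2, 50, 256)
```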

Authors: Van Thinh Nguyen, Tri-Nhan Do, Hung-Cuong Pham, Tuan Vu Ho, Ngoc-Minh-Khanh Nguyen, Dang-Khoa Mac

Sound Event Detection with Soft Labels using Self-Attention Mechanisms for Global Scene Feature Extraction

Published in Detection and Classification of Acoustic Scenes and Events 2023, 2023

This paper presents our approach to Task 4b of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2023 Challenge, which focuses on Sound Event Detection with Soft Labels. Our proposed method builds upon a CRNN backbone model and leverages data augmentation techniques to improve model robustness. Furthermore, we introduce self-attention mechanisms to capture global context information and enhance the model's ability to predict soft-label segments more accurately. Our experiments demonstrate that incorporating soft labels and self-attention mechanisms results in significant performance gains over traditional methods on data varying across different scenarios.
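
As a rough illustration of the idea, the sketch below adds multi-head self-attention on top of recurrent frame features so every frame can attend to the whole clip; the backbone stand-in, dimensions, head count, and class count are assumptions, not the challenge submission.

```python
import torch
import torch.nn as nn

class GlobalContextSED(nn.Module):
    """Sketch: self-attention over CRNN frame features for soft-label SED.

    Illustrative only; sizes and `num_classes` are assumed values.
    """

    def __init__(self, feat_dim: int = 256, num_classes: int = 11):
        super().__init__()
        # Stand-in for the CRNN backbone's recurrent stage.
        self.rnn = nn.GRU(64, feat_dim // 2, batch_first=True, bidirectional=True)
        # Self-attention lets each frame see the whole clip, injecting
        # global scene information into local frame features.
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=4, batch_first=True)
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, time, 64) frame features from the CNN stage.
        h, _ = self.rnn(feats)
        ctx, _ = self.attn(h, h, h)  # global context per frame
        h = h + ctx                  # residual fusion
        # Sigmoid outputs in [0, 1] can be trained directly against soft labels.
        return torch.sigmoid(self.head(h))

probs = GlobalContextSED()(torch.randn(2, 100, 64))  # (2, 100, 11)
```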

Authors: Nhan Tri-Do, Param Biyani, Zhang Yuxuan, Andrew Koh Jin Jie, Chng Eng Siong

Vietnamese Speech-based Question Answering over Car Manuals

Published in Intelligent User Interfaces Conference, 2022

This paper presents QA-CarManual, a novel Vietnamese speech-based question answering system that enables users to ask car-manual-related questions (e.g. how to properly operate devices and/or utilities within a car). Given a car manual written in Vietnamese as the main knowledge base, we develop QA-CarManual as a lightweight, real-time, and interactive system that integrates state-of-the-art language and speech processing technologies to (i) understand and interact with users via speech commands and (ii) automatically query the knowledge base and return answers as text, speech, and visualization. To the best of our knowledge, QA-CarManual is the first Vietnamese question answering system that interacts with users via speech inputs and outputs. We perform a human evaluation to assess the quality of our QA-CarManual system and obtain promising results.
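
The interaction loop described above can be summarized in a short Python sketch; every function here is a hypothetical stub standing in for the actual Vietnamese ASR, retrieval-based QA, and TTS modules, which the abstract does not specify at this level.

```python
# Minimal sketch of the interaction loop: speech in, text + speech out.

def speech_to_text(audio: bytes) -> str:
    """Stub for the ASR module ((i) speech command understanding)."""
    return "cách bật đèn pha"  # "how to turn on the headlights"

def query_knowledge_base(manual: dict, question: str) -> str:
    """Stub for QA over the manual: naive keyword lookup."""
    return next((ans for key, ans in manual.items() if key in question),
                "Không tìm thấy câu trả lời.")  # "No answer found."

def text_to_speech(text: str) -> bytes:
    """Stub for the TTS module returning synthesized audio."""
    return text.encode("utf-8")

def answer_question(audio: bytes, manual: dict) -> tuple:
    question = speech_to_text(audio)                 # (i) speech input
    answer = query_knowledge_base(manual, question)  # (ii) query the manual
    return answer, text_to_speech(answer)            # text and speech output

manual = {"đèn pha": "Xoay công tắc trên cần gạt bên trái."}
text, audio = answer_question(b"...", manual)
print(text)
```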

Authors: Tin Duy Vo, Manh Tien Luong, Duong Minh Le, Hieu Minh Tran, Nhan Tri Do, Tuan-Duy Hien Nguyen, Hung Hai Bui, Dat Quoc Nguyen, Dinh Quoc Phung

DeepSpeechVC: Voice Cloning Framework with Speech Synthesis and Voice Conversion - Experiments with Speech2Speech techniques to perform voice conversion from a small sample of the target speaker's voice

Published in Thesis, 2021

Motivated by the goal of reconstructing voices for people who have become mute after an accident, or for deceased persons of whom only a small amount of voice data remains, this thesis conducts experiments applying new technologies and artificial neural networks to synthesize Vietnamese speech. Unlike traditional voice cloning, where speech synthesis models require a high-quality single-speaker dataset, we propose a new voice cloning system that can synthesize any person's voice from only a few audio samples of that person's voice.
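
One common way to realize this kind of few-shot cloning, and a plausible reading of the approach, is to condition a synthesizer on a speaker embedding computed from the reference samples. The sketch below illustrates that idea only, with assumed architecture and sizes, not the DeepSpeechVC implementation.

```python
import torch
import torch.nn as nn

class SpeakerEncoder(nn.Module):
    """Sketch: map a few seconds of reference audio to a speaker embedding.

    Illustrative of the few-shot cloning idea; sizes are assumptions.
    """

    def __init__(self, mel_dim: int = 80, emb_dim: int = 256):
        super().__init__()
        self.rnn = nn.LSTM(mel_dim, emb_dim, batch_first=True)

    def forward(self, ref_mels: torch.Tensor) -> torch.Tensor:
        # ref_mels: (batch, time, mel_dim) from a few sample utterances.
        _, (h, _) = self.rnn(ref_mels)
        # The L2-normalized final state characterizes the target voice.
        return nn.functional.normalize(h[-1], dim=-1)

# A synthesizer conditioned on this embedding can then produce speech in
# the target voice without retraining on a large single-speaker corpus.
emb = SpeakerEncoder()(torch.randn(1, 300, 80))  # (1, 256)
```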

Authors: Tri-Nhan Do, Minh-Tri Nguyen, under the supervision of Prof. Vu Hai Quan and MSc. Xuan-Nam Cao

SpeedySpeech Model with Normalization for Faster Vietnamese Speech Synthesis

Published in VNUHCM-US_CONF_2020, 2020

End-to-end speech synthesis models have shown better results than traditional methods in terms of intelligibility and naturalness in recent years. However, these network models require a lot of data and training time, and the inference process consumes substantial GPU resources. With the innovations of SpeedySpeech, synthesis time is shortened and inference can run in real time on a CPU. We apply this model to Vietnamese by normalizing the input text, replacing it with suitable Vietnamese phonemes, and improving the embedding layer. The results show that SpeedySpeech's performance approaches that of the Tacotron2 model with significantly shorter training time: training SpeedySpeech on Colab takes only 19 hours, compared to 240 hours for Tacotron2.
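
A toy version of the input pipeline the abstract describes, with invented normalization rules and lexicon entries, might look like this:

```python
# Normalize raw Vietnamese text, then replace syllables with phonemes
# before the embedding layer. Rules and lexicon are invented examples.

NUMBER_WORDS = {"1": "một", "2": "hai", "3": "ba"}

def normalize(text: str) -> str:
    # Expand digits to words; the real normalizer also handles dates,
    # units, abbreviations, and so on.
    return " ".join(NUMBER_WORDS.get(tok, tok) for tok in text.split())

def to_phonemes(text: str, lexicon: dict) -> list:
    # Map each syllable to Vietnamese phonemes via a lexicon lookup.
    return [p for syl in normalize(text).split() for p in lexicon.get(syl, [syl])]

lexicon = {"một": ["m", "o", "t6"], "hai": ["h", "a", "i"]}
print(to_phonemes("1 2", lexicon))  # ['m', 'o', 't6', 'h', 'a', 'i']
```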

Authors: Do Tri Nhan, Nguyen Minh Tri, Cao Xuan Nam

Vietnamese Speech Synthesis with End-to-End Model and Text Normalization

Published in 7th NAFOSTED Conference on Information and Computer Science (NICS), 2020

Speech synthesis systems are now getting smarter and more natural thanks to the power of deep neural networks. However, each language has different phonological and contextual characteristics, so we have conducted experiments, gathered statistics, and applied Vietnamese phonetics to improve speech synthesis systems based on the Tacotron2 neural network. Our methods achieve 97% accuracy on the text normalization task, and the synthesized speech obtains a MOS of 3.97, approaching the 4.43 of directly recorded voices. We also provide Vinorm, a library for normalizing Vietnamese text, and Viphoneme, a package that converts text into a phonemic format used as input for end-to-end neural networks, making the synthesis process faster and the output more intelligible and natural than with character inputs.
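
Both packages are published on PyPI, so the front end can be sketched directly. The snippet below assumes the interfaces their READMEs document (vinorm's TTSnorm and viphoneme's vi2IPA); verify the names against the current releases.

```python
# Hedged sketch of the TTS front end using the two released packages
# (pip install vinorm viphoneme); function names assumed from the READMEs.

from vinorm import TTSnorm
from viphoneme import vi2IPA

raw = "Ngày 20/11, nhiệt độ là 25 độ C."
normalized = TTSnorm(raw)      # digits, dates, units expanded to words
phonemes = vi2IPA(normalized)  # phonemic sequence fed to the acoustic model
print(normalized)
print(phonemes)
```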

Authors: Do Tri Nhan, Nguyen Minh Tri, Cao Xuan Nam

HCMUS at MediaEval 2020: Emotion Classification Using Wavenet Features with SpecAugment and EfficientNet

Published in MediaEval’20, 2020

MediaEval 2020 provided a subset of the MTG-Jamendo dataset, aimed at recognizing mood and theme in music. Team HCMUS proposes several solutions to build efficient classifiers for this problem. In addition to mel-spectrogram features, new features extracted from a WaveNet model are utilized to train the EfficientNet model. As evaluated by the jury, our best result achieved 0.142 in PR-AUC and 0.76 in ROC-AUC. With fast training and lightweight features, our proposed methods have the potential to work well with deeper neural networks.
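
The SpecAugment step is standard and easy to reproduce. The sketch below uses torchaudio's masking transforms with illustrative mask sizes, not necessarily the values used in the submission.

```python
import torch
import torchaudio.transforms as T

# SpecAugment-style masking on a (channel, mel_bins, frames) spectrogram.
specaugment = torch.nn.Sequential(
    T.FrequencyMasking(freq_mask_param=16),  # zero a random band of mel bins
    T.TimeMasking(time_mask_param=40),       # zero a random span of frames
)

mel = torch.randn(1, 96, 1400)  # mel-spectrogram (or WaveNet-derived features)
augmented = specaugment(mel)    # input to the EfficientNet classifier
```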

Authors: Tri-Nhan Do, Minh-Tri Nguyen, Hai-Dang Nguyen, Minh-Triet Tran, Xuan-Nam Cao