To access the full text documents, please follow this link: http://hdl.handle.net/2117/123773

The use of long-term features for GMM- and i-vector-based speaker diarization systems
Zewoudie, Abraham Woubie; Luque, Jordi; Hernando Pericás, Francisco Javier
Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions; Universitat Politècnica de Catalunya. VEU - Grup de Tractament de la Parla
Several factors contribute to the performance of speaker diarization systems. The selection of appropriate speech features is one of the key factors; others include the techniques used for segmentation and clustering. While static mel-frequency cepstral coefficients are the most widely used features in speech-related tasks, including speaker diarization, several studies have shown the benefits of augmenting the static features with additional speech features. In this work, we propose and assess the use of voice-quality features (i.e., jitter, shimmer, and Glottal-to-Noise Excitation ratio) within the framework of speaker diarization. These acoustic attributes are employed together with state-of-the-art short-term cepstral and long-term prosodic features. Additionally, the use of delta dynamic features is explored separately for the segmentation and bottom-up clustering sub-tasks. The different feature sets are combined at several levels. At the feature level, the long-term speech features are stacked in the same feature vector. At the score level, the short- and long-term speech features are modeled independently and fused at the score (likelihood) level. Various feature combinations have been applied to both Gaussian mixture modeling (GMM) and i-vector-based speaker diarization systems. The experiments have been carried out on the Augmented Multi-party Interaction (AMI) meeting corpus. The best result, in terms of diarization error rate (DER), is obtained using i-vector-based cosine-distance clustering together with a signal parameterization that combines static cepstral coefficients, delta, voice-quality, and prosodic features. This result represents a relative DER improvement of about 24% over the baseline system, which is based on Gaussian mixture modeling and short-term static cepstral coefficients.
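The two fusion levels and the cosine-distance measure mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the fusion weight `alpha`, and the weighted-sum fusion rule are assumptions chosen for illustration, since the abstract does not specify them. The DER values used in the usage lines are made-up numbers that only demonstrate how a relative improvement is computed.

```python
import math


def stack_features(stream_a, stream_b):
    """Feature-level fusion: concatenate two per-frame feature streams.

    Both streams are lists of per-frame vectors and are assumed to be
    already aligned to the same frame rate (an assumption of this sketch).
    """
    if len(stream_a) != len(stream_b):
        raise ValueError("feature streams must have the same number of frames")
    return [list(a) + list(b) for a, b in zip(stream_a, stream_b)]


def fuse_scores(short_term_ll, long_term_ll, alpha=0.9):
    """Score-level fusion: weighted sum of two log-likelihood scores.

    The linear rule and the default weight are hypothetical; they stand in
    for whatever fusion rule a given system actually uses.
    """
    return alpha * short_term_ll + (1.0 - alpha) * long_term_ll


def cosine_distance(u, v):
    """Cosine distance between two vectors (e.g. two cluster i-vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_u * norm_v)


def relative_improvement(baseline_der, system_der):
    """Relative DER improvement over a baseline, in percent."""
    return 100.0 * (baseline_der - system_der) / baseline_der


if __name__ == "__main__":
    # Toy inputs, not data from the paper.
    print(stack_features([[1.0, 2.0]], [[3.0]]))
    print(fuse_scores(-10.0, -20.0, alpha=0.9))
    print(cosine_distance([1.0, 0.0], [0.0, 1.0]))
    print(relative_improvement(25.0, 19.0))
```

In a bottom-up clustering loop, a distance such as `cosine_distance` would typically be evaluated for every pair of clusters, and the closest pair merged. The stopping criterion is outside the scope of this sketch.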
Peer Reviewed
-UPC subject areas::Telecommunications engineering::Signal processing::Speech and acoustic signal processing
-Automatic speech recognition
-Cosine-distance
-Fusion
-GNE
-i-Vector
-Jitter
-PLDA
-Prosody
-Segmentation
-Clustering
-Shimmer
Attribution-NonCommercial-NoDerivs 3.0 Spain
http://creativecommons.org/licenses/by-nc-nd/3.0/es/
Article - Published version
Article
         
