2026-04-17T05:12:34Z
https://recercat.cat/oai/request

oai:recercat.cat:10230/70025
2025-12-24T08:36:23Z
com_2072_6
col_2072_452952

00925njm 22002777a 4500
dc

Montesinos García, Juan Felipe (author)
Kadandale, Venkatesh S. (author)
Haro Ortega, Gloria (author)

2025-03-27T07:24:43Z
2025-03-27T07:24:43Z
2022

This paper presents an audio-visual approach for voice separation which produces state-of-the-art results at a low latency in two scenarios: speech and singing voice. The model is based on a two-stage network. Motion cues are obtained with a lightweight graph convolutional network that processes face landmarks. Then, both audio and motion features are fed to an audio-visual transformer which produces a fairly good estimation of the isolated target source. In a second stage, the predominant voice is enhanced with an audio-only network. We present different ablation studies and comparison to state-of-the-art methods. Finally, we explore the transferability of models trained for speech separation in the task of singing voice separation. The demos, code, and weights are available in https://ipcv.github.io/VoViT/.

We acknowledge support by MICINN/FEDER UE project PID2021-127643NB-I00; H2020-MSCA-RISE-2017 project 777826 NoMADS. J.F.M. acknowledges support by FPI scholarship PRE2018-083920. We acknowledge NVIDIA Corporation for the donation of GPUs used for the experiments.

Audio-visual
Source separation
Speech
Singing voice

VoViT: low latency graph-based audio-visual voice separation transformer