Audiovisual speaker indexing for Web-TV automations
Authors:
Highlights:
• Multimodal speaker indexing for video streaming, Web-TV, and dataset annotation.
• Audiovisual Voice Activity Detection, Face/Mouth Tracking, and Sound Localization.
• Modalities are used in parallel to improve accuracy and prevent error transfer.
• Speaker localization results form the core of a broadcasting automation framework.
• A small dataset is required for model fine-tuning and adaptation.
Keywords: Speaker detection, Voice activity detection, Sound localization, Multimodal information fusion
Review history: Received 17 February 2020, Revised 26 August 2021, Accepted 29 August 2021, Available online 7 September 2021, Version of Record 9 September 2021.
Official paper URL: https://doi.org/10.1016/j.eswa.2021.115833