Time-domain Separation Priority Pipeline-based Cascaded Multi-task Learning for Monaural Noisy and Reverberant Speech Separation

Shaoxiang Dang, Nagoya University, Japan (dang.shaoxiang.s0@s.mail.nagoya-u.ac.jp); Tetsuya Matsumoto, Nagoya University, Japan; Yoshinori Takeuchi, Daido University, Japan; Hiroaki Kudo, Nagoya University, Japan
 
Suggested Citation
Shaoxiang Dang, Tetsuya Matsumoto, Yoshinori Takeuchi and Hiroaki Kudo (2025), "Time-domain Separation Priority Pipeline-based Cascaded Multi-task Learning for Monaural Noisy and Reverberant Speech Separation", APSIPA Transactions on Signal and Information Processing: Vol. 14: No. 1, e23. http://dx.doi.org/10.1561/116.20250022

Publication Date: 28 Aug 2025
© 2025 S. Dang, T. Matsumoto, Y. Takeuchi and H. Kudo
 
Subjects
Audio signal processing, Enhancement, Source separation, Signal reconstruction, Deep learning
 
Open Access

This article is published under the terms of the Creative Commons Attribution-NonCommercial (CC BY-NC) license.

In this article:
Introduction 
Problem Formulation and Related Works 
Proposed Methods 
Experiments 
Results 
Conclusion 
References 

Abstract

Monaural speech separation is a crucial task in speech processing that aims to separate single-channel audio containing multiple speakers into individual streams. The problem is particularly challenging in noisy and reverberant environments, where the target information is obscured. Cascaded multi-task learning decomposes a complex task into simpler sub-tasks and leverages additional information for step-by-step learning, making it an effective approach for integrating multiple objectives. However, its sequential nature often leads to over-suppression, which degrades the performance of downstream modules. This article makes three main contributions. First, we propose a separation-priority pipeline that protects the critical separation sub-task from over-suppression. Second, to extract deeper multi-scale features, we design a consistent-stride deep encoder-decoder structure combined with depth-wise multi-receptive-field fusion. Third, we advocate a training strategy that pre-trains each sub-task and then applies time-varying and time-invariant weighted fine-tuning to further mitigate over-suppression. We evaluate our methods on the open-source Libri2Mix dataset and the real-world LibriCSS dataset. Experimental results across diverse metrics demonstrate that each proposed innovation improves overall model performance.
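The third contribution above combines fixed and schedule-dependent loss weights during fine-tuning. As a minimal sketch of that general idea, the snippet below combines a time-invariant weight with a time-varying ramp per sub-task; all names, values, and the linear schedule itself are assumptions for illustration, not the paper's actual formulation.

```python
# Hypothetical sketch: each sub-task in the cascade (e.g. denoising,
# dereverberation, separation) gets a loss weight that is the sum of a
# fixed (time-invariant) term and a ramp (time-varying) term, so the
# emphasis can shift toward the separation objective as fine-tuning
# progresses. The linear schedule is an assumed placeholder.

def task_weight(epoch, total_epochs, w_static, w_dynamic_max):
    """Time-invariant weight plus a linearly growing time-varying part."""
    progress = min(epoch / total_epochs, 1.0)
    return w_static + w_dynamic_max * progress

def cascaded_loss(sub_losses, epoch, total_epochs, static_w, dynamic_w):
    """Weighted sum of per-sub-task losses for one fine-tuning step."""
    total = 0.0
    for name, loss in sub_losses.items():
        total += task_weight(epoch, total_epochs,
                             static_w[name], dynamic_w[name]) * loss
    return total

# Example: three cascaded sub-tasks at the midpoint of fine-tuning.
# Negative dynamic weights de-emphasize upstream tasks over time, while
# the separation loss gains weight.
losses = {"denoise": 0.8, "dereverb": 0.6, "separate": 1.2}
static_w = {"denoise": 0.2, "dereverb": 0.2, "separate": 1.0}
dynamic_w = {"denoise": -0.1, "dereverb": -0.1, "separate": 0.5}
print(round(cascaded_loss(losses, 5, 10, static_w, dynamic_w), 4))  # → 1.71
```

In a real cascade the `loss` values would come from each sub-module's objective (e.g. SI-SDR terms) on the current batch; the weighting logic itself is independent of the loss choice.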

DOI: 10.1561/116.20250022