Name: Multi-Stage Real-Time Music Source Separation Using Random and Aligned Mixing with individualized GAN Loss
Start: 2026-07-31T10:50:00-0400
End: 2026-07-31T11:15:00-0400

Schedule as of May 2026 - subject to change

Default Time Zone is EDT - Eastern Daylight Time

Multi-Stage Real-Time Music Source Separation Using Random and Aligned Mixing with individualized GAN Loss

Friday July 31, 2026 10:50am - 11:15am EDT

Hall C

As intelligent cockpits develop, music source separation (MSS) is increasingly used in automotive audio, where a single complex sound mixture cannot meet users’ diverse needs. In karaoke, it separates vocals from accompaniment for humming along and enables independent volume adjustment of male and female vocals in duets. For in-vehicle audio up-mixing, extracted stems are used to reconfigure stereo mixes into cockpit-optimized multichannel layouts. For real-time rendering, specific tracks must be enhanced to adapt to cabin noise and user preferences. However, sources other than vocals, bass, and drums have received little research attention for real-time automotive applications. This paper proposes targeted optimizations of data augmentation and model structure for the target tracks (guitar, piano, and male/female lead/backing vocals): a parallel single-track model is used for piano and guitar separation, and a two-stage model for male/female voice separation (first separating general vocals, then splitting them into lead and backing vocals). A combined "Random Mixing" and "Aligned Mixing" method adapts the model to the harmonic overlap found in real songs. For the loss function, in addition to a time-domain L1 loss and a multi-scale STFT loss, a GAN-based training procedure with an individualized discriminator for each instrument stem improves audio quality and separation accuracy. The training data combines MedleyDB, MoisesDB, and 3,000 private songs. To enlarge the time-dimension receptive field of the real-time causal model, a modified SCNet with dilated convolutions and source-based band split is adopted. The model achieves 64 ms latency with 4.36M parameters, and its SDR values (6 dB for piano, 5.6 dB for guitar, 8.9 dB for male vocals, 7.6 dB for female vocals) outperform those of SOTA models such as DTTNet and SCNet.
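For readers unfamiliar with the quantities named in the abstract, the following is a minimal NumPy sketch of a time-domain L1 loss combined with a multi-scale STFT magnitude loss, plus a plain signal-to-distortion ratio (SDR) in dB. It is an illustration under stated assumptions, not the authors' implementation: the FFT sizes, hop lengths, Hann window, loss weighting, and the function names (`stft_mag`, `multi_scale_stft_loss`, `reconstruction_loss`, `sdr_db`) are all hypothetical, the GAN term is omitted, and the paper's SDR evaluation protocol is not stated in the abstract.

```python
import numpy as np

def stft_mag(x, n_fft, hop):
    """Magnitude STFT of a 1-D signal with a Hann window (no padding)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=-1))

def multi_scale_stft_loss(est, ref, fft_sizes=(512, 1024, 2048)):
    """Mean L1 distance between magnitude spectrograms, averaged over
    several STFT resolutions (the sizes here are assumed, not from the paper)."""
    losses = []
    for n_fft in fft_sizes:
        hop = n_fft // 4  # assumed 75% overlap
        diff = stft_mag(est, n_fft, hop) - stft_mag(ref, n_fft, hop)
        losses.append(np.mean(np.abs(diff)))
    return float(np.mean(losses))

def reconstruction_loss(est, ref, stft_weight=1.0):
    """Time-domain L1 loss plus a weighted multi-scale STFT loss.
    The weight is an assumed hyperparameter; the GAN term is not modeled."""
    l1 = float(np.mean(np.abs(est - ref)))
    return l1 + stft_weight * multi_scale_stft_loss(est, ref)

def sdr_db(est, ref, eps=1e-10):
    """Basic SDR: 10 * log10(||ref||^2 / ||ref - est||^2), in dB."""
    num = np.sum(ref ** 2) + eps
    den = np.sum((ref - est) ** 2) + eps
    return float(10.0 * np.log10(num / den))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ref = rng.standard_normal(8192)                  # stand-in for a target stem
    est = ref + 0.1 * rng.standard_normal(8192)      # stand-in for a model output
    print("loss:", reconstruction_loss(est, ref))
    print("SDR (dB):", sdr_db(est, ref))
```

In this formulation a higher SDR means less residual error relative to the target stem, while the loss falls to zero only when the estimate matches the reference in both the waveform and every spectrogram resolution.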

Speakers

Jianyuan Feng

Machine learning and deep learning in automotive audio applications, New Technologies for Automotive Audio

All Access 19 & 20 Paper Presentation

AES 2026 Automotive Audio Conference
