ICASSP 2020 - Highlights


This year ICASSP was supposed to be held in Barcelona, but unfortunately, due to the COVID-19 crisis, it had to be turned into a virtual conference. Though I missed the typical face-to-face interaction with fellow researchers, I really appreciate that the presentation videos remain available online beyond the period of the conference.

There were quite a few interesting works presented at ICASSP this year as well, and below I list my highlights based on my areas of interest: machine-learning-based methods for

Single-channel speech enhancement and separation

There has been a recent surge in work on time-domain speech separation, mainly due to the ideal-mask-surpassing performance and the flexible system design of the TasNet framework. Accordingly, time-domain speech separation/enhancement seemed to be one of the more popular topics at ICASSP.
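
For readers less familiar with the framework: at its core, TasNet replaces the STFT/iSTFT pair with a learned encoder and decoder and estimates source masks in the learned space. Below is a minimal sketch of that structure; the sizes and the one-layer masker are my own illustrative placeholders, not the configuration of any particular paper.

```python
import torch
import torch.nn as nn

class ToyTasNet(nn.Module):
    """Minimal TasNet-style skeleton: encoder -> masker -> decoder.
    All sizes and the single-conv masker are illustrative stand-ins."""

    def __init__(self, n_filters=512, kernel_size=16, stride=8, n_sources=2):
        super().__init__()
        self.n_sources = n_sources
        # Learned analysis transform replacing the STFT.
        self.encoder = nn.Conv1d(1, n_filters, kernel_size, stride=stride, bias=False)
        # Stand-in for the (much deeper) separation module,
        # e.g. the TCN in Conv-TasNet; predicts one mask per source.
        self.masker = nn.Sequential(
            nn.Conv1d(n_filters, n_filters * n_sources, 1),
            nn.Sigmoid(),
        )
        # Learned synthesis transform back to the waveform.
        self.decoder = nn.ConvTranspose1d(n_filters, 1, kernel_size, stride=stride, bias=False)

    def forward(self, mix):  # mix: (batch, samples)
        w = torch.relu(self.encoder(mix.unsqueeze(1)))                    # (B, F, T)
        masks = self.masker(w).view(mix.size(0), self.n_sources, -1, w.size(-1))
        est = [self.decoder(masks[:, s] * w) for s in range(self.n_sources)]
        return torch.cat(est, dim=1)                                      # (B, n_sources, samples)

mix = torch.randn(4, 16000)       # a batch of 1 s mixtures at 16 kHz
print(ToyTasNet()(mix).shape)     # torch.Size([4, 2, 16000])
```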

A few papers looked into the encoder-decoder design for the TasNet architecture:

All three works took quite different approaches to the encoder-decoder design, each inspired by different design considerations. It seems we will see this line of investigation continued in future works, also for related tasks.

While these methods focused on the design of the encoder-decoder, another work

took the approach of pre-training the representation-learning part separately, and showed improvements over the end-to-end training procedure adopted by TasNet-inspired networks.
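
As a rough illustration of such a two-stage recipe (my own sketch of the general idea, not the paper's actual procedure), one could first train the encoder/decoder pair as an autoencoder on clean speech and then freeze it while training the separation module, reusing the ToyTasNet skeleton from above:

```python
import torch
import torch.nn.functional as F

model = ToyTasNet()                                        # skeleton from the sketch above
clean_batches = [torch.randn(4, 16000) for _ in range(3)]  # stand-in for real data

# Stage 1: pre-train the representation (encoder/decoder) via reconstruction.
recon_opt = torch.optim.Adam(
    list(model.encoder.parameters()) + list(model.decoder.parameters()), lr=1e-3)
for wav in clean_batches:
    recon = model.decoder(torch.relu(model.encoder(wav.unsqueeze(1))))
    loss = F.mse_loss(recon.squeeze(1), wav)
    recon_opt.zero_grad()
    loss.backward()
    recon_opt.step()

# Stage 2: freeze the representation and train only the separation module.
for p in list(model.encoder.parameters()) + list(model.decoder.parameters()):
    p.requires_grad = False
sep_opt = torch.optim.Adam(model.masker.parameters(), lr=1e-3)
# ... separation training loop with a separation loss (e.g. SI-SNR) goes here.
```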

Another work

performed an in-depth analysis of the benefits of time-domain separation over frequency-domain methods by sequentially replacing components of a frequency-domain method to move towards the TasNet approach. The high time resolution and the time-domain loss seem to be the main sources of TasNet's performance gains. However, all the methods compared in the paper still struggle in reverberant environments.
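
For reference, the time-domain loss in question is typically the negative scale-invariant SNR (SI-SNR), computed directly on waveforms. A minimal sketch:

```python
import torch

def neg_si_snr(est, ref, eps=1e-8):
    """Negative scale-invariant SNR of waveform estimates.
    est, ref: (batch, samples); both are de-meaned first."""
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    # Project the estimate onto the reference (scale-invariant target).
    dot = (est * ref).sum(-1, keepdim=True)
    s_target = dot * ref / (ref.pow(2).sum(-1, keepdim=True) + eps)
    e_noise = est - s_target
    si_snr = 10 * torch.log10(
        s_target.pow(2).sum(-1) / (e_noise.pow(2).sum(-1) + eps) + eps)
    return -si_snr.mean()

print(neg_si_snr(torch.randn(2, 16000), torch.randn(2, 16000)))
```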

Conv-TasNet was also applied to the task of real-time single-channel speech enhancement

where the authors showed that even a small look-ahead (access to future time frames) in the range of 10-20 ms can really help denoising performance.
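
On the look-ahead point: in a convolutional streaming model, a small look-ahead can be realized with asymmetric padding, so the network sees a fixed number of future frames at the cost of exactly that much extra algorithmic latency. A minimal sketch (the layer and the numbers are illustrative, not from the paper):

```python
import torch
import torch.nn as nn

class LookAheadConv(nn.Module):
    """1-D conv that sees `future` future frames; future=0 is fully causal."""

    def __init__(self, channels, kernel_size=5, future=2):
        super().__init__()
        # Pad the past with (kernel_size - 1 - future) frames
        # and the future with `future` frames.
        self.pad = nn.ConstantPad1d((kernel_size - 1 - future, future), 0.0)
        self.conv = nn.Conv1d(channels, channels, kernel_size)

    def forward(self, x):  # x: (batch, channels, time)
        return self.conv(self.pad(x))

# With e.g. a 10 ms frame hop, future=2 corresponds to 20 ms of look-ahead.
x = torch.randn(1, 64, 100)
print(LookAheadConv(64)(x).shape)   # torch.Size([1, 64, 100])
```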

There were some other interesting works on real-time speech enhancement as well.

There were also interesting works on single-channel speech separation unrelated to the TasNet idea.

Multi-channel speech enhancement and separation

Time-domain methods also made their way into multi-channel approaches.

There was also an interesting paper from Amazon and collaborators.

Spatial-cue-preserving separation for hearing aids using time-domain convolutions was also presented

where a MIMO TasNet was used to obtain a stereo output for each source.
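
At the input/output level the idea can be pictured as follows (purely illustrative tensor shapes, not the actual model): a stereo mixture goes in, and each separated source comes out as a stereo pair, so the interchannel (spatial) cues of each source survive separation.

```python
import torch

batch, mics, samples, n_sources = 1, 2, 16000, 2
mixture = torch.randn(batch, mics, samples)                # stereo mixture in
separated = torch.randn(batch, n_sources, mics, samples)   # model output (stand-in)
left, right = separated[:, 0, 0], separated[:, 0, 1]       # stereo pair of source 0
print(mixture.shape, separated.shape)
```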

Parameter estimation

Our work at this year's ICASSP was on making direction-of-arrival (DOA) estimation signal-aware

where we showed a simple mechanism for including signal-awareness in an existing CNN-based multi-speaker DOA estimation method that can work in real time. Check out the Demo Videos.

Another interesting work on DOA estimation came from Amazon

where they formulated DOA estimation as a two-stage problem. In the first stage, the DOA range is divided into coarse sectors and the sector containing each source is estimated via classification. In the second stage, the accurate DOA of each source within its active sector is estimated via regression. The work operates directly on the raw waveform.
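
A toy sketch of such a two-stage head, where a coarse azimuth sector is classified and a fine offset is regressed within it; the input features, layer sizes, and sector count are all hypothetical stand-ins for whatever the raw-waveform front end produces:

```python
import torch
import torch.nn as nn

class TwoStageDOA(nn.Module):
    """Toy two-stage DOA head: sector classification + in-sector regression."""

    def __init__(self, feat_dim=256, n_sectors=8):
        super().__init__()
        self.sector_width = 360.0 / n_sectors
        self.classifier = nn.Linear(feat_dim, n_sectors)  # stage 1: coarse sector
        self.regressor = nn.Linear(feat_dim, n_sectors)   # stage 2: offset per sector

    def forward(self, feats):  # feats: (batch, feat_dim), from some front end
        sector_logits = self.classifier(feats)
        offsets = torch.sigmoid(self.regressor(feats))    # fraction of a sector, in [0, 1]
        sector = sector_logits.argmax(dim=-1)
        fine = offsets.gather(1, sector.unsqueeze(1)).squeeze(1)
        doa = (sector.float() + fine) * self.sector_width # azimuth in degrees
        # Train stage 1 with cross-entropy on the sector, stage 2 with a
        # regression loss (e.g. L1) on the offset within the active sector.
        return sector_logits, doa

logits, doa = TwoStageDOA()(torch.randn(4, 256))
print(doa)
```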

Other parameters of typical interest in an acoustic environment are the reverberation time (T60) and the direct-to-reverberant ratio (DRR). There were two interesting papers on this topic.
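
For context, T60 (the time for the sound energy to decay by 60 dB) is classically estimated from a measured room impulse response via Schroeder's backward energy integration. A minimal numpy sketch of that classical estimate, not taken from either paper:

```python
import numpy as np

def estimate_t60(rir, sample_rate):
    """Estimate T60 from a room impulse response using Schroeder's
    backward energy integration, fitting the -5 to -25 dB decay
    range and extrapolating to -60 dB (a T20-based estimate)."""
    energy = np.cumsum((rir ** 2)[::-1])[::-1]        # Schroeder decay curve
    edc_db = 10 * np.log10(energy / energy[0] + 1e-12)
    start = np.argmax(edc_db <= -5.0)
    stop = np.argmax(edc_db <= -25.0)
    t = np.arange(len(rir)) / sample_rate
    slope, _ = np.polyfit(t[start:stop], edc_db[start:stop], 1)  # dB / s
    return -60.0 / slope

# Sanity check: synthetic exponentially decaying "RIR" with T60 = 0.5 s.
sr = 16000
t = np.arange(sr) / sr
rir = np.random.randn(sr) * 10 ** (-3 * t / 0.5)      # 60 dB energy decay in 0.5 s
print(f"estimated T60 = {estimate_t60(rir, sr):.2f} s")
```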

One of the most interesting works for me at this ICASSP was this work on pitch estimation

  • Gfeller et al., Pitch Estimation Via Self-Supervision
    The authors propose to train a model for pitch estimation using mainly unlabelled samples, with only a small amount of labelled samples for some form of calibration. The main idea is well summarised in the abstract (a toy sketch follows the quote):

    The key to this is the observation that if one creates two examples from one original audio clip by pitch shifting both, the difference between the correct outputs is known, without even knowing the actual pitch value in the original clip. Somewhat surprisingly, this idea combined with an auxiliary reconstruction loss allows training a pitch estimation model.
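
As a toy illustration of that trick (my own sketch; `pitch_model` and `pitch_shift` are hypothetical stand-ins, and the paper's auxiliary reconstruction loss is omitted):

```python
import torch

def relative_pitch_loss(pitch_model, pitch_shift, clip, max_shift=12.0):
    """Self-supervised loss on one audio clip: shift it by two random
    amounts and supervise only the *difference* of the two pitch
    predictions, which is known exactly from the applied shifts."""
    k1 = (torch.rand(()) * 2 - 1) * max_shift   # shift in semitones
    k2 = (torch.rand(()) * 2 - 1) * max_shift
    p1 = pitch_model(pitch_shift(clip, k1))     # predicted pitch (semitone scale)
    p2 = pitch_model(pitch_shift(clip, k2))
    # True pitches are unknown, but their difference must equal k1 - k2.
    return ((p1 - p2) - (k1 - k2)).pow(2).mean()

# Dummy stand-ins so the sketch runs; replace with a real model and a
# real resampling-based pitch shifter.
pitch_model = lambda wav: wav.mean(dim=-1)
pitch_shift = lambda wav, k: wav                # no-op placeholder
print(relative_pitch_loss(pitch_model, pitch_shift, torch.randn(8, 16000)))
```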

Overall, things got really interesting this year with all the time-domain methods and some nice combinations of signal processing techniques with deep learning models.