The Euclidean distance between wavelet scattering transform coefficients (known as paths) provides informative gradients for perceptual quality assessment of deep inverse problems in computer vision, speech, and audio processing. However, these transforms are computationally expensive when employed as differentiable loss functions for stochastic gradient descent due to their numerous paths, which significantly limits their use in neural network training. To address this problem, we propose "Scattering transform with Random Paths for machine Learning" (SCRAPL): a stochastic optimization scheme for efficient evaluation of multivariable scattering transforms. We implement SCRAPL for the joint time–frequency scattering transform (JTFS), which demodulates spectrotemporal patterns at multiple scales and rates, allowing a fine characterization of intermittent auditory textures. We apply SCRAPL to differentiable digital signal processing (DDSP), specifically to unsupervised sound matching of a granular synthesizer and the Roland TR-808 drum machine. We also propose an initialization heuristic based on importance sampling, which adapts SCRAPL to the perceptual content of the dataset, improving neural network convergence and evaluation performance. We make our audio samples available and provide SCRAPL as a Python package.
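As a rough illustration of the random-path idea described above, the sketch below compares a loss summed over every path with a single-path estimate drawn at random. Everything in it is a toy assumption: the random linear filters stand in for scattering paths (they are not the JTFS), the function names are hypothetical, and the `1 / probs[i]` reweighting is a standard importance-sampling correction that may differ from the estimator used in the paper or the SCRAPL package.

```python
import numpy as np


def make_toy_paths(n_paths, n_samples, seed=0):
    """Hypothetical stand-in for scattering paths: one random linear filter per path."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((n_paths, n_samples)) / np.sqrt(n_samples)


def full_loss(x, y, paths):
    """Squared Euclidean distance over all paths (cost grows with the number of paths)."""
    diff = np.abs(paths @ x) - np.abs(paths @ y)
    return float(np.sum(diff ** 2))


def random_path_loss(x, y, paths, rng, probs=None):
    """Estimate full_loss from one randomly drawn path.

    `probs` is an optional sampling distribution over paths (uniform if omitted).
    Dividing by probs[i] keeps the estimate unbiased: its expectation over the
    draw of i equals full_loss. This reweighting is an assumption of the sketch.
    """
    n_paths = paths.shape[0]
    if probs is None:
        probs = np.full(n_paths, 1.0 / n_paths)
    i = rng.choice(n_paths, p=probs)
    # Only path i is evaluated, so the per-step cost is one path instead of n_paths.
    diff = abs(paths[i] @ x) - abs(paths[i] @ y)
    return float(diff ** 2 / probs[i])
```

In a training loop, one call to `random_path_loss` per gradient step would replace `full_loss`; a non-uniform `probs` is where a dataset-dependent importance-sampling heuristic could plug in.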
Figure 1: Mean average error (y-axis) versus computational cost (x-axis) of unsupervised sound matching models for the granular synthesis task. Both axes are rescaled by the performance of a supervised model with the same number of parameters. Whiskers denote 95% CI, estimated over 20 random seeds. Due to computational limitations, JTFS-based sound matching is evaluated only once.
Below, we provide listening examples for aligned (micro, Table 1) and unaligned (meso, Table 2) unsupervised sound matching for a DDSP synth implementation of the Roland TR-808 drum machine.
Please refer to Sections 4.4 and 5.3 of the paper for details.
There are four different drum types: snare drum (SD), bass drum (BD), tom (Tom), and hi-hat (HH).
We note that this particular DDSP synthesizer implementation has been designed to only generate snare samples, so some bass drum, tom, and especially hi-hat analog target samples may go beyond the limitations of this synthesizer architecture, resulting in perceptually dissimilar reconstruction.
Tables can be randomized by clicking the button beneath them.
Loudness peak normalization (applied to original samples individually as a post-processing step) can be toggled on and off.
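For readers who want to reproduce this step offline, peak normalization of a single clip can be sketched as follows. The target peak of 1.0, the function name, and the handling of silent clips are assumptions of this sketch; the page does not state the exact settings it uses.

```python
import numpy as np


def peak_normalize(audio, peak=1.0, eps=1e-12):
    """Scale one clip so that its largest absolute sample equals `peak`.

    `peak` and `eps` are illustrative defaults, not values taken from the page.
    """
    audio = np.asarray(audio, dtype=np.float64)
    max_abs = np.max(np.abs(audio))
    if max_abs < eps:
        # A silent clip has no peak to scale; return it unchanged.
        return audio.copy()
    return audio * (peak / max_abs)
```

Applying the function to each clip separately matches the "individually" wording above: every sample is scaled by its own peak rather than by a gain shared across the table.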
Table 1: Aligned (micro) unsupervised Roland TR-808 sound matching listening samples.
| Idx | Drum Type | Target | JTFS | SCRAPL | SCRAPL (no θ-IS) | MSS Linear | MSS Log + Lin. | MSS Random | MSS Revisited |
|---|---|---|---|---|---|---|---|---|---|
| 42 | SD | | | | | | | | |
| 94 | BD | | | | | | | | |
| 21 | Tom | | | | | | | | |
| 7 | HH | | | | | | | | |
Table scrolls horizontally if space is limited.
Table 2: Unaligned (meso) unsupervised Roland TR-808 sound matching listening samples.
| Idx | Drum Type | Target | JTFS | SCRAPL | SCRAPL (no θ-IS) | MSS Linear | MSS Log + Lin. | MSS Random | MSS Revisited |
|---|---|---|---|---|---|---|---|---|---|
| 22 | SD | | | | | | | | |
| 67 | BD | | | | | | | | |
| 53 | Tom | | | | | | | | |
| 57 | HH | | | | | | | | |