Abstract:
Although Coordinate-MLP-based implicit neural representations have excelled in representing radiance
fields, 3D shapes, and images, their application to audio signals remains underexplored. To fill this
gap, we investigate existing implicit neural representations, from which we extract 3 types of
positional encoding and 16 commonly used activation functions. Through combinatorial design, we
establish the first benchmark for Coordinate-MLPs in audio signal representations. Our benchmark reveals
that Coordinate-MLPs require complex hyperparameter tuning and frequency-dependent initialization,
limiting their robustness. To address these issues, we propose Fourier-ASR, a novel framework based on
the Fourier series theorem and the Kolmogorov-Arnold representation theorem. Fourier-ASR introduces
Fourier Kolmogorov-Arnold Networks (Fourier-KAN), which leverage periodicity and strong nonlinearity to
represent audio signals, eliminating the need for additional positional encoding. Furthermore, a
Frequency-adaptive Learning Strategy (FaLS) is proposed to enhance the convergence of Fourier-KAN by
capturing high-frequency components and preventing overfitting of low-frequency signals. Extensive
experiments conducted on natural speech and music datasets reveal that: (a) well-designed positional
encoding and activation functions in Coordinate-MLPs can effectively improve audio representation
quality; and (b) Fourier-ASR can robustly represent complex audio signals without extensive
hyperparameter tuning. Looking ahead, the continuity and infinite resolution of implicit audio
representations make our research highly promising for tasks such as audio compression, synthesis, and
generation. The source code will be released publicly to ensure reproducibility.
In contrast to local feature-based representation, Neural Amplitude Fields, as a specific instance of implicit neural representations in audio signals, utilize time coordinates as inputs to regress the corresponding amplitudes, thereby encoding the audio signal within the weights of a neural network. This parameterized representation is not only continuously differentiable but also decoupled from spatial resolution, allowing for precise processing of audio signals at any resolution. It holds potential for applications in audio denoising, synthesis, generation, and other related fields.
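The idea of regressing amplitudes from time coordinates can be illustrated with a minimal sketch. The code below is not the authors' implementation: the helper names `init_mlp` and `amplitude_field`, the layer sizes, the sine activation, and the `omega0` scale are all assumptions chosen for illustration, and the network is left untrained. It only shows the interface of such a field, namely that the same weights can be queried on any sampling grid.

```python
import numpy as np

def init_mlp(layer_sizes, seed=0):
    """Random weights for a small coordinate MLP (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    params = []
    for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        w = rng.uniform(-1.0, 1.0, size=(fan_in, fan_out)) / np.sqrt(fan_in)
        b = np.zeros(fan_out)
        params.append((w, b))
    return params

def amplitude_field(params, t, omega0=30.0):
    """Map time coordinates t (shape [n]) to amplitudes (shape [n])."""
    h = np.asarray(t, dtype=np.float64).reshape(-1, 1)
    for w, b in params[:-1]:
        # Sine activation is one assumed choice among many possible ones.
        h = np.sin(omega0 * (h @ w + b))
    w, b = params[-1]
    return (h @ w + b).ravel()  # linear output layer, one amplitude per time

params = init_mlp([1, 32, 32, 1])

# Query the same 10 ms window on two different sampling grids.
t_8k = np.arange(80) / 8000.0     # 8 kHz grid
t_16k = np.arange(160) / 16000.0  # 16 kHz grid
a_8k = amplitude_field(params, t_8k)
a_16k = amplitude_field(params, t_16k)
```

Because the signal lives in the weights rather than in a sample buffer, every second sample of `a_16k` coincides with `a_8k`; in a trained field, the finer grid would simply read out the same continuous function more densely.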
Although Coordinate-MLP-based implicit neural representations have excelled in representing radiance fields, 3D shapes, and images, their application to audio signals remains underexplored. To fill this gap, we investigate existing implicit neural representations, from which we extract 3 types of positional encoding and 16 commonly used activation functions. Through combinatorial design, we establish the first benchmark for Coordinate-MLPs in audio signal representations.
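The two ingredients of the benchmark, a positional encoding and an activation function, can be sketched as follows. The source does not define the encodings, so this sketch assumes that RFF denotes random Fourier features with Gaussian-sampled frequencies and that NeFF denotes a Fourier mapping with power-of-two frequencies; the function names, the bandwidth `sigma`, the number of bands, and the exact forms of the Gaussian and sine activations are likewise illustrative assumptions rather than the benchmark's settings.

```python
import numpy as np

def rff_encoding(t, num_features=16, sigma=10.0, seed=0):
    """Random Fourier features: frequencies drawn from N(0, sigma^2) (assumed form)."""
    rng = np.random.default_rng(seed)
    b = rng.normal(0.0, sigma, size=num_features)
    phase = 2.0 * np.pi * np.outer(t, b)            # [n, num_features]
    return np.concatenate([np.cos(phase), np.sin(phase)], axis=1)

def power_of_two_encoding(t, num_bands=8):
    """Deterministic Fourier mapping with frequencies 2^k * pi (assumed form)."""
    freqs = (2.0 ** np.arange(num_bands)) * np.pi
    phase = np.outer(t, freqs)                      # [n, num_bands]
    return np.concatenate([np.sin(phase), np.cos(phase)], axis=1)

def gaussian_activation(x, a=1.0):
    """One common Gaussian-type activation; the width a is a tunable assumption."""
    return np.exp(-0.5 * (x / a) ** 2)

def sine_activation(x, omega0=30.0):
    """One common sine-type activation; omega0 is a tunable assumption."""
    return np.sin(omega0 * x)

t = np.linspace(-1.0, 1.0, 5)       # normalized time coordinates
rff = rff_encoding(t)               # shape [5, 32]
p2 = power_of_two_encoding(t)       # shape [5, 16]
hidden = gaussian_activation(rff)   # example pairing: RFF with a Gaussian activation
```

Each encoding lifts the scalar time coordinate into a vector of sinusoids before the first linear layer, and each activation introduces its own scale parameter, which gives a sense of why the benchmark involves a sizeable hyperparameter search.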
Conclusions:
RFF is better suited to Gaussian-type activation functions (~3 dB ↑ in SNR). Conversely, NeFF employs
Fourier mappings, which are more compatible with Sine-type activation functions (~9 dB ↑ in SNR).
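For reference, the SNR figures quoted here can be read against the standard definition of signal-to-noise ratio in decibels. The sketch below assumes that definition (reference power over reconstruction-error power); the source does not state its exact formula, and the function name `snr_db` and the test tone are illustrative.

```python
import numpy as np

def snr_db(reference, estimate):
    """SNR in dB: 10 * log10(signal power / reconstruction-error power)."""
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    signal_power = np.sum(reference ** 2)
    noise_power = np.sum((reference - estimate) ** 2)
    return 10.0 * np.log10(signal_power / noise_power)

t = np.arange(16000) / 16000.0              # one second at 16 kHz
clean = np.sin(2.0 * np.pi * 440.0 * t)     # illustrative 440 Hz tone

snr_silent = snr_db(clean, np.zeros_like(clean))  # error power equals signal power
snr_scaled = snr_db(clean, 0.9 * clean)           # error amplitude is 10% of the signal
```

Under this definition, every 10 dB of improvement corresponds to a tenfold reduction in reconstruction-error power, so a gain of about 3 dB roughly halves the error power.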
[Figure panels annotated with SNR (dB), in source order: 13.36, 39.02, 42.39, 15.98, 38.10, 41.40, 6.35, 20.85, 19.68, 12.04, 15.57, 15.26, 6.38, 20.86, 19.69, 0.00, 15.62, 22.29, 7.96, 13.06, 33.58, 8.16, 12.86, 32.24, 0.74, 12.14, 9.20, 1.34, 10.97, 8.67, 0.75, 12.44, 9.20, -7.66, 4.93, 9.57]
To avoid spectral bias from positional encoding and complex parameter tuning of activation functions, we propose a novel audio signal representation framework, Fourier-ASR, based on the Fourier series theorem and the Kolmogorov-Arnold representation theorem. Fourier-ASR includes Fourier Kolmogorov-Arnold Networks (Fourier-KAN) and a Frequency-adaptive Learning Strategy (FaLS). Due to the periodicity and strong nonlinearity of Fourier basis functions, Fourier-ASR can effectively represent audio signals and provide enhanced interpretability.
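A rough sketch may help convey why Fourier basis functions remove the need for a separate positional encoding. The layer below follows a commonly used Fourier-KAN formulation, in which each input-output edge carries a learnable truncated Fourier series; whether this matches the authors' exact layer is an assumption, and FaLS is not reproduced. The names `fourier_basis` and `fourier_kan_layer`, the toy target, and the closed-form least-squares fit (which only applies to this single-layer, one-dimensional case and is not the training procedure of the source) are illustrative choices.

```python
import numpy as np

def fourier_basis(x, num_freqs):
    """Features cos(k x), sin(k x) for k = 1..K per input dim: [n, d_in, 2K]."""
    k = np.arange(1, num_freqs + 1)
    phase = x[:, :, None] * k                      # [n, d_in, K]
    return np.concatenate([np.cos(phase), np.sin(phase)], axis=-1)

def fourier_kan_layer(x, coeffs, bias):
    """x: [n, d_in]; coeffs: [d_in, 2K, d_out]; bias: [d_out] -> [n, d_out]."""
    num_freqs = coeffs.shape[1] // 2
    feats = fourier_basis(x, num_freqs)            # [n, d_in, 2K]
    # Sum the per-edge Fourier series over all inputs and frequencies.
    return np.einsum("nik,iko->no", feats, coeffs) + bias

# Toy "audio" target on one period: three sinusoids of increasing frequency.
n = 512
x = np.linspace(-np.pi, np.pi, n, endpoint=False).reshape(-1, 1)
target = (0.6 * np.sin(3 * x[:, 0])
          + 0.3 * np.cos(17 * x[:, 0])
          + 0.1 * np.sin(40 * x[:, 0]))

# With one input and one output the layer is linear in its coefficients,
# so this toy case can be fitted in closed form by least squares.
num_freqs = 64
design = fourier_basis(x, num_freqs).reshape(n, -1)   # [n, 2K]
design = np.hstack([design, np.ones((n, 1))])         # extra column for the bias
solution, *_ = np.linalg.lstsq(design, target, rcond=None)

coeffs = solution[:-1].reshape(1, 2 * num_freqs, 1)
bias = solution[-1:]
recon = fourier_kan_layer(x, coeffs, bias)[:, 0]
max_error = np.max(np.abs(recon - target))
```

In this toy setting the raw coordinate is passed straight into the layer, and the fitted coefficients can be read directly as the amplitudes of individual frequencies, which is one way to interpret the enhanced interpretability claimed for Fourier basis functions.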
Conclusions:
Two combinations, RFF+Gaussian and NeFF+Sine, address these challenges. Both significantly enhance the
ability of Coordinate-MLPs to represent audio signals. On the GTZAN dataset, these methods
improve the SNR by 10.15 dB ↑ and 12.28 dB ↑, respectively. On the CSTR VCTK dataset, the SNR
improvements are 10.40 dB ↑ and 14.04 dB ↑, respectively.
SNR per audio clip (three figure panels each, in source order):
Bach: 20.85 dB, 42.39 dB, 33.14 dB
Counting: 12.14 dB, 33.58 dB, 20.10 dB
Blues (GTZAN): 11.80 dB, 22.02 dB, 13.80 dB
Classical (GTZAN): 10.76 dB, 25.95 dB, 15.05 dB
NorthernIrish (VCTK): 16.19 dB, 19.59 dB, 17.12 dB
NewZealand (VCTK): 13.32 dB, 16.87 dB, 15.79 dB