Aliasing-Free Neural Audio Synthesis

Abstract

Neural vocoders and codecs reconstruct waveforms from acoustic representations, a step that directly determines the final audio quality. Among existing methods, upsampling-based time-domain models are superior in both inference speed and synthesis quality, achieving state-of-the-art performance. Still, despite their success in producing perceptually natural sound, their synthesis fidelity remains limited by aliasing artifacts introduced by inadequately designed model architectures. In particular, the unconstrained nonlinear activation generates an infinite number of harmonics that exceed the Nyquist frequency, resulting in "folded-back" aliasing artifacts. The widely used upsampling layer, ConvTranspose, copies mirrored low-frequency components to fill the empty high-frequency region, resulting in "mirrored" aliasing artifacts. Meanwhile, the combination of its inherent periodicity and the mirrored DC bias also brings the "tonal artifact," resulting in constant-frequency ringing. This paper aims to solve these issues from a signal processing perspective. Specifically, we apply oversampling and anti-derivative anti-aliasing to the activation function to obtain its anti-aliased form, and replace the problematic ConvTranspose layer with resampling to avoid the "tonal artifact" and eliminate aliased components. Based on the proposed anti-aliased modules, we introduce Pupu-Vocoder and Pupu-Codec, and release high-quality pre-trained checkpoints to facilitate audio generation research. We build a test-signal benchmark to illustrate the effectiveness of the anti-aliased modules, and conduct experiments on speech, singing voice, music, and audio to validate the proposed models. Experimental results confirm that our lightweight Pupu-Vocoder and Pupu-Codec models can readily outperform existing systems on singing voice and music, while achieving comparable performance on speech and audio.

Tutorial

What is Aliasing?

In signal processing, aliasing is a phenomenon in which a reconstructed signal contains frequency components that are not present in the original signal. Aliasing occurs due to the constraint of discrete sampling. According to the Fourier Theorem, any periodic signal can be represented as a sum of sinusoidal components with different frequencies, amplitudes, and phases. Suppose we sample at a fixed frequency. When the frequency of a sinusoidal component is below the Nyquist frequency, which is half of the sampling frequency, we can accurately reconstruct the original signal from the discretely sampled points; when it exceeds the Nyquist frequency, we cannot: instead, the component is reconstructed as a sinusoid with a lower frequency that is not present in the original signal. This phenomenon is known as aliasing, as illustrated above.
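As a concrete illustration, the short sketch below (the sampling rate and tone frequencies are arbitrary choices for this example) shows that a sinusoid above the Nyquist frequency produces exactly the same samples as a lower-frequency sinusoid, so no reconstruction can tell them apart.

```python
# A minimal aliasing demo: a 7 kHz sine sampled at 8 kHz (Nyquist = 4 kHz)
# yields the same sample values as a 1 kHz sine (up to sign), so any
# reconstruction will "hear" 1 kHz instead of 7 kHz.
import numpy as np

fs = 8_000                 # sampling frequency (Hz)
n = np.arange(fs)          # one second of sample indices
f_true = 7_000             # above the Nyquist frequency -> will alias
f_alias = fs - f_true      # the apparent frequency after aliasing (1 kHz)

x_true = np.sin(2 * np.pi * f_true * n / fs)
x_alias = np.sin(2 * np.pi * f_alias * n / fs)

print(np.allclose(x_true, -x_alias))   # True: identical samples up to sign
```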

Why does Aliasing Occur?

The existing state-of-the-art (SOTA) vocoder and codec systems are upsampling-based time-domain models. In particular, they take a spectrogram or a spectrogram-like latent representation as input, convert it into a multi-channel waveform template, and then repeatedly apply upsampling and channel fusion, interleaved with ConvBlocks and activations, until the template reaches the same temporal resolution as the target waveform, as illustrated in the figure above.
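For concreteness, here is a minimal, hypothetical sketch of this decoder pattern in PyTorch; the channel counts, strides, and kernel sizes are illustrative assumptions rather than the configuration of any particular system.

```python
# A simplified sketch of the generic upsampling-based time-domain decoder:
# repeated (activation -> ConvTranspose upsampling -> channel-fusion Conv1D).
import torch
import torch.nn as nn

class NaiveDecoder(nn.Module):
    def __init__(self, n_mels=80, channels=256, strides=(8, 8, 2, 2)):
        super().__init__()
        self.pre = nn.Conv1d(n_mels, channels, kernel_size=7, padding=3)
        blocks, ch = [], channels
        for s in strides:
            blocks += [
                nn.LeakyReLU(0.1),                        # unconstrained nonlinearity
                nn.ConvTranspose1d(ch, ch // 2,           # upsampling by stride s
                                   kernel_size=2 * s, stride=s, padding=s // 2),
                nn.Conv1d(ch // 2, ch // 2, kernel_size=7, padding=3),  # channel fusion
            ]
            ch //= 2
        self.blocks = nn.Sequential(*blocks)
        self.post = nn.Conv1d(ch, 1, kernel_size=7, padding=3)

    def forward(self, mel):                  # mel: (batch, n_mels, frames)
        return torch.tanh(self.post(self.blocks(self.pre(mel))))

wav = NaiveDecoder()(torch.randn(1, 80, 50))  # (1, 1, 50 * 8*8*2*2) = (1, 1, 12800)
```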

However, from a digital signal processing perspective, such a paradigm has inherent problems that cap the achievable model performance. This limitation lies in the architectural design itself and cannot be overcome by scaling up the dataset or the model capacity.

Let us discuss some of the significant artifacts brought by these inadequately designed model architectures below:

"Folded-Back" Aliasing

Firstly, the use of unconstrained nonlinear activation functions introduces an infinite number of harmonic components, and the harmonics whose frequencies exceed the Nyquist frequency become "folded-back" aliasing artifacts. Following Wavehax, we take the ReLU activation as an example. Suppose the input signal is a sine wave with angular frequency \(\omega\) over continuous time \(t \in \mathbb{R}\). After applying ReLU, the resulting signal's Fourier expansion becomes:

$$ \text{relu}(\sin(\omega t)) = \frac{1}{\pi} + \frac{\sin(\omega t)}{2} - \sum_{k=1}^{\infty} \frac{2\cos(2k \omega t)}{\pi(2k - 1)(2k + 1)}, $$
where the last term induces infinitely many harmonics. The components whose frequencies exceed the Nyquist frequency, i.e., \(\frac{k \omega}{\pi} > \frac{F_{N}}{2}\) with \(F_{N}\) denoting the sampling frequency, become aliasing artifacts, illustrated as the orange pitch contour in the figure above.
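This folding can be checked numerically. The sketch below (the sampling rate and input frequency are arbitrary choices) applies ReLU to a 3 kHz sine sampled at 16 kHz: the even harmonics at 12 kHz and 18 kHz exceed the 8 kHz Nyquist frequency and reappear as spurious peaks at 4 kHz and 2 kHz.

```python
# ReLU(sin) generates even harmonics; those above Nyquist fold back in-band.
import numpy as np

fs, f0 = 16_000, 3_000
t = np.arange(fs) / fs                            # one second of samples
y = np.maximum(np.sin(2 * np.pi * f0 * t), 0.0)   # ReLU(sin): half-wave rectification

spec = 2 * np.abs(np.fft.rfft(y)) / len(y)        # single-sided amplitude spectrum
freqs = np.fft.rfftfreq(len(y), d=1 / fs)

for f in (3_000, 6_000, 4_000, 2_000):
    k = np.argmin(np.abs(freqs - f))
    print(f"{f:>5} Hz: amplitude ~ {spec[k]:.3f}")
# 3 kHz (~0.5) and 6 kHz (~0.21) are genuine components of ReLU(sin);
# the peaks at 4 kHz and 2 kHz are folded-back images of the 12 kHz and
# 18 kHz harmonics, which do not exist in the original signal.
```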

"Mirrored" Aliasing

The widely used upsampling layer, ConvTranspose, is equivalent to interlacing zeros into the input signal and then applying channel-fusion convolutions. In the time-frequency domain, such a process copies mirrored low-frequency components to fill the empty high-frequency region. Since these mirrored components do not exist in the original signal, they become harsh noises that significantly degrade the signal. This phenomenon is known as the "mirrored" aliasing artifact, shown as the green pitch contour in the figure above.
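The following sketch (a single 1 kHz test tone at arbitrary rates) makes this concrete: interlacing zeros for 2x upsampling duplicates the tone as a mirrored image of equal magnitude in the newly created high-frequency band.

```python
# Zero-interlacing duplicates the spectrum: a 1 kHz tone gains a 7 kHz image.
import numpy as np

fs, f0, n = 8_000, 1_000, 8_000
x = np.sin(2 * np.pi * f0 * np.arange(n) / fs)    # 1 kHz tone at 8 kHz

up = np.zeros(2 * n)
up[::2] = x                                       # zero-interlacing (2x upsampling)

spec = np.abs(np.fft.rfft(up)) / n                # spectrum at the new 16 kHz rate
freqs = np.fft.rfftfreq(2 * n, d=1 / (2 * fs))

for f in (1_000, 7_000):
    k = np.argmin(np.abs(freqs - f))
    print(f"{f} Hz: amplitude ~ {spec[k]:.2f}")
# Both print ~0.5: alongside the original 1 kHz tone, a mirrored 7 kHz image
# of equal magnitude appears in the new high-frequency band, which the
# following learned convolutions are expected (but not guaranteed) to remove.
```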

"Tonal Artifact"

The ConvTranspose layer also suffers from the "tonal artifact," which results in constant-frequency ringing, shown as the purple contour in the figure above. This phenomenon originates from two sources. Firstly, the DC bias introduced by the nonlinear activation functions or by the bias parameters of the convolution layers (pink contour in the figure above) is mirrored into the high-frequency bands; meanwhile, the computation of ConvTranspose has an inherent periodicity due to its fixed stride and shared weights, introducing constant-frequency ringing at the same locations as the mirrored DC bias. These two sources combine and generate "stationary wave"-like harsh noises. For a more detailed analysis of this problem, please refer to this paper and this paper.

"Filter Artifact"

To address the "tonal artifact", previous work proposes using linear and nearest interpolation layers as alternatives, since they do not exhibit inherent periodicity and their operations are equivalent to low-pass filtering, which can simultaneously remove the mirrored DC bias in the high-frequency region. However, such a replacement does not effectively eliminate the "mirrored" aliasing artifacts, and it will also introduce the "filter artifact" due to their poor filter frequency responses, resulting in quality degradation. The above figure illustrates the comparison between the ideal and equivalent filter frequency responses regarding the two interpolation layers. It can be observed that the frequency responses of these interpolation layers deviate significantly from the ideal one. In particular, the slow roll-off in the pass-band causes an attenuation of the valid frequency region that should be preserved, represented by the red hatched regions. Meanwhile, the insufficient suppression in the stop-band fails to eliminate the "mirrored" aliasing artifacts, indicated by the blue hatched regions. This phenomenon, where the filter fails to preserve the valid frequency region while incompletely removing aliasing artifacts, is known as the "filter artifact."

Methodology

To achieve higher synthesis fidelity, this work introduces anti-aliased, artifact-free activation and upsampling modules, as discussed below:

Anti-Aliased Activation Function

We use the oversampling technique and apply anti-derivative anti-aliasing (ADAA) to the activation function to obtain its anti-aliased form.

Oversampling

Oversampling is a technique that temporarily increases the Nyquist frequency before applying the nonlinear activation function; the signal is then low-pass filtered to remove the unwanted frequency components and downsampled back to the original rate, as illustrated in the figure above. To achieve a good trade-off between quality and efficiency, our proposed models use an oversampling factor of two.
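A minimal sketch of this procedure, assuming a factor of two, scipy's polyphase resampler, and tanh as a stand-in nonlinearity, looks as follows:

```python
# Oversampled activation: upsample -> nonlinearity -> low-pass -> downsample.
import numpy as np
from scipy.signal import resample_poly

def oversampled_activation(x, act=np.tanh, factor=2):
    """Apply `act` at `factor`-times the input rate, then low-pass and decimate back."""
    x_up = resample_poly(x, up=factor, down=1)     # raise the Nyquist frequency
    y_up = act(x_up)                               # harmonics now have extra headroom
    return resample_poly(y_up, up=1, down=factor)  # low-pass filter + downsample back

fs = 16_000
x = np.sin(2 * np.pi * 3_000 * np.arange(fs) / fs)
y_naive = np.tanh(x)                # harmonics above 8 kHz fold back into the band
y_anti = oversampled_activation(x)  # components between 8 and 16 kHz are filtered
                                    # out before decimation, greatly reducing aliasing
```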

Anti-Derivative Anti-Aliasing (ADAA)

Anti-derivative anti-aliasing (ADAA) converts the discrete signal to a continuous one before applying the activation function, thereby overcoming the sampling frequency constraint and avoiding the aliasing artifacts. The activated continuous signal can then be low-pass filtered to remove the out-of-band components and discretely resampled back. To implement ADAA, suppose we have a discrete signal \(x\) with samples \(x_0, \dots, x_n\); its continuous version \(\widetilde{x}\), defined for \(t \in [0, n]\), can be obtained via:

$$ \widetilde{x}(t) = \begin{cases} x_1 + \tau (x_0 - x_1), & \text{if } 0 \le |t| < 1 \\ \vdots & \\ x_n + \tau (x_{n - 1} - x_n), & \text{if } n - 1 \le |t| < n \end{cases} $$
where \(\tau = 1 - (t \bmod 1)\) is a time variable that runs from \(1\) to \(0\) between consecutive samples. Applying the activation \(f(\cdot)\) to the signal \(\widetilde{x}\), followed by low-pass filtering with a filter kernel \(h(\cdot)\) and discrete resampling, gives the following:
$$ y_t = \int_{-\infty}^{\infty}h(u)f(\widetilde{x}(t - u))du, $$
where \(h(\cdot)\) is a rectangular filter kernel with unit width:
$$ h(t) = \begin{cases} 1, & \text{if } 0 \leq t \leq 1 \\ 0, & \text{otherwise} \\ \end{cases} $$
Following the derivations in the original paper, the integral in the above equation can be reduced to a closed-form expression as follows:
$$ y_t = \frac{F(x_t) - F(x_{t - 1})}{x_t - x_{t - 1}}, $$
where \(F(\cdot)\) is the first-order anti-derivative of the activation function, and \(y\) is the output signal. Following BigVGAN and DAC, we use SnakeBeta as the activation function to exploit its periodic nature, which can be written as:
$$ f(x) = x + \frac{\sin(\alpha x)^2}{\beta}, $$
Applying ADAA to this activation gives the following closed form:
$$ y_t = \frac{1}{2\beta} + \frac{x_t + x_{t - 1}}{2} - \frac{\cos(\alpha(x_t + x_{t - 1}))\text{sinc}(\alpha(x_t - x_{t - 1}))}{2\beta}, $$
The detailed derivation and an analysis of its gradient stability can be found in our paper.
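For reference, below is a minimal numpy sketch of this closed form; \(\alpha\) and \(\beta\) are treated as scalars here, whereas in the actual models they are learnable per-channel parameters, and the first sample is simply repeated to form \(x_{t-1}\).

```python
# First-order ADAA SnakeBeta using the closed-form expression above.
import numpy as np

def snakebeta(x, alpha=1.0, beta=1.0):
    """Plain SnakeBeta: f(x) = x + sin(alpha*x)^2 / beta."""
    return x + np.sin(alpha * x) ** 2 / beta

def adaa_snakebeta(x, alpha=1.0, beta=1.0):
    """ADAA closed form of SnakeBeta (scalar alpha/beta for simplicity)."""
    x_prev = np.concatenate(([x[0]], x[:-1]))      # x_{t-1}; first sample is repeated
    s, d = x + x_prev, x - x_prev
    sinc = np.sinc(alpha * d / np.pi)              # sin(z)/z; np.sinc is sin(pi z)/(pi z)
    return 1.0 / (2 * beta) + s / 2 - np.cos(alpha * s) * sinc / (2 * beta)

fs = 16_000
x = np.sin(2 * np.pi * 3_000 * np.arange(fs) / fs)
y_naive = snakebeta(x)        # folded-back harmonics above the Nyquist frequency
y_adaa = adaa_snakebeta(x)    # the same nonlinearity with aliasing strongly attenuated
```

Note that as \(x_t \to x_{t-1}\), the sinc term tends to one and the expression reduces to the plain SnakeBeta output, so the formula remains numerically well behaved without a special case.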

Anti-Aliased Upsampling Layer

We replace the ConvTranspose layer with resampling (zero-interlacing followed by a low-pass filter) to avoid the "tonal artifact" and suppress the aliased components. To fill the empty high-frequency region of the upsampled signal and improve training stability, we apply a channel-expansion Conv1D layer with a high-pass filter to convert the zero-interlaced \(x_0\) into a noise-like deterministic prior, where \(x_0\) is the latent representation obtained from the first Conv1D layer in the decoder. The summed full-band signal is then fed to a channel-expansion Conv1D layer to produce the layer output.
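The sketch below illustrates only the resampling step, i.e., zero-interlacing followed by a fixed windowed-sinc low-pass filter in place of ConvTranspose; the kernel length and window are arbitrary choices, and the channel-expansion convolutions and the high-pass noise-prior branch described above are omitted.

```python
# Resampling-based upsampling: zero-interlacing + fixed windowed-sinc low-pass.
import torch
import torch.nn.functional as F

def lowpass_kernel(stride, taps_per_side=8):
    """Windowed-sinc low-pass with cutoff at the pre-upsampling Nyquist frequency."""
    n = torch.arange(-taps_per_side * stride, taps_per_side * stride + 1,
                     dtype=torch.float32)
    kernel = torch.sinc(n / stride) * torch.hann_window(len(n), periodic=False)
    return kernel / kernel.sum() * stride      # pass-band gain compensates the zeros

def resample_upsample(x, stride=2):
    """x: (batch, channels, time) -> (batch, channels, time * stride)."""
    b, c, t = x.shape
    up = torch.zeros(b, c, t * stride, dtype=x.dtype, device=x.device)
    up[..., ::stride] = x                      # zero-interlacing
    k = lowpass_kernel(stride).view(1, 1, -1).repeat(c, 1, 1).to(x.device)
    return F.conv1d(up, k, padding=k.shape[-1] // 2, groups=c)  # remove mirrored images

y = resample_upsample(torch.randn(1, 4, 100), stride=2)   # -> torch.Size([1, 4, 200])
```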

Pupu-Vocoder and Pupu-Codec

We introduce the Pupu-Vocoder and Pupu-Codec models based on our proposed anti-aliased activation and upsampling modules. The models are modified from DAC and BigVGAN, in which we replace the activation and upsampling modules in the decoder with our proposed ones, and utilize four different discriminators for better synthesis quality. The above figure illustrates the architecture and training scheme of the proposed Pupu-Codec model. Replacing the waveform input, encoder, and residual vector quantizer (RVQ) module with a mel-spectrogram as the input gives the Pupu-Vocoder model.

Analysis-Synthesis

We conduct analysis-synthesis experiments on speech, singing voice, music, and audio to illustrate the effectiveness of our proposed Pupu-Vocoder and Pupu-Codec models. We use Vocos as a reference system for time-frequency domain-based models, which generate a time-frequency representation and obtain the waveform via its inverse transform. For the neural vocoder, we use HiFi-GAN and BigVGAN as the baseline systems. For the neural codec, we use Encodec, DAC, and BigCodec as the baseline systems. Representative examples can be found below:

Neural Vocoder

[Comparison table: columns are GT, Vocos (ICLR 2024), HiFi-GAN (NeurIPS 2020), BigVGAN\(_\text{small}\) (ICLR 2023), BigVGAN\(_\text{large}\) (ICLR 2023), Pupu-Vocoder\(_\text{small}\), and Pupu-Vocoder\(_\text{large}\); rows cover Speech, Singing Voice, Music, and Audio, each under Academic and Industrial settings.]
Neural Codec

[Comparison table: columns are GT, Vocos (ICLR 2024), Encodec (TMLR 2023), DAC (NeurIPS 2023), BigCodec (2024), Pupu-Codec\(_\text{small}\), and Pupu-Codec\(_\text{large}\); rows cover Speech, Singing Voice, Music, and Audio, each under Academic and Industrial settings.]

Dynamic Bitrate Encoding

We conduct experiments on dynamic bitrate encoding to explore the effectiveness of our Pupu-Codec models.

[Comparison table: columns are GT, Vocos (ICLR 2024), Encodec (TMLR 2023), DAC (NeurIPS 2023), BigCodec (2024), Pupu-Codec\(_\text{small}\), and Pupu-Codec\(_\text{large}\); rows are bitrates of 8 kbps, 5.33 kbps, 2.67 kbps, and 1.78 kbps.]

Ablation Study

We conduct an ablation study to illustrate the effectiveness of our proposed anti-aliased activation and upsampling modules.


[Comparison: GT, Pupu-Vocoder\(_\text{small}\), w/o Oversampling, Ours \(\rightarrow\) LeakyReLU, Ours \(\rightarrow\) ELU, Ours \(\rightarrow\) SnakeBeta, w/o Deterministic Prior, Ours \(\rightarrow\) ConvTranspose, Ours \(\rightarrow\) Linear Interpolation, Ours \(\rightarrow\) Nearest Interpolation.]