An Investigation of Time-Frequency Representation Discriminators for High-Fidelity Vocoder


Yicheng Gu, Xueyao Zhang, Liumeng Xue, Haizhou Li, Zhizheng Wu

School of Data Science, The Chinese University of Hong Kong, Shenzhen

[Paper] [Code]

Abstract

Generative Adversarial Network (GAN)-based vocoders are superior in both inference speed and synthesis quality when reconstructing an audible waveform from an acoustic representation. This study focuses on improving the discriminator for GAN-based vocoders. Most existing Time-Frequency Representation (TFR)-based discriminators are rooted in the Short-Time Fourier Transform (STFT), which has a constant Time-Frequency (TF) resolution, linearly scaled center frequencies, and a fixed decomposition basis, making it incompatible with signals like singing voices that require dynamic attention over different frequency bands and time intervals. Motivated by this, we propose a Multi-Scale Sub-Band Constant-Q Transform (MS-SB-CQT) discriminator and a Multi-Scale Temporal-Compressed Continuous Wavelet Transform (MS-TC-CWT) discriminator. Both CQT and CWT provide a dynamic TF resolution across frequency bands; in comparison, CQT models pitch information better, while CWT models short-time transients better. Experiments conducted on both speech and singing voices confirm the effectiveness of our proposed discriminators. Moreover, the STFT-, CQT-, and CWT-based discriminators can be used jointly for better performance. The proposed discriminators can boost the synthesis quality of various state-of-the-art GAN-based vocoders, including HiFi-GAN, BigVGAN, and APNet.
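The contrast between the STFT's fixed resolution and the CQT's dynamic one can be seen directly in their center-frequency grids. The sketch below uses assumed parameter values (sample rate, FFT size, f_min, bins per octave) purely for illustration; it is not the configuration used in the paper.

```python
import numpy as np

# Assumed STFT setup: linearly spaced center frequencies.
sr, n_fft = 24000, 1024
stft_freqs = np.arange(n_fft // 2 + 1) * sr / n_fft

# Assumed CQT setup: geometrically spaced center frequencies
# (constant ratio between adjacent bins, i.e. constant Q-factor).
f_min, bins_per_octave, n_bins = 32.70, 12, 84
cqt_freqs = f_min * 2.0 ** (np.arange(n_bins) / bins_per_octave)

# STFT: constant spacing -> the same absolute frequency resolution
# everywhere on the axis.
assert np.allclose(np.diff(stft_freqs), sr / n_fft)

# CQT: constant ratio -> finer absolute resolution at low frequencies
# (where pitch detail lives) and coarser resolution at high frequencies.
ratios = cqt_freqs[1:] / cqt_freqs[:-1]
assert np.allclose(ratios, 2.0 ** (1.0 / bins_per_octave))
```

This geometric spacing aligns CQT bins with musical pitch (one octave per doubling), which is why a CQT-based discriminator can be more sensitive to F0 errors than an STFT-based one.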

Approach


MS-SB-CQT Discriminator

As illustrated below, the MS-SB-CQT Discriminator consists of identically structured sub-discriminators operating on CQT spectrograms with different time-frequency resolution distributions. In each sub-discriminator, the real and imaginary parts of the CQT spectrogram are sent individually to our proposed Sub-Band Processing (SBP) module to obtain their latent representations. These two representations are then concatenated and passed through convolutional layers to produce the outputs used to compute the loss.
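The data flow above can be sketched in a few lines of numpy. This is a minimal stand-in, not the paper's implementation: the shapes (7 octaves, 12 bins per octave, 50 frames) and the per-band weights are hypothetical, and a simple per-octave linear map replaces the learned SBP layers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical complex CQT spectrogram: 84 bins = 7 octaves x 12
# bins/octave, 50 time frames.
n_octaves, bins_per_octave, n_frames = 7, 12, 50
cqt = (rng.standard_normal((n_octaves * bins_per_octave, n_frames))
       + 1j * rng.standard_normal((n_octaves * bins_per_octave, n_frames)))

def sub_band_process(spec_part):
    """Sketch of SBP: process each octave sub-band with its own weights,
    then stack the per-band latents back along the frequency axis."""
    bands = spec_part.reshape(n_octaves, bins_per_octave, n_frames)
    latents = []
    for b in range(n_octaves):
        w = np.eye(bins_per_octave) * (1.0 + 0.1 * b)  # stand-in weights
        latents.append(w @ bands[b])
    return np.concatenate(latents, axis=0)

# Real and imaginary parts go through SBP individually, then are
# concatenated along a channel axis before the shared conv stack.
real_latent = sub_band_process(cqt.real)
imag_latent = sub_band_process(cqt.imag)
features = np.stack([real_latent, imag_latent], axis=0)  # (2, K, T)
assert features.shape == (2, 84, 50)
```

Processing each octave sub-band separately matters because the CQT is computed octave by octave with different effective hop sizes, so the sub-bands must be handled (and re-synchronized) individually before a shared convolutional stack can consume them.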


Architecture of the Sub-Discriminator in the MS-SB-CQT Discriminator. Operator ''C'' denotes concatenation. SBP denotes our proposed Sub-Band Processing module. It can be observed that the desynchronized CQT spectrogram (bottom right) has been synchronized (upper right) after SBP.


MS-TC-CWT Discriminator

As illustrated below, the MS-TC-CWT Discriminator consists of identically structured sub-discriminators operating on CWT spectrograms with different wavelets and different numbers of wavelet scales. In each sub-discriminator, the CWT spectrogram is sent through a series of Conv2d layers that operate only on the time dimension to obtain a time-compressed representation. The compressed representation is then passed through convolutional layers to produce the outputs used to compute the loss.
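Because the CWT is computed per sample, its time axis is far longer than a framed spectrogram's, which is why compression along time is needed before the discriminator's conv stack. The sketch below is a hedged stand-in for the Temporal Compressor: the shapes are hypothetical, and overlapped window averaging replaces the learned temporal-overlapped convolutions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical CWT magnitude spectrogram: K frequency bins (scales),
# T time steps (one per audio sample, so T is large).
K, T = 64, 1024
cwt = np.abs(rng.standard_normal((K, T)))

def temporal_compress(spec, kernel=8, stride=4):
    """Stand-in for the Temporal Compressor: overlapped windows along
    the time axis only (frequency axis untouched), averaged per window."""
    n_out = (spec.shape[1] - kernel) // stride + 1
    idx = np.arange(kernel)[None, :] + stride * np.arange(n_out)[:, None]
    return spec[:, idx].mean(axis=2)  # shape (K, n_out)

compressed = temporal_compress(cwt)
assert compressed.shape == (64, (1024 - 8) // 4 + 1)

# Compression shortens the time axis while keeping the relative energy
# distribution over frequency bins roughly intact, mirroring the
# bottom-right vs. upper-right spectrograms in the figure below.
orig_energy = cwt.sum(axis=1) / cwt.sum()
comp_energy = compressed.sum(axis=1) / compressed.sum()
assert np.abs(orig_energy - comp_energy).max() < 0.05
```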


Architecture of the Sub-Discriminator in the MS-TC-CWT Discriminator. Operator ''C'' denotes concatenation. TC denotes our proposed Temporal Compressor module. Comp is a series of temporal-overlapped convolution layers. K is the total number of frequency bins. It can be observed that the CWT spectrogram (bottom right) can be compressed while maintaining the overall energy distribution over different frequency bins (upper right).

Effectiveness of the Proposed Discriminators and Using Them Jointly


Analysis-synthesis results of different discriminators when integrated into HiFi-GAN. The best and second-best results of every column (except those from Ground Truth) in each domain (speech and singing voice) are in bold and italic, respectively. "S", "C", and "W" represent the MS-STFT, MS-SB-CQT, and MS-TC-CWT Discriminators, respectively. The MOS scores are reported with 95% Confidence Intervals (CI).

As illustrated above, regarding singing voice, we can observe that:
(1) HiFi-GAN (+S), HiFi-GAN (+C), and HiFi-GAN (+W) all outperform HiFi-GAN both subjectively and objectively, confirming the importance of the extra adversarial losses in the frequency domain;
(2) Both HiFi-GAN (+C) and HiFi-GAN (+W) outperform HiFi-GAN (+S) objectively and subjectively, illustrating the effectiveness of TFRs with dynamic TF resolution;
(3) HiFi-GAN (+C) outperforms HiFi-GAN (+W) objectively, especially on F0-related metrics, showing the effectiveness of the pitch-aware center-frequency distribution, while HiFi-GAN (+W) outperforms HiFi-GAN (+C) subjectively, showing the effectiveness of the diverse energy-centered wavelet bases;
(4) HiFi-GAN (+S+C+W) outperforms all single-discriminator variants both objectively and subjectively on seen singers, and achieves better objective results with a similar subjective score on unseen singers, confirming the effectiveness of joint training.
A similar conclusion can be drawn from the unseen-speaker evaluation on speech data.

Here, we show some representative samples to reveal the effectiveness of our proposed methods and the complementary roles of the discriminators based on different TFRs.

Representative Samples

Seen Singers

Ground Truth HiFi-GAN HiFi-GAN (+S) HiFi-GAN (+C) HiFi-GAN (+W) HiFi-GAN (+S+C+W)
#1
#2
#3


Unseen Singers

Ground Truth HiFi-GAN HiFi-GAN (+S) HiFi-GAN (+C) HiFi-GAN (+W) HiFi-GAN (+S+C+W)
#1
#2
#3


Seen Speakers

Ground Truth HiFi-GAN HiFi-GAN (+S) HiFi-GAN (+C) HiFi-GAN (+W) HiFi-GAN (+S+C+W)
#1
#2
#3


Unseen Speakers

Ground Truth HiFi-GAN HiFi-GAN (+S) HiFi-GAN (+C) HiFi-GAN (+W) HiFi-GAN (+S+C+W)
#1
#2
#3


Effectiveness of Proposed Training Strategy


Analysis-synthesis results of our proposed discriminators when applied to NSF-HiFiGAN, BigVGAN, and APNet on singing voice datasets. The improvements are shown in bold. "S", "C", and "W" represent the MS-STFT, MS-SB-CQT, and MS-TC-CWT Discriminators, respectively.

Here, we show some representative samples to reveal the effectiveness of joint training with our proposed discriminators on different generators.

Representative Samples

NSF-HiFiGAN

Seen Singers

Ground Truth NSF-HiFiGAN NSF-HiFiGAN (+S+C+W)
#1
#2
#3


Unseen Singers

Ground Truth NSF-HiFiGAN NSF-HiFiGAN (+S+C+W)
#1
#2
#3


BigVGAN

Seen Singers

Ground Truth BigVGAN BigVGAN (+S+C+W)
#1
#2
#3


Unseen Singers

Ground Truth BigVGAN BigVGAN (+S+C+W)
#1
#2
#3


APNet

Seen Singers

Ground Truth APNet APNet (+S+C+W)
#1
#2
#3


Unseen Singers

Ground Truth APNet APNet (+S+C+W)
#1
#2
#3


Ablation Studies


Analysis-synthesis results of HiFi-GAN enhanced by discriminators with varied modules and processing techniques. The improvements are shown in bold. "C" and "W" represent the MS-SB-CQT and MS-TC-CWT Discriminators, respectively.

Here, we show some representative cases to reveal the effectiveness of our proposed methods. We use HiFi-GAN (w/o) to denote HiFi-GAN trained without our proposed methods.

Ground Truth HiFi-GAN HiFi-GAN (w/o) HiFi-GAN (+W)
#1
#2
#3
Ground Truth HiFi-GAN HiFi-GAN (w/o) HiFi-GAN (+C)
#1
#2
#3