Multi-Scale Sub-Band Constant-Q Transform Discriminator for High-Fidelity Vocoder
Yicheng Gu, Xueyao Zhang, Liumeng Xue, Zhizheng Wu
School of Data Science, The Chinese University of Hong Kong, Shenzhen
Overview
Generative Adversarial Network (GAN) based vocoders are superior in inference speed and synthesis quality when reconstructing an audible waveform from an acoustic representation. This study focuses on improving the discriminator to promote GAN-based vocoders. Most existing time-frequency-representation-based discriminators are rooted in the Short-Time Fourier Transform (STFT), whose time-frequency resolution is fixed across the spectrogram, making it incompatible with signals like singing voices that require flexible attention across frequency bands. Motivated by this, our study utilizes the Constant-Q Transform (CQT), which offers dynamic resolution across frequencies and thus better models pitch variation and harmonic tracking. Specifically, we propose a Multi-Scale Sub-Band CQT (MS-SB-CQT) Discriminator, which operates on the CQT spectrogram at multiple scales and performs sub-band processing according to different octaves. Experiments conducted on both speech and singing voices confirm the effectiveness of our proposed method. Moreover, we verified that the CQT-based and STFT-based discriminators are complementary under joint training.
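The "dynamic resolution" property of the CQT can be made concrete with a small numerical sketch. The snippet below is illustrative only (not the paper's implementation); the parameter values (`fs`, `fmin`, `bins_per_octave`) are assumptions chosen for the example.

```python
import numpy as np

# Sketch: CQT bins are geometrically spaced, so the quality factor
# (center frequency / bandwidth) is constant while the analysis window
# length varies per bin. All parameter values here are assumptions.
fs = 44100            # sample rate (Hz)
fmin = 32.70          # lowest analyzed frequency (C1), an assumption
bins_per_octave = 12
n_bins = 84           # 7 octaves

k = np.arange(n_bins)
f_k = fmin * 2.0 ** (k / bins_per_octave)          # geometric bin centers
Q = 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)   # constant quality factor
N_k = np.round(Q * fs / f_k).astype(int)           # per-bin window length

# Long windows (fine frequency resolution) at low frequencies, short
# windows (fine time resolution) at high frequencies.
print(N_k[0], N_k[-1])   # longest vs. shortest analysis window
```

This frequency-dependent window length is exactly what lets a CQT-based discriminator attend flexibly to different frequency bands, unlike the STFT's single window size.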
Architecture of the proposed Multi-Scale Sub-Band Constant-Q Transform (MS-SB-CQT)
Discriminator, which can be integrated with any GAN-based vocoder. Operator "C" denotes
concatenation. SBP denotes our proposed Sub-Band Processing module.
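The split-process-concatenate pattern in the figure can be sketched structurally as follows. This is a minimal sketch, not the paper's network: the per-band "processing" below is a placeholder normalization standing in for the learned per-octave module, and all shapes are assumed for the example.

```python
import numpy as np

# Structural sketch of the Sub-Band Processing (SBP) idea: split the CQT
# spectrogram by octave, process each octave band independently, then
# concatenate ("C") the band features. Shapes are illustrative assumptions.
bins_per_octave, n_octaves, n_frames = 12, 7, 100
cqt = np.abs(np.random.randn(bins_per_octave * n_octaves, n_frames))

def process_band(band):
    # Hypothetical stand-in for the learned per-octave transform:
    # here, simple per-band mean/variance normalization.
    return (band - band.mean()) / (band.std() + 1e-8)

sub_bands = np.split(cqt, n_octaves, axis=0)        # one chunk per octave
features = [process_band(b) for b in sub_bands]     # independent processing
fused = np.concatenate(features, axis=0)            # "C": concatenation

print(fused.shape)  # same layout as the input spectrogram: (84, 100)
```

Processing each octave separately lets the discriminator treat bands with different time-frequency characteristics on their own terms before fusing them.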
Effectiveness of MS-SB-CQT Discriminator
Table 1: Results of different discriminators when integrated into
HiFi-GAN. The best and second-best results in every column (except Ground Truth)
for each domain (speech and singing voice) are in bold and italic, respectively.
"S" and "C" denote the MS-STFT and MS-SB-CQT Discriminators, respectively. MOS scores are
reported with 95% Confidence Intervals (CI).
As illustrated in Table 1, for singing voice we can observe that:
(1) both HiFi-GAN (+C) and HiFi-GAN (+S) outperform HiFi-GAN, showing the importance of time-frequency-representation-based discriminators;
(2) HiFi-GAN (+C) outperforms HiFi-GAN (+S) with a significant boost in MOS, showing the superiority of our proposed MS-SB-CQT Discriminator;
(3) HiFi-GAN (+S+C) performs best both objectively and subjectively, which shows that the different discriminators carry complementary information, confirming the effectiveness of joint training.
A similar conclusion can be drawn for the unseen speaker evaluation of speech data.
Here, we show some representative samples to reveal the effectiveness of our MS-SB-CQT Discriminator and the complementary roles of the STFT-based and CQT-based Discriminators. A case study is also attached at the bottom of this section to explore the effectiveness of joint training.
Representative Samples
Seen Singers
| | Ground Truth | HiFi-GAN | HiFi-GAN (+S) | HiFi-GAN (+C) | HiFi-GAN (+S+C) |
|---|---|---|---|---|---|
| #1 | (audio) | (audio) | (audio) | (audio) | (audio) |
| #2 | (audio) | (audio) | (audio) | (audio) | (audio) |
| #3 | (audio) | (audio) | (audio) | (audio) | (audio) |
Unseen Singers
| | Ground Truth | HiFi-GAN | HiFi-GAN (+S) | HiFi-GAN (+C) | HiFi-GAN (+S+C) |
|---|---|---|---|---|---|
| #1 | (audio) | (audio) | (audio) | (audio) | (audio) |
| #2 | (audio) | (audio) | (audio) | (audio) | (audio) |
| #3 | (audio) | (audio) | (audio) | (audio) | (audio) |
Seen Speakers
| | Ground Truth | HiFi-GAN | HiFi-GAN (+S) | HiFi-GAN (+C) | HiFi-GAN (+S+C) |
|---|---|---|---|---|---|
| #1 | (audio) | (audio) | (audio) | (audio) | (audio) |
| #2 | (audio) | (audio) | (audio) | (audio) | (audio) |
| #3 | (audio) | (audio) | (audio) | (audio) | (audio) |
Unseen Speakers
| | Ground Truth | HiFi-GAN | HiFi-GAN (+S) | HiFi-GAN (+C) | HiFi-GAN (+S+C) |
|---|---|---|---|---|---|
| #1 | (audio) | (audio) | (audio) | (audio) | (audio) |
| #2 | (audio) | (audio) | (audio) | (audio) | (audio) |
| #3 | (audio) | (audio) | (audio) | (audio) | (audio) |
Case Study
We visualized the same singing voice utterance synthesized by generators trained with different discriminator combinations. It can be observed that:
(1) With only the time-domain-based discriminators, it is hard for the generator to model the high-frequency parts. Adding a time-frequency-representation-based discriminator, either the MS-SB-CQT Discriminator (HiFi-GAN (+C)) or the MS-STFT Discriminator (HiFi-GAN (+S)), significantly boosts the quality of high-frequency reconstruction.
(2) STFT has a fixed time-frequency resolution across all frequency bands. In the low-frequency parts, its lack of frequency resolution brings frequency distortions, resulting in phonemes with artifacts. In the high-frequency parts, its lack of time resolution limits it from reconstructing harmonic components. CQT has a dynamic resolution trade-off, thus alleviating these artifacts. However, its lack of time resolution in the low-frequency parts and the lack of frequency resolution in the high-frequency parts still bring problems like glitches and hissing noises.
(3) Combining STFT-based and CQT-based Discriminators integrates their strengths, thus attaining a significantly better synthesis quality.
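The resolution trade-off described in observation (2) can be checked with a few numbers. This is a back-of-the-envelope illustration; the parameter values (`fs`, `n_fft`, `bins_per_octave`) are assumptions, not the paper's configuration.

```python
import numpy as np

# Numerical illustration: STFT resolution is fixed by one window size,
# while CQT resolution varies with frequency. Parameters are assumptions.
fs, n_fft, bpo = 44100, 1024, 12
Q = 1.0 / (2.0 ** (1.0 / bpo) - 1.0)   # CQT quality factor

stft_freq_res = fs / n_fft        # ~43 Hz for every STFT bin
stft_time_res = n_fft / fs        # ~23 ms for every STFT frame

for f in (65.0, 8000.0):          # a low and a high analysis frequency
    cqt_bandwidth = f / Q         # Hz: narrows at low frequencies
    cqt_window = Q / f            # s:  shortens at high frequencies
    print(f, round(cqt_bandwidth, 1), round(cqt_window * 1000, 1))

# At 65 Hz the CQT bandwidth (~3.9 Hz) is far finer than the STFT's fixed
# ~43 Hz bin, matching the low-frequency distortion observed for STFT.
# At 8 kHz the CQT window (~2.1 ms) is much shorter than the STFT's fixed
# ~23 ms frame, matching the blurred high-frequency harmonics for STFT.
```

The same arithmetic also shows the CQT's weak spots: its ~259 ms window at 65 Hz (poor time resolution) and its ~700 Hz bandwidth at 8 kHz (poor frequency resolution), consistent with the glitches and hissing noises noted above.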
| Ground Truth | HiFi-GAN | HiFi-GAN (+S) | HiFi-GAN (+C) | HiFi-GAN (+S+C) |
|---|---|---|---|---|
| (spectrogram) | (spectrogram) | (spectrogram) | (spectrogram) | (spectrogram) |

Lyrics of the utterance: 能不能给我一首歌的时间 (néng bù néng gěi wǒ yī shǒu gē de shí jiān; "Can you give me the length of one song?")
Generalization Ability of MS-SB-CQT Discriminator
Table 2: Results of our proposed MS-SB-CQT Discriminator when integrated into
MelGAN and NSF-HiFiGAN on singing voice datasets. The improvements are shown in
bold. "S" and "C" denote the MS-STFT and MS-SB-CQT Discriminators, respectively. All
improvements in MCD, PESQ, and Preference are significant (p-value < 0.01).
As illustrated in Table 2, the performance of MelGAN and NSF-HiFiGAN improves significantly under joint training with the MS-SB-CQT and MS-STFT Discriminators, with both objective metrics and subjective preference tests confirming the effectiveness.
Here, we show some representative samples to reveal the effectiveness of joint training with our proposed MS-SB-CQT and the existing MS-STFT Discriminators on MelGAN and NSF-HiFiGAN. Case studies are attached at the bottom of this section to explore the detailed improvements.
Representative Samples
MelGAN
Seen Singers
| | Ground Truth | MelGAN | MelGAN (+S+C) |
|---|---|---|---|
| #1 | (audio) | (audio) | (audio) |
| #2 | (audio) | (audio) | (audio) |
| #3 | (audio) | (audio) | (audio) |
Unseen Singers
| | Ground Truth | MelGAN | MelGAN (+S+C) |
|---|---|---|---|
| #1 | (audio) | (audio) | (audio) |
| #2 | (audio) | (audio) | (audio) |
| #3 | (audio) | (audio) | (audio) |
NSF-HiFiGAN
Seen Singers
| | Ground Truth | NSF-HiFiGAN | NSF-HiFiGAN (+S+C) |
|---|---|---|---|
| #1 | (audio) | (audio) | (audio) |
| #2 | (audio) | (audio) | (audio) |
| #3 | (audio) | (audio) | (audio) |
Unseen Singers
| | Ground Truth | NSF-HiFiGAN | NSF-HiFiGAN (+S+C) |
|---|---|---|---|
| #1 | (audio) | (audio) | (audio) |
| #2 | (audio) | (audio) | (audio) |
| #3 | (audio) | (audio) | (audio) |
Case Study
MelGAN
MelGAN tends to overfit the low-frequency band and neglect the mid- and high-frequency components, resulting in audible metallic noise. After adding the MS-STFT and MS-SB-CQT Discriminators, it models the global structure of the spectrogram better, remarkably improving synthesis quality.
| Ground Truth | MelGAN | MelGAN (+S+C) |
|---|---|---|
| (spectrogram) | (spectrogram) | (spectrogram) |
NSF-HiFiGAN
NSF-HiFiGAN can synthesize high-fidelity singing voices. However, it still lacks frequency details. Adding the MS-STFT and MS-SB-CQT Discriminators tackles that problem, making the synthesized samples closer to the ground truth.
| Ground Truth | NSF-HiFiGAN | NSF-HiFiGAN (+S+C) |
|---|---|---|
| (spectrogram) | (spectrogram) | (spectrogram) |
Necessity of Sub-Band Processing
Table 3: Results of HiFi-GAN enhanced by different CQT-based
discriminators. The MS-CQT Discriminator is a variant of our proposed
MS-SB-CQT Discriminator with the Sub-Band Processing module removed.
As illustrated in Table 3, HiFi-GAN can be enhanced successfully by our proposed MS-SB-CQT Discriminator. However, applying the raw CQT directly in the discriminator (MS-CQT) can even harm the quality of HiFi-GAN. We speculate this is because the temporal desynchronization between octaves of the raw CQT burdens model learning. Therefore, it is necessary to adopt the proposed SBP module when designing a CQT-based discriminator.
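The desynchronization issue arises because efficient CQT implementations compute lower octaves on progressively downsampled audio, so each octave's frames lie on a different time grid. The sketch below illustrates one simple way to restore a common frame grid before stacking octaves; it is a toy illustration (frame counts and the interpolation-based alignment are assumptions, not the paper's SBP implementation).

```python
import numpy as np

# Toy illustration of inter-octave desynchronization: each lower octave
# has half the frame rate of the one above it, so naive stacking would
# misalign frames in time. Numbers here are illustrative assumptions.
bins_per_octave, n_octaves, base_frames = 12, 4, 96

octaves = [np.random.randn(bins_per_octave, base_frames // 2**i)
           for i in range(n_octaves)]   # frame counts: 96, 48, 24, 12

def align(band, n_frames):
    # Resample a band onto a common frame grid by linear interpolation
    # along time, restoring inter-octave synchronization.
    t_src = np.linspace(0.0, 1.0, band.shape[1])
    t_dst = np.linspace(0.0, 1.0, n_frames)
    return np.stack([np.interp(t_dst, t_src, row) for row in band])

aligned = np.concatenate([align(b, base_frames) for b in octaves], axis=0)
print(aligned.shape)  # all octaves now share one frame grid: (48, 96)
```

Without such per-octave handling, a discriminator sees frames whose time stamps differ across frequency bands, which is the burden on learning speculated above.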
Here, we show some representative cases to reveal the effectiveness of our proposed SBP module.
| | Ground Truth | HiFi-GAN | HiFi-GAN (+MS-CQT) | HiFi-GAN (+MS-SB-CQT) |
|---|---|---|---|---|
| #1 | (audio) | (audio) | (audio) | (audio) |
| #2 | (audio) | (audio) | (audio) | (audio) |
| #3 | (audio) | (audio) | (audio) | (audio) |