This paper presents a novel neural vocoder named APNet, which reconstructs
speech waveforms from acoustic features by predicting amplitude and phase
spectra directly. The APNet vocoder is composed of an amplitude spectrum
predictor (ASP) and a phase spectrum predictor (PSP). The ASP is a residual
convolution network which predicts frame-level log amplitude spectra from
acoustic features. The PSP also adopts a residual convolution network with
acoustic features as input; the output of this network is passed through two
parallel linear convolution layers, and the two resulting outputs are combined
by a phase calculation formula to estimate frame-level phase spectra. The
outputs of the ASP and PSP are finally combined to reconstruct speech waveforms
by inverse short-time Fourier transform (ISTFT). All operations of the ASP and PSP are
performed at the frame level. We train the ASP and PSP jointly and define
multilevel loss functions based on amplitude mean square error, phase
anti-wrapping error, short-time spectral inconsistency error, and time-domain
reconstruction error. Experimental results show that our proposed APNet vocoder
achieves an approximately 8x faster inference speed than HiFi-GAN v1 on a CPU
due to its all-frame-level operations, while its synthesized speech quality is
comparable to that of HiFi-GAN v1. The synthesized speech quality of the APNet vocoder
is also better than that of several equally efficient models. Ablation
experiments further confirm that the proposed parallel phase estimation
architecture is essential for phase modeling and that the proposed loss
functions help improve the synthesized speech quality.
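
The reconstruction pipeline described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the two parallel-branch outputs and the log amplitude spectra are stand-ins for what the trained PSP and ASP would produce, the shapes and STFT parameters are assumed, and the phase calculation formula is represented here by the two-argument arctangent, which maps a (pseudo real, pseudo imaginary) pair to a wrapped phase.

```python
import numpy as np
from scipy.signal import istft

# Assumed shapes: T frames, F = nperseg // 2 + 1 frequency bins.
T, F = 100, 513
rng = np.random.default_rng(0)

# Stand-ins for the outputs of the PSP's two parallel linear convolution
# layers: a pseudo real part R and a pseudo imaginary part I per bin.
R = rng.standard_normal((T, F))
I = rng.standard_normal((T, F))

# Phase calculation: the two-argument arctangent yields a wrapped
# frame-level phase spectrum in (-pi, pi].
phase = np.arctan2(I, R)

# Stand-in for the ASP output: frame-level log amplitude spectra.
log_amp = 0.1 * rng.standard_normal((T, F))

# Combine amplitude and phase into a complex spectrogram, then
# reconstruct the waveform by ISTFT (scipy expects shape (freq, time)).
spec = np.exp(log_amp) * np.exp(1j * phase)
_, waveform = istft(spec.T, nperseg=1024, noverlap=768)
```

Because both predictors emit one vector per frame, every operation before the final ISTFT stays at the frame level, which is the source of the CPU speed-up reported above.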
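
To make the phase anti-wrapping error concrete, the sketch below shows one common anti-wrapping formulation (an assumption, not necessarily the exact function used in the paper): the raw phase difference is mapped back into (-pi, pi] before measuring the error, so a difference of 2*pi - 0.1 radians is correctly penalized as 0.1 rather than as nearly a full cycle.

```python
import numpy as np

def anti_wrap(x):
    """Map a phase difference to its principal value's magnitude in [0, pi]."""
    return np.abs(x - 2 * np.pi * np.round(x / (2 * np.pi)))

# Example phase differences (radians): small, half cycle, and values
# just short of a full cycle in either direction.
diff = np.array([0.1, np.pi, 2 * np.pi - 0.1, -2 * np.pi + 0.2])
err = anti_wrap(diff)  # -> [0.1, pi, 0.1, 0.2]
```

Without such a wrapping step, a plain mean square error on phase would punish predictions that are off by a multiple of 2*pi even though they describe the identical waveform.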