Autoregressive models based on Transformers have become the prevailing
approach for generating music compositions that exhibit comprehensive musical
structure. These models are typically trained by minimizing the negative
log-likelihood (NLL) of the observed sequence in an autoregressive manner.
However, when generating long sequences, the quality of samples from these
models tends to deteriorate significantly due to exposure bias. To address this
issue, we leverage classifiers trained to differentiate between real and
sampled sequences to identify these failures. This observation motivates our
exploration of adversarial losses as a complement to the NLL objective. We
employ a pre-trained SpanBERT model as the discriminator in the Generative
Adversarial Network (GAN) framework, which enhances training stability in our
experiments. To optimize discrete sequences within the GAN framework, we
utilize the Gumbel-Softmax trick to obtain a differentiable approximation of
the sampling process. Additionally, we partition the sequences into smaller
chunks to ensure that memory constraints are met. Through human evaluations and
the introduction of a novel discriminative metric, we demonstrate that our
approach outperforms a baseline model trained solely on likelihood
maximization.
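The Gumbel-Softmax relaxation mentioned above can be illustrated with a minimal sketch. This NumPy version shows only the forward computation (the function name and the small stabilizing constants are our own choices); in actual training one would use an autograd implementation, such as PyTorch's `torch.nn.functional.gumbel_softmax`, so that gradients can flow through the soft sample.

```python
import numpy as np

def gumbel_softmax_sample(logits, tau=1.0, rng=None):
    """Temperature-controlled, differentiable relaxation of categorical sampling.

    Adds Gumbel(0, 1) noise to the logits and applies a softmax scaled by
    the temperature tau. As tau -> 0 the output approaches a one-hot sample;
    larger tau yields a smoother distribution with more useful gradients.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Gumbel(0, 1) noise via inverse transform sampling; 1e-12 guards log(0).
    u = rng.uniform(size=logits.shape)
    g = -np.log(-np.log(u + 1e-12) + 1e-12)
    z = (logits + g) / tau
    z -= z.max(axis=-1, keepdims=True)  # numerical stability for exp
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)
```

At low temperature the relaxed sample is nearly one-hot, which is what lets the discriminator score generator outputs while gradients still reach the generator's logits.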