Variational Autoencoders (VAEs) have proven to be effective models for
producing latent representations of cognitive and semantic value. We assess the
degree to which VAEs trained on a prototypical tonal music corpus of 371 Bach's
chorales define latent spaces representative of the circle of fifths and the
hierarchical relation of each key component pitch as drawn in music cognition.
Specifically, we compare the latent spaces of VAEs trained on different corpus
encodings -- piano roll, MIDI, ABC, Tonnetz, DFT of pitch, and pitch class
distributions -- in their ability to provide a pitch space for key relations
that aligns with cognitive distances.
distances. We evaluate the model performance of these encodings using objective
metrics to capture accuracy, mean square error (MSE), KL-divergence, and
computational cost. The ABC encoding performs the best in reconstructing the
original data, while the Pitch DFT seems to capture more information from the
latent space. Furthermore, an objective evaluation over the 12 major or minor
transpositions of each piece is adopted to quantify the alignment of 1) intra-
and inter-segment distances per key and 2) key distances with cognitive pitch
spaces. Our results show that Pitch DFT VAE latent spaces align best with
cognitive spaces and provide a common-tone space in which overlapping objects
within a key form fuzzy clusters that impose a well-defined order of
structural significance, or stability -- i.e., a tonal hierarchy. The tonal
hierarchies of different keys can then be used to measure key distances and the
relationships of their in-key components at multiple hierarchical levels (e.g.,
notes
and chords). The implementation of our VAE and the encoding framework is made
available online.
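As a minimal sketch of the pitch-class DFT representation named above (not the authors' implementation): the DFT of a 12-bin pitch-class distribution yields coefficients whose phases encode key relations, and, by the DFT shift theorem, coefficient k = 5 rotates by 2π/12 per perfect-fifth transposition, so keys adjacent on the circle of fifths sit at adjacent phases. The example pitch-class sets below are illustrative assumptions.

```python
import numpy as np

# Illustrative 12-bin pitch-class distribution: C major scale, equal weights.
c_major = np.zeros(12)
c_major[[0, 2, 4, 5, 7, 9, 11]] = 1 / 7

# Same scale transposed up a perfect fifth (7 semitones): G major.
g_major = np.zeros(12)
g_major[[(n + 7) % 12 for n in [0, 2, 4, 5, 7, 9, 11]]] = 1 / 7

# DFT of each distribution; coefficient k=5 tracks the circle of fifths.
phase_c = np.angle(np.fft.fft(c_major)[5])
phase_g = np.angle(np.fft.fft(g_major)[5])

# By the shift theorem, transposing by a fifth rotates coefficient 5
# by exactly 2*pi/12, placing neighboring keys at neighboring phases.
delta = (phase_g - phase_c) % (2 * np.pi)
print(delta, np.pi / 6)  # the two values coincide
```

Distances between keys can then be read off as phase (or complex) distances between their DFT coefficients, which is the kind of common-tone geometry the latent-space comparison targets.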